* [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely v2

From: Mel Gorman @ 2017-03-09  7:56 UTC
To: Andrew Morton
Cc: Shantanu Goel, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman

Changelog since v1
o Rebase to 4.11-rc1
o Add small clarifying comment based on review

The series is unusual in that the first patch fixes one problem and
introduces a number of other issues that are noted in the changelog.
Patch 2 makes a minor modification that is worth considering on its own
but leaves the kernel in a state where it behaves badly. It's not until
patch 3 that there is an improvement against baseline.

This was mostly motivated by examining Chris Mason's "simoop" benchmark
which puts the VM under pressure similar to HADOOP. It has been reported
that the benchmark has regressed severely over the last several releases.
While I cannot reproduce all the same problems Chris experienced due to
hardware limitations, there were a number of problems on a 2-socket
machine with a single disk.

simoop latencies
                                         4.11.0-rc1             4.11.0-rc1
                                            vanilla           keepawake-v2
Amean    p50-Read             21670074.18 (  0.00%) 22668332.52 ( -4.61%)
Amean    p95-Read             25456267.64 (  0.00%) 26738688.00 ( -5.04%)
Amean    p99-Read             29369064.73 (  0.00%) 30991404.52 ( -5.52%)
Amean    p50-Write                1390.30 (  0.00%)      924.91 ( 33.47%)
Amean    p95-Write              412901.57 (  0.00%)     1362.62 ( 99.67%)
Amean    p99-Write             6668722.09 (  0.00%)    16854.04 ( 99.75%)
Amean    p50-Allocation          78714.31 (  0.00%)    74729.74 (  5.06%)
Amean    p95-Allocation         175533.51 (  0.00%)   101609.74 ( 42.11%)
Amean    p99-Allocation         247003.02 (  0.00%)   125765.57 ( 49.08%)

These are latencies. Read/write are threads reading fixed-size random
blocks from a simulated database. The allocation latency is mmaping and
faulting regions of memory. The p50, p95 and p99 figures report the worst
latency experienced by 50%, 95% and 99% of the samples respectively. For
example, the report indicates that while the test was running, 99% of
writes completed 99.75% faster. It's worth noting that on a UMA machine
no difference in performance with simoop was observed, so mileage will
vary.

It's noted that there is a slight impact to read latencies but it's
mostly due to IO scheduler decisions and is offset by the large reduction
in other latencies.

 mm/memory_hotplug.c |   2 +-
 mm/vmscan.c         | 136 ++++++++++++++++++++++++++++++----------------------
 2 files changed, 79 insertions(+), 59 deletions(-)

-- 
2.11.0
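[To make the percentile reporting above concrete, here is a toy,
self-contained sketch; this is not simoop or mmtests code and the sample
values are invented, it only illustrates how a pXX figure is derived from
raw latency samples.]

	#include <stdio.h>
	#include <stdlib.h>

	static int cmp_u64(const void *a, const void *b)
	{
		unsigned long long x = *(const unsigned long long *)a;
		unsigned long long y = *(const unsigned long long *)b;

		return (x > y) - (x < y);
	}

	/* Latency that `pct` percent of the sorted samples fall at or below. */
	static unsigned long long percentile(unsigned long long *s, size_t n, int pct)
	{
		size_t idx = (n * pct) / 100;

		qsort(s, n, sizeof(*s), cmp_u64);
		return s[idx >= n ? n - 1 : idx];
	}

	int main(void)
	{
		unsigned long long lat[] = { 120, 95, 4000, 110, 130, 105, 90, 115, 100, 125 };
		size_t n = sizeof(lat) / sizeof(lat[0]);

		printf("p50=%llu p95=%llu p99=%llu\n",
		       percentile(lat, n, 50), percentile(lat, n, 95),
		       percentile(lat, n, 99));
		return 0;
	}

Note how one outlier sample dominates the p95/p99 figures while barely
moving p50; this is why the tables track all three.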
* [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep

From: Mel Gorman @ 2017-03-09  7:56 UTC
To: Andrew Morton
Cc: Shantanu Goel, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman

From: Shantanu Goel <sgoel01@yahoo.com>

The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
since the latter will return as soon as any one of the zones in the
classzone is above the watermark. This is especially important for
higher-order allocations since balance_pgdat will typically reset the
order to zero, relying on compaction to create the higher-order pages.
Without this patch, prepare_kswapd_sleep fails to wake up kcompactd since
the zone balance check fails.

It was first reported against 4.9.7 that kswapd is failing to wake up
kcompactd due to a mismatch in the zone balance check between
balance_pgdat() and prepare_kswapd_sleep(). balance_pgdat() returns as
soon as a single zone satisfies the allocation but prepare_kswapd_sleep()
requires all zones to do the same. This causes prepare_kswapd_sleep() to
never succeed except in the order == 0 case and consequently,
wakeup_kcompactd() is never called. For the machine that originally
motivated this patch, the state of compaction from /proc/vmstat looked
this way after a day and a half of uptime:

compact_migrate_scanned 240496
compact_free_scanned 76238632
compact_isolated 123472
compact_stall 1791
compact_fail 29
compact_success 1762
compact_daemon_wake 0

After applying the patch and about 10 hours of uptime the state looks
like this:

compact_migrate_scanned 59927299
compact_free_scanned 2021075136
compact_isolated 640926
compact_stall 4
compact_fail 2
compact_success 2
compact_daemon_wake 5160

Further notes from Mel that motivated him to pick this patch up and
resend it:

It was observed for the simoop workload (pressures the VM similarly to
HADOOP) that kswapd was failing to keep ahead of direct reclaim. The
investigation noted that there was a need to rationalise kswapd decisions
to reclaim with kswapd decisions to sleep. With this patch on a 2-socket
box, there was a 49% reduction in direct reclaim scanning.

However, the impact otherwise is extremely negative. Kswapd reclaim
efficiency dropped from 98% to 76%. simoop has three latency-related
metrics for read, write and allocation (an anonymous mmap and fault).

                                         4.11.0-rc1             4.11.0-rc1
                                            vanilla            fixcheck-v2
Amean    p50-Read             21670074.18 (  0.00%) 20464344.18 (  5.56%)
Amean    p95-Read             25456267.64 (  0.00%) 25721423.64 ( -1.04%)
Amean    p99-Read             29369064.73 (  0.00%) 30174230.76 ( -2.74%)
Amean    p50-Write                1390.30 (  0.00%)     1395.28 ( -0.36%)
Amean    p95-Write              412901.57 (  0.00%)    37737.74 ( 90.86%)
Amean    p99-Write             6668722.09 (  0.00%)   666489.04 ( 90.01%)
Amean    p50-Allocation          78714.31 (  0.00%)    86286.22 ( -9.62%)
Amean    p95-Allocation         175533.51 (  0.00%)   351812.27 (-100.42%)
Amean    p99-Allocation         247003.02 (  0.00%)  6291171.56 (-2447.00%)

Of greater concern is that the patch causes swapping, and page writes
from kswapd context rose from 0 pages to 4189753 pages during the hour
the workload ran for.
By and large, the patch has very bad behaviour, but this is easily missed
as the impact on a UMA machine is negligible.

This patch is included with the data in case a bisection leads to this
area. This patch is also a pre-requisite for the rest of the series.

Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/vmscan.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc8031ef994d..4ea444142c2e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3134,11 +3134,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;
 
-		if (!zone_balanced(zone, order, classzone_idx))
-			return false;
+		if (zone_balanced(zone, order, classzone_idx))
+			return true;
 	}
 
-	return true;
+	return false;
 }
 
 /*
@@ -3331,7 +3331,13 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
-		/* Try to sleep for a short interval */
+		/*
+		 * Try to sleep for a short interval. Note that kcompactd will only be
+		 * woken if it is possible to sleep for a short interval. This is
+		 * deliberate on the assumption that if reclaim cannot keep an
+		 * eligible zone balanced that it's also unlikely that compaction will
+		 * succeed.
+		 */
 		if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
 			/*
 			 * Compaction records what page blocks it recently failed to
-- 
2.11.0
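[To illustrate the logic change outside of kernel context, the following
toy program contrasts the old all-zones sleep check with the new any-zone
check that now matches balance_pgdat(). This is my own sketch; the zone
count and the balanced[] values are invented.]

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_ZONES 3

	/* Before the patch: every eligible zone had to be balanced. */
	static bool old_check(const bool balanced[], int classzone_idx)
	{
		for (int i = 0; i <= classzone_idx; i++)
			if (!balanced[i])
				return false;
		return true;
	}

	/* After the patch: any one balanced zone suffices, as in balance_pgdat(). */
	static bool new_check(const bool balanced[], int classzone_idx)
	{
		for (int i = 0; i <= classzone_idx; i++)
			if (balanced[i])
				return true;
		return false;
	}

	int main(void)
	{
		/* e.g. a small lower zone unbalanced, a large zone balanced */
		bool balanced[NR_ZONES] = { false, true, false };

		printf("old: %d, new: %d\n",
		       old_check(balanced, NR_ZONES - 1),
		       new_check(balanced, NR_ZONES - 1));
		return 0;
	}

With any one zone unbalanced, the old check never passes, so the sleep
path that wakes kcompactd is never reached for order > 0.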
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep

From: Vlastimil Babka @ 2017-03-10  9:06 UTC
To: Mel Gorman, Andrew Morton
Cc: Shantanu Goel, Johannes Weiner, LKML, Linux-MM

On 03/09/2017 08:56 AM, Mel Gorman wrote:
> From: Shantanu Goel <sgoel01@yahoo.com>
>
> The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
> since the latter will return as soon as any one of the zones in the
> classzone is above the watermark. This is especially important for
> higher-order allocations since balance_pgdat will typically reset the
> order to zero, relying on compaction to create the higher-order pages.
> Without this patch, prepare_kswapd_sleep fails to wake up kcompactd since
> the zone balance check fails.
>
> [...]
> This patch is included with the data in case a bisection leads to this
> area. This patch is also a pre-requisite for the rest of the series.
>
> Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
* [PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced

From: Mel Gorman @ 2017-03-09  7:56 UTC
To: Andrew Morton
Cc: Shantanu Goel, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman

A pgdat tracks if recent reclaim encountered too many dirty, writeback
or congested pages. The flags control whether kswapd writes pages back
from reclaim context, tags pages for immediate reclaim when IO completes,
whether processes block on wait_iff_congested and whether kswapd blocks
when too many pages marked for immediate reclaim are encountered.

The state is cleared in a check function with side-effects. With the patch
"mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the timing
of when the bits get cleared changed. Due to the way the check works,
it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA allocation
because it does not account for lowmem reserves properly.

For the simoop workload, kswapd is not stalling when it should due to the
premature clearing, is writing pages from reclaim context like crazy and
is generally being unhelpful.

This patch resets the pgdat bits related to page reclaim only when kswapd
is going to sleep. The comparison with simoop is then as follows:

                                         4.11.0-rc1             4.11.0-rc1             4.11.0-rc1
                                            vanilla            fixcheck-v2               clear-v2
Amean    p50-Read             21670074.18 (  0.00%) 20464344.18 (  5.56%) 19786774.76 (  8.69%)
Amean    p95-Read             25456267.64 (  0.00%) 25721423.64 ( -1.04%) 24101956.27 (  5.32%)
Amean    p99-Read             29369064.73 (  0.00%) 30174230.76 ( -2.74%) 27691872.71 (  5.71%)
Amean    p50-Write                1390.30 (  0.00%)     1395.28 ( -0.36%)     1011.91 ( 27.22%)
Amean    p95-Write              412901.57 (  0.00%)    37737.74 ( 90.86%)    34874.98 ( 91.55%)
Amean    p99-Write             6668722.09 (  0.00%)   666489.04 ( 90.01%)   575449.60 ( 91.37%)
Amean    p50-Allocation          78714.31 (  0.00%)    86286.22 ( -9.62%)    84246.26 ( -7.03%)
Amean    p95-Allocation         175533.51 (  0.00%)   351812.27 (-100.42%)  400058.43 (-127.91%)
Amean    p99-Allocation         247003.02 (  0.00%)  6291171.56 (-2447.00%) 10905600.00 (-4315.17%)

Read latency is improved, write latency is mostly improved but allocation
latency is regressed. kswapd is still reclaiming inefficiently, pages are
being written back from reclaim context and there are a host of other
issues. However, given the change, it needed to be spelled out why the
side-effect was moved.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4ea444142c2e..17b1afbce88e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3091,17 +3091,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
 		return false;
 
-	/*
-	 * If any eligible zone is balanced then the node is not considered
-	 * to be congested or dirty
-	 */
-	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
-	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
-	clear_bit(PGDAT_WRITEBACK, &zone->zone_pgdat->flags);
-
 	return true;
 }
 
+/* Clear pgdat state for congested, dirty or under writeback. */
+static void clear_pgdat_congested(pg_data_t *pgdat)
+{
+	clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+	clear_bit(PGDAT_DIRTY, &pgdat->flags);
+	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+}
+
 /*
  * Prepare kswapd for sleeping. This verifies that there are no processes
  * waiting in throttle_direct_reclaim() and that watermarks have been met.
@@ -3134,8 +3134,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;
 
-		if (zone_balanced(zone, order, classzone_idx))
+		if (zone_balanced(zone, order, classzone_idx)) {
+			clear_pgdat_congested(pgdat);
 			return true;
+		}
 	}
 
 	return false;
-- 
2.11.0
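[As a rough userspace model of what this patch moves, the difference is
purely about *when* the bits are cleared. The PGDAT_* values and the
event sequence below are invented for illustration; only the clearing
helper mirrors the patch.]

	#include <stdio.h>

	#define PGDAT_CONGESTED		(1u << 0)
	#define PGDAT_DIRTY		(1u << 1)
	#define PGDAT_WRITEBACK		(1u << 2)

	struct toy_pgdat {
		unsigned int flags;
	};

	/* After the patch: clearing happens here, only on the path to sleep. */
	static void clear_pgdat_congested(struct toy_pgdat *pgdat)
	{
		pgdat->flags &= ~(PGDAT_CONGESTED | PGDAT_DIRTY | PGDAT_WRITEBACK);
	}

	/*
	 * Before the patch, the equivalent clearing lived inside the balance
	 * check itself, so checking a small lower zone (e.g. ZONE_DMA for a
	 * GFP_DMA request) could wipe node-wide congestion state as a
	 * side-effect while the node was still under pressure.
	 */
	int main(void)
	{
		struct toy_pgdat pgdat = { .flags = PGDAT_CONGESTED | PGDAT_WRITEBACK };

		printf("flags while reclaiming: %#x\n", pgdat.flags);
		clear_pgdat_congested(&pgdat);	/* kswapd decided to sleep */
		printf("flags going to sleep:   %#x\n", pgdat.flags);
		return 0;
	}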
* Re: [PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced

From: Vlastimil Babka @ 2017-03-10  9:06 UTC
To: Mel Gorman, Andrew Morton
Cc: Shantanu Goel, Johannes Weiner, LKML, Linux-MM

On 03/09/2017 08:56 AM, Mel Gorman wrote:
> A pgdat tracks if recent reclaim encountered too many dirty, writeback
> or congested pages. The flags control whether kswapd writes pages back
> from reclaim context, tags pages for immediate reclaim when IO completes,
> whether processes block on wait_iff_congested and whether kswapd blocks
> when too many pages marked for immediate reclaim are encountered.
>
> [...]
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
* [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

From: Mel Gorman @ 2017-03-09  7:56 UTC
To: Andrew Morton
Cc: Shantanu Goel, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman

kswapd is woken to reclaim a node based on a failed allocation request
from any eligible zone. Once reclaiming in balance_pgdat(), it will
continue reclaiming until there is an eligible zone available for the
classzone it was woken for. kswapd tracks what zone it was recently woken
for in pgdat->kswapd_classzone_idx. If it has not been woken recently,
this will be 0.

However, the decision on whether to sleep is made on
kswapd_classzone_idx, which is 0 without a recent wakeup request, and
that classzone does not account for lowmem reserves. This allows kswapd
to sleep when a small lower zone such as ZONE_DMA is balanced for a
GFP_DMA request, even if a stream of allocations cannot use that zone.
While kswapd may be woken again in the near future, there are two
consequences -- the pgdat bits that control congestion are cleared
prematurely, and direct reclaim is more likely as kswapd slept
prematurely.

This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
invalid index) when there have been no recent wakeups. If there are no
wakeups, it'll decide whether to sleep based on the highest possible zone
available (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat
balanced" decisions during reclaim and when deciding to sleep are the
same. If there is a mismatch, kswapd can stay awake continually trying to
balance tiny zones.

simoop was used to evaluate it again. Two of the preparation patches
regressed the workload, so they are included as the second set of
results. Otherwise this patch would look artificially excellent.

                                         4.11.0-rc1             4.11.0-rc1             4.11.0-rc1
                                            vanilla               clear-v2           keepawake-v2
Amean    p50-Read             21670074.18 (  0.00%) 19786774.76 (  8.69%) 22668332.52 ( -4.61%)
Amean    p95-Read             25456267.64 (  0.00%) 24101956.27 (  5.32%) 26738688.00 ( -5.04%)
Amean    p99-Read             29369064.73 (  0.00%) 27691872.71 (  5.71%) 30991404.52 ( -5.52%)
Amean    p50-Write                1390.30 (  0.00%)     1011.91 ( 27.22%)      924.91 ( 33.47%)
Amean    p95-Write              412901.57 (  0.00%)    34874.98 ( 91.55%)     1362.62 ( 99.67%)
Amean    p99-Write             6668722.09 (  0.00%)   575449.60 ( 91.37%)    16854.04 ( 99.75%)
Amean    p50-Allocation          78714.31 (  0.00%)    84246.26 ( -7.03%)    74729.74 (  5.06%)
Amean    p95-Allocation         175533.51 (  0.00%)   400058.43 (-127.91%)  101609.74 ( 42.11%)
Amean    p99-Allocation         247003.02 (  0.00%) 10905600.00 (-4315.17%)  125765.57 ( 49.08%)

With this patch on top, write and allocation latencies are massively
improved. The read latencies are slightly impaired, but it's worth noting
that this is mostly due to the IO scheduler and not directly related to
reclaim.
The vmstats are a bit of a mix but the relevant ones are as follows:

                            4.10.0-rc7     4.10.0-rc7     4.10.0-rc7
                        mmots-20170209    clear-v1r25 keepawake-v1r25
Swap Ins                             0              0              0
Swap Outs                            0            608              0
Direct pages scanned           6910672        3132699        6357298
Kswapd pages scanned          57036946       82488665       56986286
Kswapd pages reclaimed        55993488       63474329       55939113
Direct pages reclaimed         6905990        2964843        6352115
Kswapd efficiency                  98%            76%            98%
Kswapd velocity              12494.375      17597.507      12488.065
Direct efficiency                  99%            94%            99%
Direct velocity               1513.835        668.306       1393.148
Page writes by reclaim           0.000    4410243.000          0.000
Page writes file                     0        4409635              0
Page writes anon                     0            608              0
Page reclaim immediate         1036792       14175203        1042571

                            4.11.0-rc1     4.11.0-rc1     4.11.0-rc1
                               vanilla       clear-v2   keepawake-v2
Swap Ins                             0             12              0
Swap Outs                            0            838              0
Direct pages scanned           6579706        3237270        6256811
Kswapd pages scanned          61853702       79961486       54837791
Kswapd pages reclaimed        60768764       60755788       53849586
Direct pages reclaimed         6579055        2987453        6256151
Kswapd efficiency                  98%            75%            98%
Page writes by reclaim           0.000    4389496.000          0.000
Page writes file                     0        4388658              0
Page writes anon                     0            838              0
Page reclaim immediate         1073573       14473009         982507

Swap-outs are equivalent to baseline. Direct reclaim is reduced but not
eliminated. It's worth noting that there are two periods of direct
reclaim for this workload. The first is when the workload switches from
preparing the files to the actual test itself. That is a lot of file IO
followed by a lot of allocations, which reclaims heavily for a brief
window. While direct reclaim is lower with clear-v2, it is due to kswapd
scanning aggressively and trying to reclaim the world, which is not the
right thing to do. With the patches applied, there is still direct
reclaim, but it occurs during the phase change from "creating work files"
to starting multiple threads that allocate a lot of anonymous memory
faster than kswapd can reclaim it.

Scanning/reclaim efficiency is restored by this patch. Page writes from
reclaim context are back at 0, which is ideal. The number of pages
immediately reclaimed after IO completes is slightly improved, but it is
expected this will vary slightly.

On UMA, there is almost no change, so this is not expected to be a
universal win.
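[For reference, the "Kswapd efficiency" rows above appear to be derived
as pages reclaimed divided by pages scanned; that is my reading of the
report, shown here for the vanilla column of the first table.]

	#include <stdio.h>

	int main(void)
	{
		/* "Kswapd pages scanned" and "Kswapd pages reclaimed", vanilla */
		unsigned long long scanned = 57036946ULL;
		unsigned long long reclaimed = 55993488ULL;

		/* Prints 98%, matching the "Kswapd efficiency" row above */
		printf("kswapd efficiency: %.0f%%\n",
		       100.0 * (double)reclaimed / (double)scanned);
		return 0;
	}

The same arithmetic on the clear-v1r25 column (63474329 / 82488665) gives
the 76% figure, which is the efficiency drop the changelogs refer to.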
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memory_hotplug.c |   2 +-
 mm/vmscan.c         | 118 +++++++++++++++++++++++++++++-----------------------
 2 files changed, 66 insertions(+), 54 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 295479b792ec..edff09061e32 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1207,7 +1207,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 		/* Reset the nr_zones, order and classzone_idx before reuse */
 		pgdat->nr_zones = 0;
 		pgdat->kswapd_order = 0;
-		pgdat->kswapd_classzone_idx = 0;
+		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	}
 
 	/* we can use NODE_DATA(nid) from here */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 17b1afbce88e..6b09ed5e4bda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3084,14 +3084,36 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
+/*
+ * Returns true if there is an eligible zone balanced for the request order
+ * and classzone_idx
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
-	unsigned long mark = high_wmark_pages(zone);
+	int i;
+	unsigned long mark = -1;
+	struct zone *zone;
 
-	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
-		return false;
+	for (i = 0; i <= classzone_idx; i++) {
+		zone = pgdat->node_zones + i;
 
-	return true;
+		if (!managed_zone(zone))
+			continue;
+
+		mark = high_wmark_pages(zone);
+		if (zone_watermark_ok_safe(zone, order, mark, classzone_idx))
+			return true;
+	}
+
+	/*
+	 * If a node has no populated zone within classzone_idx, it does not
+	 * need balancing by definition. This can happen if a zone-restricted
+	 * allocation tries to wake a remote kswapd.
+	 */
+	if (mark == -1)
+		return true;
+
+	return false;
 }
 
 /* Clear pgdat state for congested, dirty or under writeback. */
@@ -3110,8 +3132,6 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
  */
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
-	int i;
-
 	/*
 	 * The throttled processes are normally woken up in balance_pgdat() as
 	 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
@@ -3128,16 +3148,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
-	for (i = 0; i <= classzone_idx; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		if (!managed_zone(zone))
-			continue;
-
-		if (zone_balanced(zone, order, classzone_idx)) {
-			clear_pgdat_congested(pgdat);
-			return true;
-		}
+	if (pgdat_balanced(pgdat, order, classzone_idx)) {
+		clear_pgdat_congested(pgdat);
+		return true;
 	}
 
 	return false;
@@ -3243,23 +3256,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		}
 
 		/*
-		 * Only reclaim if there are no eligible zones. Check from
-		 * high to low zone as allocations prefer higher zones.
-		 * Scanning from low to high zone would allow congestion to be
-		 * cleared during a very small window when a small low
-		 * zone was balanced even under extreme pressure when the
-		 * overall node may be congested. Note that sc.reclaim_idx
-		 * is not used as buffer_heads_over_limit may have adjusted
-		 * it.
+		 * Only reclaim if there are no eligible zones. Note that
+		 * sc.reclaim_idx is not used as buffer_heads_over_limit may
+		 * have adjusted it.
		 */
-		for (i = classzone_idx; i >= 0; i--) {
-			zone = pgdat->node_zones + i;
-			if (!managed_zone(zone))
-				continue;
-
-			if (zone_balanced(zone, sc.order, classzone_idx))
-				goto out;
-		}
+		if (pgdat_balanced(pgdat, sc.order, classzone_idx))
+			goto out;
 
 		/*
 		 * Do some background aging of the anon list, to give
@@ -3322,6 +3324,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	return sc.order;
 }
 
+/*
+ * pgdat->kswapd_classzone_idx is the highest zone index that a recent
+ * allocation request woke kswapd for. When kswapd has not woken recently,
+ * the value is MAX_NR_ZONES which is not a valid index. This compares a
+ * given classzone and returns it or the highest classzone index kswapd
+ * was recently woken for.
+ */
+static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
+					   enum zone_type classzone_idx)
+{
+	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
+		return classzone_idx;
+
+	return max(pgdat->kswapd_classzone_idx, classzone_idx);
+}
+
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order,
 				int reclaim_order, unsigned int classzone_idx)
 {
@@ -3363,7 +3381,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	 * the previous request that slept prematurely.
 	 */
 	if (remaining) {
-		pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+		pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
 		pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
 	}
 
@@ -3417,7 +3435,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
  */
 static int kswapd(void *p)
 {
-	unsigned int alloc_order, reclaim_order, classzone_idx;
+	unsigned int alloc_order, reclaim_order;
+	unsigned int classzone_idx = MAX_NR_ZONES - 1;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 
@@ -3447,20 +3466,23 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
-	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
-	pgdat->kswapd_classzone_idx = classzone_idx = 0;
+	pgdat->kswapd_order = 0;
+	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
 		bool ret;
 
+		alloc_order = reclaim_order = pgdat->kswapd_order;
+		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
+
 kswapd_try_sleep:
 		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
 					classzone_idx);
 
 		/* Read the new order and classzone_idx */
 		alloc_order = reclaim_order = pgdat->kswapd_order;
-		classzone_idx = pgdat->kswapd_classzone_idx;
+		classzone_idx = kswapd_classzone_idx(pgdat, 0);
 		pgdat->kswapd_order = 0;
-		pgdat->kswapd_classzone_idx = 0;
+		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -3486,9 +3508,6 @@ static int kswapd(void *p)
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
-
-		alloc_order = reclaim_order = pgdat->kswapd_order;
-		classzone_idx = pgdat->kswapd_classzone_idx;
 	}
 
 	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
@@ -3504,7 +3523,6 @@ static int kswapd(void *p)
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
-	int z;
 
 	if (!managed_zone(zone))
 		return;
@@ -3512,22 +3530,16 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
 		return;
 	pgdat = zone->zone_pgdat;
-	pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+	pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
 	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
 	/* Only wake kswapd if all zones are unbalanced */
-	for (z = 0; z <= classzone_idx; z++) {
-		zone = pgdat->node_zones + z;
-		if (!managed_zone(zone))
-			continue;
-
-		if (zone_balanced(zone, order, classzone_idx))
-			return;
-	}
+	if (pgdat_balanced(pgdat, order, classzone_idx))
+		return;
 
-	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, classzone_idx, order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
-- 
2.11.0
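[The sentinel scheme is easy to model in isolation. Below is a toy sketch
of how wakeup requests merge and how "no pending request" is now
distinguished from "woken for ZONE_DMA"; the constants mirror the patch
but nothing here is kernel code, and the zone count is invented.]

	#include <stdio.h>

	#define MAX_NR_ZONES	4	/* toy value; doubles as "no pending wakeup" */

	static int pending_classzone = MAX_NR_ZONES;	/* no request yet */

	/* Mirrors kswapd_classzone_idx(): prefer the highest requested zone. */
	static int kswapd_classzone_idx(int classzone_idx)
	{
		if (pending_classzone == MAX_NR_ZONES)
			return classzone_idx;
		return pending_classzone > classzone_idx ?
			pending_classzone : classzone_idx;
	}

	int main(void)
	{
		/* No wakeup yet: fall back to the highest possible zone index. */
		printf("idle default: %d\n", kswapd_classzone_idx(MAX_NR_ZONES - 1));

		/* A ZONE_DMA (index 0) wakeup no longer looks like "no request". */
		pending_classzone = 0;
		printf("after DMA wakeup: %d\n", kswapd_classzone_idx(0));
		return 0;
	}

Before the series, 0 meant both "no recent wakeup" and "woken for
ZONE_DMA", which is exactly the ambiguity that let kswapd sleep against
the wrong classzone.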
* [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely

From: Mel Gorman @ 2017-02-15  9:22 UTC
To: Andrew Morton
Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman

This patchset is based on mmots as of Feb 9th, 2017. The baseline is
important as there are a number of kswapd-related fixes in that tree and
a comparison against v4.10-rc7 would be almost meaningless as a result.

The series is unusual in that the first patch fixes one problem and
introduces a host of other issues and is incomplete. It was not developed
by me but it appears to have gotten lost, so I picked it up and added to
the changelog. Patch 2 makes a minor modification that is worth
considering on its own but leaves the kernel in a state where it behaves
badly. It's not until patch 3 that there is an improvement against
baseline.

This was mostly motivated by examining Chris Mason's "simoop" benchmark
which puts the VM under pressure similar to HADOOP. It has been reported
that the benchmark has regressed severely over the last several releases.
While I cannot reproduce all the same problems Chris experienced due to
hardware limitations, there were a number of problems on a 2-socket
machine with a single disk.

                                         4.10.0-rc7             4.10.0-rc7
                                     mmots-20170209        keepawake-v1r25
Amean    p50-Read             22325202.49 (  0.00%) 22092755.48 (  1.04%)
Amean    p95-Read             26102988.80 (  0.00%) 26101849.04 (  0.00%)
Amean    p99-Read             30935176.53 (  0.00%) 29746220.52 (  3.84%)
Amean    p50-Write                 976.44 (  0.00%)      952.73 (  2.43%)
Amean    p95-Write               15471.29 (  0.00%)     3140.27 ( 79.70%)
Amean    p99-Write               35108.62 (  0.00%)     8843.73 ( 74.81%)
Amean    p50-Allocation          76382.61 (  0.00%)    76349.22 (  0.04%)
Amean    p95-Allocation         127777.39 (  0.00%)   108630.26 ( 14.98%)
Amean    p99-Allocation         187937.39 (  0.00%)   139094.26 ( 25.99%)

These are latencies. Read/write are threads reading fixed-size random
blocks from a simulated database. The allocation latency is mmaping and
faulting regions of memory. The p50, p95 and p99 figures report the worst
latency experienced by 50%, 95% and 99% of the samples respectively. For
example, the report indicates that while the test was running, 99% of
writes completed 74.81% faster. It's worth noting that on a UMA machine
no difference in performance with simoop was observed, so mileage will
vary.

On UMA, there was a notable difference in the "stutter" benchmark, which
measures the latency of mmap while large files are being copied. This has
been used as a proxy measure for desktop jitter while large amounts of IO
are taking place.

                             4.10.0-rc7            4.10.0-rc7
                         mmots-20170209          keepawake-v1
Min         mmap      6.3847 (  0.00%)      5.9785 (  6.36%)
1st-qrtle   mmap      7.6310 (  0.00%)      7.4086 (  2.91%)
2nd-qrtle   mmap      9.9959 (  0.00%)      7.7052 ( 22.92%)
3rd-qrtle   mmap     14.8180 (  0.00%)      8.5895 ( 42.03%)
Max-90%     mmap     15.8397 (  0.00%)     13.6974 ( 13.52%)
Max-93%     mmap     16.4268 (  0.00%)     14.3175 ( 12.84%)
Max-95%     mmap     18.3295 (  0.00%)     16.9233 (  7.67%)
Max-99%     mmap     24.2042 (  0.00%)     20.6182 ( 14.82%)
Max         mmap    255.0688 (  0.00%)    265.5818 ( -4.12%)
Mean        mmap     11.2192 (  0.00%)      9.1811 ( 18.17%)

Latency is measured in milliseconds and indicates that 99% of mmap
operations completed 14.82% faster, and 18.17% faster on average, with
these patches applied.
Mel Gorman (2):
  mm, vmscan: Only clear pgdat congested/dirty/writeback state when
    balanced
  mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched
    classzone_idx

Shantanu Goel (1):
  mm, vmscan: fix zone balance check in prepare_kswapd_sleep

 mm/memory_hotplug.c |   2 +-
 mm/vmscan.c         | 128 +++++++++++++++++++++++++++++-----------------------
 2 files changed, 72 insertions(+), 58 deletions(-)

-- 
2.11.0
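[As a rough userspace analogue of the stutter measurement described in
the cover letter -- this is not the mmtests stutter benchmark itself,
just a sketch of the idea -- one can time an anonymous mmap plus
first-touch faults; run it while a large file copy is in flight to see
the jitter the table above quantifies.]

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>

	int main(void)
	{
		struct timespec a, b;
		size_t len = 4 << 20;	/* 4MB region */
		char *p;

		clock_gettime(CLOCK_MONOTONIC, &a);
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, len);	/* fault every page in */
		clock_gettime(CLOCK_MONOTONIC, &b);

		printf("mmap+fault: %.3f ms\n",
		       (b.tv_sec - a.tv_sec) * 1e3 +
		       (b.tv_nsec - a.tv_nsec) / 1e6);
		munmap(p, len);
		return 0;
	}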
* [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep

From: Mel Gorman @ 2017-02-15  9:22 UTC
To: Andrew Morton
Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman

From: Shantanu Goel <sgoel01@yahoo.com>

The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
since the latter will return as soon as any one of the zones in the
classzone is above the watermark. This is especially important for
higher-order allocations since balance_pgdat will typically reset the
order to zero, relying on compaction to create the higher-order pages.
Without this patch, prepare_kswapd_sleep fails to wake up kcompactd since
the zone balance check fails.

On 4.9.7 kswapd is failing to wake up kcompactd due to a mismatch in the
zone balance check between balance_pgdat() and prepare_kswapd_sleep().
balance_pgdat() returns as soon as a single zone satisfies the allocation
but prepare_kswapd_sleep() requires all zones to do the same. This causes
prepare_kswapd_sleep() to never succeed except in the order == 0 case
and consequently, wakeup_kcompactd() is never called. On my machine,
prior to applying this patch, the state of compaction from /proc/vmstat
looked this way after a day and a half of uptime:

compact_migrate_scanned 240496
compact_free_scanned 76238632
compact_isolated 123472
compact_stall 1791
compact_fail 29
compact_success 1762
compact_daemon_wake 0

After applying the patch and about 10 hours of uptime the state looks
like this:

compact_migrate_scanned 59927299
compact_free_scanned 2021075136
compact_isolated 640926
compact_stall 4
compact_fail 2
compact_success 2
compact_daemon_wake 5160

Further notes from Mel that motivated him to pick this patch up and
resend it:

It was observed for the simoop workload (pressures the VM similarly to
HADOOP) that kswapd was failing to keep ahead of direct reclaim. The
investigation noted that there was a need to rationalise kswapd decisions
to reclaim with kswapd decisions to sleep. With this patch on a 2-socket
box, there was a 43% reduction in direct reclaim scanning.

However, the impact otherwise is extremely negative. Kswapd reclaim
efficiency dropped from 98% to 76%. simoop has three latency-related
metrics for read, write and allocation (an anonymous mmap and fault).

                                         4.10.0-rc7             4.10.0-rc7
                                     mmots-20170209            fixcheck-v1
Amean    p50-Read             22325202.49 (  0.00%) 20026926.55 ( 10.29%)
Amean    p95-Read             26102988.80 (  0.00%) 27023360.00 ( -3.53%)
Amean    p99-Read             30935176.53 (  0.00%) 30994432.00 ( -0.19%)
Amean    p50-Write                 976.44 (  0.00%)     1905.28 ( -95.12%)
Amean    p95-Write               15471.29 (  0.00%)    36210.09 (-134.05%)
Amean    p99-Write               35108.62 (  0.00%)   479494.96 (-1265.75%)
Amean    p50-Allocation          76382.61 (  0.00%)    87603.20 ( -14.69%)
Amean    p95-Allocation         127777.39 (  0.00%)   244491.38 ( -91.34%)
Amean    p99-Allocation         187937.39 (  0.00%)  1745237.33 (-828.63%)

There are also more allocation stalls. One of the largest impacts was due
to pages written back from kswapd context rising from 0 pages to 4516642
pages during the hour the workload ran for. By and large, the patch has
very bad behaviour, but this is easily missed as the impact on a UMA
machine is negligible.

This patch is included with the data in case a bisection leads to this
area. This patch is also a pre-requisite for the rest of the series.
Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..92fc66bd52bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3140,11 +3140,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;
 
-		if (!zone_balanced(zone, order, classzone_idx))
-			return false;
+		if (zone_balanced(zone, order, classzone_idx))
+			return true;
 	}
 
-	return true;
+	return false;
 }
 
 /*
-- 
2.11.0
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep

From: Hillf Danton @ 2017-02-16  2:50 UTC
To: Mel Gorman, Andrew Morton
Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM

On February 15, 2017 5:23 PM Mel Gorman wrote:
>
> From: Shantanu Goel <sgoel01@yahoo.com>
>
> The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
> since the latter will return as soon as any one of the zones in the
> classzone is above the watermark. This is especially important for
> higher-order allocations since balance_pgdat will typically reset the
> order to zero, relying on compaction to create the higher-order pages.
> Without this patch, prepare_kswapd_sleep fails to wake up kcompactd since
> the zone balance check fails.
>
> [...]
> By and large, the patch has very bad behaviour, but this is easily missed
> as the impact on a UMA machine is negligible.
>
> This patch is included with the data in case a bisection leads to this
> area. This patch is also a pre-requisite for the rest of the series.
>
> Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep

From: Minchan Kim @ 2017-02-22  7:00 UTC
To: Mel Gorman
Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM

Hi,

On Wed, Feb 15, 2017 at 09:22:45AM +0000, Mel Gorman wrote:
> From: Shantanu Goel <sgoel01@yahoo.com>
>
> The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
> since the latter will return as soon as any one of the zones in the
> classzone is above the watermark. This is especially important for
> higher-order allocations since balance_pgdat will typically reset the
> order to zero, relying on compaction to create the higher-order pages.
> Without this patch, prepare_kswapd_sleep fails to wake up kcompactd since
> the zone balance check fails.
>
> [...]
> By and large, the patch has very bad behaviour, but this is easily missed
> as the impact on a UMA machine is negligible.
>
> This patch is included with the data in case a bisection leads to this
> area. This patch is also a pre-requisite for the rest of the series.
>
> Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
short sleep point where all eligible zones are balanced. What's the
correlation between them?

Can't we wake up kcompactd once we find a zone that has enough free pages
above the high watermark, like this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..f4f0ad0e9ede 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3346,13 +3346,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 		 * that pages and compaction may succeed so reset the cache.
 		 */
 		reset_isolation_suitable(pgdat);
-
-		/*
-		 * We have freed the memory, now we should compact it to make
-		 * allocation of the requested order possible.
-		 */
-		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
-
 		remaining = schedule_timeout(HZ/10);
 
 		/*
@@ -3451,6 +3444,14 @@ static int kswapd(void *p)
 		bool ret;
 
 kswapd_try_sleep:
+		/*
+		 * We have freed the memory, now we should compact it to make
+		 * allocation of the requested order possible.
+		 */
+		if (alloc_order > 0 && zone_balanced(zone, reclaim_order,
+					classzone_idx))
+			wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
+
 		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
 					classzone_idx);
-- 
2.7.4
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep

From: Mel Gorman @ 2017-02-23 15:05 UTC
To: Minchan Kim
Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM

On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote:
> > There are also more allocation stalls. One of the largest impacts was due
> > to pages written back from kswapd context rising from 0 pages to 4516642
> > pages during the hour the workload ran for. By and large, the patch has
> > very bad behaviour, but this is easily missed as the impact on a UMA
> > machine is negligible.
> >
> > This patch is included with the data in case a bisection leads to this
> > area. This patch is also a pre-requisite for the rest of the series.
> >
> > Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>
> Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
> short sleep point where all eligible zones are balanced.
> What's the correlation between them?

If kswapd is ready for a short sleep, eligible zones are balanced for
order-0 but not necessarily the originally requested order if kswapd gave
up reclaiming because compaction was ready to start. As kswapd is ready
to sleep for a short period, it's a suitable time for kcompactd to decide
whether it should start working or not. There is no need for kswapd to be
aware of kcompactd's wakeup criteria.

> Can't we wake up kcompactd once we find a zone that has enough free pages
> above the high watermark, like this?
>
> [...]

That's functionally very similar to what happens already. wakeup_kcompactd
checks the order and does not wake for order-0. It also makes its own
decisions, including zone_balanced checks, on whether it is safe to wake
up. I doubt there would be any measurable difference from a patch like
this and, to my mind at least, it does not improve the readability or
flow of the code.

-- 
Mel Gorman
SUSE Labs
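[A toy paraphrase of the gating Mel describes, condensed into one
standalone function. This is assumed logic only; wakeup_kcompactd()'s
real checks in mm/compaction.c are more involved.]

	#include <stdbool.h>
	#include <stdio.h>

	/* Hypothetical condensation of the wakeup conditions described above. */
	static bool should_wake_kcompactd(int order, bool daemon_idle,
					  bool compaction_suitable)
	{
		if (order == 0)
			return false;		/* nothing to compact for order-0 */
		if (!daemon_idle)
			return false;		/* kcompactd is already running */
		return compaction_suitable;	/* kcompactd's own criteria */
	}

	int main(void)
	{
		printf("order-0: %d, order-3: %d\n",
		       should_wake_kcompactd(0, true, true),
		       should_wake_kcompactd(3, true, true));
		return 0;
	}

The point being made in the reply is that these checks already live
inside the callee, so hoisting an equivalent test into the caller changes
little.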
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep

From: Minchan Kim @ 2017-02-24  1:17 UTC
To: Mel Gorman
Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM

Hi Mel,

On Thu, Feb 23, 2017 at 03:05:34PM +0000, Mel Gorman wrote:
> On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote:
> > Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
> > short sleep point where all eligible zones are balanced.
> > What's the correlation between them?
>
> If kswapd is ready for a short sleep, eligible zones are balanced for
> order-0 but not necessarily the originally requested order if kswapd gave
> up reclaiming because compaction was ready to start. As kswapd is ready
> to sleep for a short period, it's a suitable time for kcompactd to decide
> whether it should start working or not. There is no need for kswapd to be
> aware of kcompactd's wakeup criteria.

If all eligible zones are balanced for order-0, I agree it's good timing
because the high-order allocation success ratio would be higher since
kcompactd can compact all eligible zones, not only the classzone.
However, this patch breaks that, as well as the long-standing kswapd
behaviour of continuing to balance all eligible zones for order-0.
Is it really okay now?

> > Can't we wake up kcompactd once we find a zone that has enough free pages
> > above the high watermark, like this?
> >
> > [...]
>
> That's functionally very similar to what happens already. wakeup_kcompactd
> checks the order and does not wake for order-0. It also makes its own
> decisions, including zone_balanced checks, on whether it is safe to wake
> up.

Agree.
> > I doubt there would be any measurable difference from a patch like this > and to my mind at least, it does not improve the readability or flow of > the code. However, my concern is premature kswapd sleep for order-0 which has been long time behavior so I hope it should be documented why it's okay now. Thanks. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  From: Mel Gorman @ 2017-02-24 9:11 UTC
  To: Minchan Kim
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM

On Fri, Feb 24, 2017 at 10:17:06AM +0900, Minchan Kim wrote:
> If all eligible zones are balanced for order-0, I agree it is good timing
> because the high-order allocation success ratio would be higher, since
> kcompactd can compact all eligible zones, not only the classzone.
> However, this patch breaks that, as well as the long-standing kswapd
> behaviour of continuing to balance all eligible zones for order-0.
> Is it really okay now?

Reclaim stops in balance_pgdat() if any eligible zone for the requested
classzone is balanced. The initial sleep check for kswapd is very different
because it'll only sleep if all zones are balanced for order-0, which is a
bad disconnect. The way node balancing works means there is no guarantee at
all that all zones will be balanced, even when there is little or no memory
pressure, and one large zone in a node with multiple zones can be balanced
quickly.

The short-sleep logic that kswapd uses to decide whether to go to sleep is
short-circuited: it does not properly attempt the short sleep that checks
whether the high watermarks are quickly reached. Instead, it quickly fails
the first attempt at sleep, re-enters balance_pgdat(), finds nothing to do
and then rechecks sleeping based on order-0 and classzone-0, which it can
easily sleep for but is *not* what kswapd was woken for in the first place.

For many allocation requests that initially woke kswapd, the impact is
marginal. kswapd sleeps early and is woken in the near future if there is
a continual stream of allocations, with a risk that direct reclaim is
required. While the motivation for the patch was that kcompactd is not
woken up, the existing behaviour is simply wrong -- kswapd should be
deciding to sleep based on the classzone it was woken for and, if possible,
the order it was woken for, but the classzone is the more important in the
common case of order-0 allocations.

--
Mel Gorman
SUSE Labs
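The disconnect described above can be seen by contrasting the two forms of
the sleep check. Again, this is a simplified model with invented zone sizes
and watermarks, not kernel code: with one small zone just below its
watermark, the pre-fix all-zones form refuses to sleep even though
balance_pgdat() already considers the node balanced, while the post-fix
any-zone form agrees with balance_pgdat().

#include <stdbool.h>
#include <stdio.h>

struct zone {
	bool populated;
	long free_pages;
	long high_wmark;
};

/* Invented layout: a small zone just under its watermark and a large
 * zone comfortably above it. */
static struct zone node_zones[] = {
	{ true, 900,  1000 },
	{ true, 5000, 2000 },
};

static bool zone_balanced(const struct zone *z)
{
	return z->free_pages >= z->high_wmark;
}

/* Pre-fix check: every populated eligible zone must be balanced. */
static bool all_eligible_balanced(int classzone_idx)
{
	for (int i = 0; i <= classzone_idx; i++)
		if (node_zones[i].populated && !zone_balanced(&node_zones[i]))
			return false;
	return true;
}

/* Post-fix check: one balanced eligible zone is enough, matching
 * balance_pgdat()'s exit condition. */
static bool any_eligible_balanced(int classzone_idx)
{
	for (int i = 0; i <= classzone_idx; i++)
		if (node_zones[i].populated && zone_balanced(&node_zones[i]))
			return true;
	return false;
}

int main(void)
{
	int classzone_idx = 1;

	printf("pre-fix (all zones): %s\n",
	       all_eligible_balanced(classzone_idx) ? "sleep" : "stay awake");
	printf("post-fix (any zone): %s\n",
	       any_eligible_balanced(classzone_idx) ? "sleep" : "stay awake");
	return 0;
}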
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  From: Minchan Kim @ 2017-02-27 6:16 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM

Hi Mel,

On Fri, Feb 24, 2017 at 09:11:28AM +0000, Mel Gorman wrote:
> Reclaim stops in balance_pgdat() if any eligible zone for the requested
> classzone is balanced. The initial sleep check for kswapd is very different
> because it'll only sleep if all zones are balanced for order-0, which is a
> bad disconnect. The way node balancing works means there is no guarantee at
> all that all zones will be balanced, even when there is little or no memory
> pressure, and one large zone in a node with multiple zones can be balanced
> quickly.

Indeed, but it would tip the balance toward direct reclaim, which could
cause more failures for allocations that rely on kswapd, such as atomic
allocations. On the other hand, if the VM balanced all zones for order-0,
it would cause excessive reclaim with the node-based LRU, unlike the
zone-based one, which is also bad.

> The short-sleep logic that kswapd uses to decide whether to go to sleep is
> short-circuited: it does not properly attempt the short sleep that checks
> whether the high watermarks are quickly reached. Instead, it quickly fails
> the first attempt at sleep, re-enters balance_pgdat(), finds nothing to do
> and then rechecks sleeping based on order-0 and classzone-0, which it can
> easily sleep for but is *not* what kswapd was woken for in the first place.
>
> For many allocation requests that initially woke kswapd, the impact is
> marginal. kswapd sleeps early and is woken in the near future if there is
> a continual stream of allocations, with a risk that direct reclaim is
> required. While the motivation for the patch was that kcompactd is not
> woken up, the existing behaviour is simply wrong -- kswapd should be
> deciding to sleep based on the classzone it was woken for and, if possible,
> the order it was woken for, but the classzone is the more important in the
> common case of order-0 allocations.

I agree, but I think it is rather risky to paper over the order-0
zone-balancing problem with the fix for the missed kcompactd wakeups, so
at the very least the reason it is okay should be documented.

Thanks.
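The balance test the whole thread revolves around is, per zone, a
high-watermark check that also has to account for the allocation order. A
self-contained approximation follows; the real kernel uses
zone_watermark_ok_safe() against per-order free counts, and the numbers and
free_blocks array here are invented. It shows why a zone can be balanced
for order-0 yet unbalanced for a higher order, which is exactly the case
where kcompactd rather than more reclaim is the right tool.

#include <stdbool.h>
#include <stdio.h>

struct zone {
	long free_pages;
	long high_wmark;
	long free_blocks[4];	/* free blocks per order 0..3, invented */
};

/* Approximate watermark test: enough free pages above the high mark,
 * and at least one free block at the requested order or larger. */
static bool zone_watermark_ok(const struct zone *z, int order)
{
	if (z->free_pages < z->high_wmark)
		return false;
	for (int o = order; o < 4; o++)
		if (z->free_blocks[o] > 0)
			return true;
	return false;
}

int main(void)
{
	/* Plenty of free memory, but fragmented: no order-3 blocks. */
	struct zone z = {
		.free_pages = 5000,
		.high_wmark = 2000,
		.free_blocks = { 3000, 200, 10, 0 },
	};

	for (int order = 0; order <= 3; order++)
		printf("order-%d balanced: %s\n", order,
		       zone_watermark_ok(&z, order) ? "yes" : "no");
	return 0;
}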
Thread overview: 13+ messages (newest: 2017-03-10 9:06 UTC)
2017-03-09  7:56 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely v2 Mel Gorman
2017-03-09  7:56 ` [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep Mel Gorman
2017-03-10  9:06   ` Vlastimil Babka
2017-03-09  7:56 ` [PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced Mel Gorman
2017-03-10  9:06   ` Vlastimil Babka
2017-03-09  7:56 ` [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx Mel Gorman

Strict thread matches above, loose matches on Subject: below:
2017-02-15  9:22 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Mel Gorman
2017-02-15  9:22 ` [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep Mel Gorman
2017-02-16  2:50   ` Hillf Danton
2017-02-22  7:00   ` Minchan Kim
2017-02-23 15:05     ` Mel Gorman
2017-02-24  1:17       ` Minchan Kim
2017-02-24  9:11         ` Mel Gorman
2017-02-27  6:16           ` Minchan Kim