* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
[not found] ` <1467970510-21195-8-git-send-email-mgorman@techsingularity.net>
@ 2016-08-29 9:38 ` Srikar Dronamraju
2016-08-30 12:07 ` Mel Gorman
0 siblings, 1 reply; 9+ messages in thread
From: Srikar Dronamraju @ 2016-08-29 9:38 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
linuxppc-dev, Mahesh Salgaonkar, Hari Bathini
> Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
> thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
> patch gets rid of many of the node-based versus zone-based decisions.
>
> o A node is considered balanced when any eligible lower zone is balanced.
> This eliminates one class of age-inversion problem because we avoid
> reclaiming a newer page just because it's in the wrong zone
> o pgdat_balanced disappears because we now only care about one zone being
> balanced.
> o Some anomalies related to writeback and congestion tracking being based on
> zones disappear.
> o kswapd no longer has to take care to reclaim zones in the reverse order
> that the page allocator uses.
> o Most importantly of all, reclaim from node 0 with multiple zones will
> have similar aging and reclaiming characteristics as every
> other node.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
This patch seems to hurt FA_DUMP functionality. This behaviour is not
seen on v4.7 but only after this patch.
So when a kernel runs on a multinode machine where memblock_reserve() has
been used such that most of the nodes have zero available memory, kswapd
ends up consuming 100% CPU time.
This is independent of CONFIG_DEFERRED_STRUCT_PAGE_INIT, i.e. this problem
is seen even with parallel struct page initialization disabled.
top - 13:48:52 up 1:07, 3 users, load average: 15.25, 15.32, 21.18
Tasks: 11080 total, 16 running, 11064 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 2.7 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 15929941+total, 8637824 used, 15843563+free, 2304 buffers
KiB Swap: 91898816 total, 0 used, 91898816 free. 1381312 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10824 root 20 0 0 0 0 R 100.000 0.000 65:30.76 kswapd2
10837 root 20 0 0 0 0 R 100.000 0.000 65:31.17 kswapd15
10823 root 20 0 0 0 0 R 97.059 0.000 65:30.85 kswapd1
10825 root 20 0 0 0 0 R 97.059 0.000 65:31.10 kswapd3
10826 root 20 0 0 0 0 R 97.059 0.000 65:31.18 kswapd4
10827 root 20 0 0 0 0 R 97.059 0.000 65:31.08 kswapd5
10828 root 20 0 0 0 0 R 97.059 0.000 65:30.91 kswapd6
10829 root 20 0 0 0 0 R 97.059 0.000 65:31.17 kswapd7
10830 root 20 0 0 0 0 R 97.059 0.000 65:31.17 kswapd8
10831 root 20 0 0 0 0 R 97.059 0.000 65:31.18 kswapd9
10832 root 20 0 0 0 0 R 97.059 0.000 65:31.12 kswapd10
10833 root 20 0 0 0 0 R 97.059 0.000 65:31.19 kswapd11
10834 root 20 0 0 0 0 R 97.059 0.000 65:31.13 kswapd12
10835 root 20 0 0 0 0 R 97.059 0.000 65:31.09 kswapd13
10836 root 20 0 0 0 0 R 97.059 0.000 65:31.18 kswapd14
277155 srikar 20 0 16960 13760 3264 R 52.941 0.001 0:00.37 top
top - 13:48:55 up 1:07, 3 users, load average: 15.23, 15.32, 21.15
Tasks: 11080 total, 16 running, 11064 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 1.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 15929941+total, 8637824 used, 15843563+free, 2304 buffers
KiB Swap: 91898816 total, 0 used, 91898816 free. 1381312 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10836 root 20 0 0 0 0 R 100.000 0.000 65:33.39 kswapd14
10823 root 20 0 0 0 0 R 100.000 0.000 65:33.05 kswapd1
10824 root 20 0 0 0 0 R 100.000 0.000 65:32.96 kswapd2
10825 root 20 0 0 0 0 R 100.000 0.000 65:33.30 kswapd3
10826 root 20 0 0 0 0 R 100.000 0.000 65:33.38 kswapd4
10827 root 20 0 0 0 0 R 100.000 0.000 65:33.28 kswapd5
10828 root 20 0 0 0 0 R 100.000 0.000 65:33.11 kswapd6
10829 root 20 0 0 0 0 R 100.000 0.000 65:33.37 kswapd7
10830 root 20 0 0 0 0 R 100.000 0.000 65:33.37 kswapd8
10831 root 20 0 0 0 0 R 100.000 0.000 65:33.38 kswapd9
10832 root 20 0 0 0 0 R 100.000 0.000 65:33.32 kswapd10
10833 root 20 0 0 0 0 R 100.000 0.000 65:33.39 kswapd11
10834 root 20 0 0 0 0 R 100.000 0.000 65:33.33 kswapd12
10835 root 20 0 0 0 0 R 100.000 0.000 65:33.29 kswapd13
10837 root 20 0 0 0 0 R 100.000 0.000 65:33.37 kswapd15
277155 srikar 20 0 17536 14912 3264 R 9.091 0.001 0:00.57 top
1092 root rt 0 0 0 0 S 0.455 0.000 0:00.08 watchdog/178
Please note that there is no swap space in use, yet 15 kswapd threads,
corresponding to 15 of the 16 nodes, are running at full throttle. Only
node 0 has usable memory; the memory of the other nodes is fully reserved.
git bisect output
I ran a git bisect between v4.7 and v4.8-rc3, filtered to mm/vmscan.c:
# bad: [d7f05528eedb047efe2288cff777676b028747b6] mm, vmscan: account for skipped pages as a partial scan
# good: [b1123ea6d3b3da25af5c8a9d843bd07ab63213f4] mm: balloon: use general non-lru movable page feature
git bisect start 'HEAD' 'b1123ea6' '--' 'mm/vmscan.c'
# bad: [c4a25635b60d08853a3e4eaae3ab34419a36cfa2] mm: move vmscan writes and file write accounting to the node
git bisect bad c4a25635b60d08853a3e4eaae3ab34419a36cfa2
# bad: [38087d9b0360987a6db46c2c2c4ece37cd048abe] mm, vmscan: simplify the logic deciding whether kswapd sleeps
git bisect bad 38087d9b0360987a6db46c2c2c4ece37cd048abe
# good: [b2e18757f2c9d1cdd746a882e9878852fdec9501] mm, vmscan: begin reclaiming pages on a per-node basis
git bisect good b2e18757f2c9d1cdd746a882e9878852fdec9501
# bad: [1d82de618ddde0f1164e640f79af152f01994c18] mm, vmscan: make kswapd reclaim in terms of nodes
git bisect bad 1d82de618ddde0f1164e640f79af152f01994c18
# good: [f7b60926ebc05944f73d93ffaf6690503b796a88] mm, vmscan: have kswapd only scan based on the highest requested zone
git bisect good f7b60926ebc05944f73d93ffaf6690503b796a88
# first bad commit: [1d82de618ddde0f1164e640f79af152f01994c18] mm, vmscan: make kswapd reclaim in terms of nodes
Here is the perf top output from the kernel where kswapd is hogging the CPU:
- 93.50% 0.01% [kernel] [k] kswapd
- kswapd
- 114.31% shrink_node
- 111.51% shrink_node_memcg
- pgdat_reclaimable
- 95.51% pgdat_reclaimable_pages
- 86.34% pgdat_reclaimable_pages
- 6.69% _find_next_bit.part.0
- 2.47% find_next_bit
- 14.46% pgdat_reclaimable
1.13% _find_next_bit.part.0
+ 0.30% find_next_bit
- 2.38% shrink_slab
- super_cache_count
- 0
- __list_lru_count_one.isra.1
_raw_spin_lock
- 28.04% pgdat_reclaimable
- 23.97% pgdat_reclaimable_pages
- 21.66% pgdat_reclaimable_pages
- 1.69% _find_next_bit.part.0
0.63% find_next_bit
- 3.70% pgdat_reclaimable
0.29% _find_next_bit.part.0
- 16.33% zone_balanced
- zone_watermark_ok_safe
- 14.86% zone_watermark_ok_safe
1.15% _find_next_bit.part.0
0.31% find_next_bit
- 2.72% prepare_kswapd_sleep
- zone_balanced
- zone_watermark_ok_safe
zone_watermark_ok_safe
- 80.72% 10.51% [kernel] [k] pgdat_reclaimable
- 140.49% pgdat_reclaimable
- 138.40% pgdat_reclaimable_pages
- 125.10% pgdat_reclaimable_pages
- 9.71% _find_next_bit.part.0
- 3.59% find_next_bit
1.64% _find_next_bit.part.0
+ 0.44% find_next_bit
- 21.03% ret_from_kernel_thread
kthread
- kswapd
- 16.75% shrink_node
shrink_node_memcg
- 4.28% pgdat_reclaimable
pgdat_reclaimable
- 69.17% 62.48% [kernel] [k] pgdat_reclaimable_pages
- 145.91% ret_from_kernel_thread
kthread
- 15.61% pgdat_reclaimable_pages
- 11.33% _find_next_bit.part.0
- 4.19% find_next_bit
- 66.18% 0.01% [kernel] [k] shrink_node
- shrink_node
- 157.54% shrink_node_memcg
- pgdat_reclaimable
- 134.94% pgdat_reclaimable_pages
- 121.99% pgdat_reclaimable_pages
- 9.46% _find_next_bit.part.0
- 3.49% find_next_bit
- 20.44% pgdat_reclaimable
1.59% _find_next_bit.part.0
+ 0.42% find_next_bit
- 3.37% shrink_slab
- super_cache_count
- 0
- __list_lru_count_one.isra.1
_raw_spin_lock
- 64.56% 0.03% [kernel] [k] shrink_node_memcg
- shrink_node_memcg
- pgdat_reclaimable
- 138.31% pgdat_reclaimable_pages
- 125.04% pgdat_reclaimable_pages
- 9.69% _find_next_bit.part.0
- 3.58% find_next_bit
- 20.95% pgdat_reclaimable
1.63% _find_next_bit.part.0
+ 0.43% find_next_bit
53.73% 0.00% [kernel] [k] kthread
53.73% 0.00% [kernel] [k] ret_from_kernel_thread
- 11.04% 10.04% [kernel] [k] zone_watermark_ok_safe
- 146.80% ret_from_kernel_thread
kthread
- kswapd
- 125.81% zone_balanced
zone_watermark_ok_safe
- 20.97% prepare_kswapd_sleep
zone_balanced
zone_watermark_ok_safe
- 14.55% zone_watermark_ok_safe
11.38% _find_next_bit.part.0
3.06% find_next_bit
- 11.03% 0.00% [kernel] [k] zone_balanced
- zone_balanced
- zone_watermark_ok_safe
145.84% zone_watermark_ok_safe
11.31% _find_next_bit.part.0
3.04% find_next_bit
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
2016-08-29 9:38 ` [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes Srikar Dronamraju
@ 2016-08-30 12:07 ` Mel Gorman
2016-08-30 14:25 ` Srikar Dronamraju
0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2016-08-30 12:07 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
linuxppc-dev, Mahesh Salgaonkar, Hari Bathini
On Mon, Aug 29, 2016 at 03:08:44PM +0530, Srikar Dronamraju wrote:
> > Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
> > thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
> > patch gets rid of many of the node-based versus zone-based decisions.
> >
> > o A node is considered balanced when any eligible lower zone is balanced.
> > This eliminates one class of age-inversion problem because we avoid
> > reclaiming a newer page just because it's in the wrong zone
> > o pgdat_balanced disappears because we now only care about one zone being
> > balanced.
> > o Some anomalies related to writeback and congestion tracking being based on
> > zones disappear.
> > o kswapd no longer has to take care to reclaim zones in the reverse order
> > that the page allocator uses.
> > o Most importantly of all, reclaim from node 0 with multiple zones will
> > have similar aging and reclaiming characteristics as every
> > other node.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> This patch seems to hurt FA_DUMP functionality. This behaviour is not
> seen on v4.7 but only after this patch.
>
> So when a kernel runs on a multinode machine where memblock_reserve() has
> been used such that most of the nodes have zero available memory, kswapd
> ends up consuming 100% CPU time.
>
Why is FA_DUMP specifically the trigger? If the nodes have zero available
memory then is the zone_populated() check failing when FA_DUMP is enabled? If
so, that would both allow kswapd to wake and stay awake.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
2016-08-30 12:07 ` Mel Gorman
@ 2016-08-30 14:25 ` Srikar Dronamraju
2016-08-30 15:00 ` Mel Gorman
0 siblings, 1 reply; 9+ messages in thread
From: Srikar Dronamraju @ 2016-08-30 14:25 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
linuxppc-dev, Mahesh Salgaonkar, Hari Bathini
> >
> > This patch seems to hurt FA_DUMP functionality. This behaviour is not
> > seen on v4.7 but only after this patch.
> >
> > So when a kernel runs on a multinode machine where memblock_reserve() has
> > been used such that most of the nodes have zero available memory, kswapd
> > ends up consuming 100% CPU time.
> >
>
> Why is FA_DUMP specifically the trigger? If the nodes have zero available
> memory then is the zone_populated() check failing when FA_DUMP is enabled? If
> so, that would both allow kswapd to wake and stay awake.
>
The trigger is memblock_reserve() on the complete memory of a node, and
this is exactly what FA_DUMP does. The node has memory, but it is all
reserved, so there is no free memory in the node.
Did you mean populated_zone() when you said zone_populated, or am I
mistaken? populated_zone() does return 1, since it checks
zone->present_pages.
Here is the relevant portion of the dmesg log from boot:
ppc64_pft_size = 0x26
phys_mem_size = 0x1e4600000000
dcache_bsize = 0x80
icache_bsize = 0x80
cpu_features = 0x27fc7aec18500249
possible = 0x3fffffff18500649
always = 0x0000000018100040
cpu_user_features = 0xdc0065c2 0xef000000
mmu_features = 0x7c000001
firmware_features = 0x00000003c45bfc57
htab_hash_mask = 0x7fffffff
-----------------------------------------------------
Node 0 Memory: 0x0-0x1fb50000000
Node 1 Memory: 0x1fb50000000-0x3fa90000000
Node 2 Memory: 0x3fa90000000-0x5f9b0000000
Node 3 Memory: 0x5f9b0000000-0x76850000000
Node 4 Memory: 0x76850000000-0x95020000000
Node 5 Memory: 0x95020000000-0xb37f0000000
Node 6 Memory: 0xb37f0000000-0xd1fc0000000
Node 7 Memory: 0xd1fc0000000-0xf0790000000
Node 8 Memory: 0xf0790000000-0x10ef60000000
Node 9 Memory: 0x10ef60000000-0x12d730000000
Node 10 Memory: 0x12d730000000-0x14bf00000000
Node 11 Memory: 0x14bf00000000-0x16a6d0000000
Node 12 Memory: 0x16a6d0000000-0x188ea0000000
Node 13 Memory: 0x188ea0000000-0x1a7660000000
Node 14 Memory: 0x1a7660000000-0x1c5e30000000
Node 15 Memory: 0x1c5e30000000-0x1e4600000000
numa: Initmem setup node 0 [mem 0x00000000-0x1fb4fffffff]
numa: NODE_DATA [mem 0x1837fe23680-0x1837fe2d37f]
numa: Initmem setup node 1 [mem 0x1fb50000000-0x3fa8fffffff]
numa: NODE_DATA [mem 0x1837fa19980-0x1837fa2367f]
numa: NODE_DATA(1) on node 0
numa: Initmem setup node 2 [mem 0x3fa90000000-0x5f9afffffff]
numa: NODE_DATA [mem 0x1837f60fc80-0x1837f61997f]
numa: NODE_DATA(2) on node 0
numa: Initmem setup node 3 [mem 0x5f9b0000000-0x7684fffffff]
numa: NODE_DATA [mem 0x1837f205f80-0x1837f20fc7f]
numa: NODE_DATA(3) on node 0
numa: Initmem setup node 4 [mem 0x76850000000-0x9501fffffff]
numa: NODE_DATA [mem 0x1837ef1c280-0x1837ef25f7f]
numa: NODE_DATA(4) on node 0
numa: Initmem setup node 5 [mem 0x95020000000-0xb37efffffff]
numa: NODE_DATA [mem 0x1837eb42580-0x1837eb4c27f]
numa: NODE_DATA(5) on node 0
numa: Initmem setup node 6 [mem 0xb37f0000000-0xd1fbfffffff]
numa: NODE_DATA [mem 0x1837e778880-0x1837e78257f]
numa: NODE_DATA(6) on node 0
numa: Initmem setup node 7 [mem 0xd1fc0000000-0xf078fffffff]
numa: NODE_DATA [mem 0x1837e39eb80-0x1837e3a887f]
numa: NODE_DATA(7) on node 0
numa: Initmem setup node 8 [mem 0xf0790000000-0x10ef5fffffff]
numa: NODE_DATA [mem 0x1837dfc4e80-0x1837dfceb7f]
numa: NODE_DATA(8) on node 0
numa: Initmem setup node 9 [mem 0x10ef60000000-0x12d72fffffff]
numa: NODE_DATA [mem 0x1837dbeb180-0x1837dbf4e7f]
numa: NODE_DATA(9) on node 0
numa: Initmem setup node 10 [mem 0x12d730000000-0x14beffffffff]
numa: NODE_DATA [mem 0x1837d811480-0x1837d81b17f]
numa: NODE_DATA(10) on node 0
numa: Initmem setup node 11 [mem 0x14bf00000000-0x16a6cfffffff]
numa: NODE_DATA [mem 0x1837d437780-0x1837d44147f]
numa: NODE_DATA(11) on node 0
numa: Initmem setup node 12 [mem 0x16a6d0000000-0x188e9fffffff]
numa: NODE_DATA [mem 0x1837d05da80-0x1837d06777f]
numa: NODE_DATA(12) on node 0
numa: Initmem setup node 13 [mem 0x188ea0000000-0x1a765fffffff]
numa: NODE_DATA [mem 0x1837cc83d80-0x1837cc8da7f]
numa: NODE_DATA(13) on node 0
numa: Initmem setup node 14 [mem 0x1a7660000000-0x1c5e2fffffff]
numa: NODE_DATA [mem 0x1837c8aa080-0x1837c8b3d7f]
numa: NODE_DATA(14) on node 0
numa: Initmem setup node 15 [mem 0x1c5e30000000-0x1e45ffffffff]
numa: NODE_DATA [mem 0x1837c4d0380-0x1837c4da07f]
numa: NODE_DATA(15) on node 0
Section 99194 and 99199 (node 0) have a circular dependency on usemap and pgdat allocations
node 1 must be removed before remove section 99193
node 1 must be removed before remove section 99194
node 2 must be removed before remove section 99193
node 4 must be removed before remove section 99193
node 8 must be removed before remove section 99193
node 13 must be removed before remove section 99193
PCI host bridge /pci@800000020000032 ranges:
MEM 0x00003fd480000000..0x00003fd4feffffff -> 0x0000000080000000
MEM 0x0000329000000000..0x0000329fffffffff -> 0x0003d29000000000
PCI host bridge /pci@800000020000164 ranges:
MEM 0x00003fc2e0000000..0x00003fc2efffffff -> 0x00000000e0000000
MEM 0x0000305800000000..0x0000305bffffffff -> 0x0003d05800000000
PPC64 nvram contains 15360 bytes
Top of RAM: 0x1e4600000000, Total RAM: 0x1e4600000000
Memory hole size: 0MB
Zone ranges:
DMA [mem 0x0000000000000000-0x00001e45ffffffff]
DMA32 empty
Normal empty
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x0000000000000000-0x000001fb4fffffff]
node 1: [mem 0x000001fb50000000-0x000003fa8fffffff]
node 2: [mem 0x000003fa90000000-0x000005f9afffffff]
node 3: [mem 0x000005f9b0000000-0x000007684fffffff]
node 4: [mem 0x0000076850000000-0x000009501fffffff]
node 5: [mem 0x0000095020000000-0x00000b37efffffff]
node 6: [mem 0x00000b37f0000000-0x00000d1fbfffffff]
node 7: [mem 0x00000d1fc0000000-0x00000f078fffffff]
node 8: [mem 0x00000f0790000000-0x000010ef5fffffff]
node 9: [mem 0x000010ef60000000-0x000012d72fffffff]
node 10: [mem 0x000012d730000000-0x000014beffffffff]
node 11: [mem 0x000014bf00000000-0x000016a6cfffffff]
node 12: [mem 0x000016a6d0000000-0x0000188e9fffffff]
node 13: [mem 0x0000188ea0000000-0x00001a765fffffff]
node 14: [mem 0x00001a7660000000-0x00001c5e2fffffff]
node 15: [mem 0x00001c5e30000000-0x00001e45ffffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x000001fb4fffffff]
On node 0 totalpages: 33247232
DMA zone: 32468 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 33247232 pages, LIFO batch:1
Initmem setup node 1 [mem 0x000001fb50000000-0x000003fa8fffffff]
On node 1 totalpages: 33505280
DMA zone: 32720 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 33505280 pages, LIFO batch:1
Initmem setup node 2 [mem 0x000003fa90000000-0x000005f9afffffff]
On node 2 totalpages: 33497088
DMA zone: 32712 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 33497088 pages, LIFO batch:1
Initmem setup node 3 [mem 0x000005f9b0000000-0x000007684fffffff]
On node 3 totalpages: 24027136
DMA zone: 23464 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 24027136 pages, LIFO batch:1
Initmem setup node 4 [mem 0x0000076850000000-0x000009501fffffff]
On node 4 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 5 [mem 0x0000095020000000-0x00000b37efffffff]
On node 5 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 6 [mem 0x00000b37f0000000-0x00000d1fbfffffff]
On node 6 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 7 [mem 0x00000d1fc0000000-0x00000f078fffffff]
On node 7 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 8 [mem 0x00000f0790000000-0x000010ef5fffffff]
On node 8 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 9 [mem 0x000010ef60000000-0x000012d72fffffff]
On node 9 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 10 [mem 0x000012d730000000-0x000014beffffffff]
On node 10 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 11 [mem 0x000014bf00000000-0x000016a6cfffffff]
On node 11 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 12 [mem 0x000016a6d0000000-0x0000188e9fffffff]
On node 12 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 13 [mem 0x0000188ea0000000-0x00001a765fffffff]
On node 13 totalpages: 31965184
DMA zone: 31216 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31965184 pages, LIFO batch:1
Initmem setup node 14 [mem 0x00001a7660000000-0x00001c5e2fffffff]
On node 14 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 15 [mem 0x00001c5e30000000-0x00001e45ffffffff]
On node 15 totalpages: 31969280
DMA zone: 31220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 31969280 pages, LIFO batch:1
--
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
2016-08-30 14:25 ` Srikar Dronamraju
@ 2016-08-30 15:00 ` Mel Gorman
2016-08-31 6:09 ` Srikar Dronamraju
0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2016-08-30 15:00 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
linuxppc-dev, Mahesh Salgaonkar, Hari Bathini
On Tue, Aug 30, 2016 at 07:55:08PM +0530, Srikar Dronamraju wrote:
> > >
> > > This patch seems to hurt FA_DUMP functionality. This behaviour is not
> > > seen on v4.7 but only after this patch.
> > >
> > > So when a kernel runs on a multinode machine where memblock_reserve() has
> > > been used such that most of the nodes have zero available memory, kswapd
> > > ends up consuming 100% CPU time.
> > >
> >
> > Why is FA_DUMP specifically the trigger? If the nodes have zero available
> > memory then is the zone_populated() check failing when FA_DUMP is enabled? If
> > so, that would both allow kswapd to wake and stay awake.
> >
>
> The trigger is memblock_reserve() for the complete node memory. And
> this is exactly what FA_DUMP does. Here again the node has memory but
> its all reserved so there is no free memory in the node.
>
> Did you mean populated_zone() when you said zone_populated or have I
> mistaken? populated_zone() does return 1 since it checks for
> zone->present_pages.
>
Yes, I meant populated_zone(). Using present pages may have hidden a
long-lived corner case, as it was unexpected that an entire node would be
reserved. The old code *probably* happened to survive because
pgdat_reclaimable would look false and the kswapd checks for the pgdat
being balanced would happen to do the right thing in this case.
Can you check if something like this works?
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..cf64a5456cf6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -830,7 +830,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
static inline int populated_zone(struct zone *zone)
{
- return (!!zone->present_pages);
+ return (!!zone->managed_pages);
}
extern int movable_zone;
--
Mel Gorman
SUSE Labs
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
2016-08-30 15:00 ` Mel Gorman
@ 2016-08-31 6:09 ` Srikar Dronamraju
2016-08-31 8:49 ` Mel Gorman
0 siblings, 1 reply; 9+ messages in thread
From: Srikar Dronamraju @ 2016-08-31 6:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
linuxppc-dev, Mahesh Salgaonkar, Hari Bathini
> > The trigger is memblock_reserve() on the complete memory of a node, and
> > this is exactly what FA_DUMP does. The node has memory, but it is all
> > reserved, so there is no free memory in the node.
> >
> > Did you mean populated_zone() when you said zone_populated, or am I
> > mistaken? populated_zone() does return 1, since it checks
> > zone->present_pages.
> >
>
> Yes, I meant populated_zone(). Using present pages may have hidden a
> long-lived corner case, as it was unexpected that an entire node would be
> reserved. The old code *probably* happened to survive because
> pgdat_reclaimable would look false and the kswapd checks for the pgdat
> being balanced would happen to do the right thing in this case.
>
> Can you check if something like this works?
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d572b78b65e1..cf64a5456cf6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -830,7 +830,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
>
> static inline int populated_zone(struct zone *zone)
> {
> - return (!!zone->present_pages);
> + return (!!zone->managed_pages);
> }
>
> extern int movable_zone;
>
This indeed fixes the problem.
Please add my
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
2016-08-31 6:09 ` Srikar Dronamraju
@ 2016-08-31 8:49 ` Mel Gorman
2016-08-31 11:09 ` Michal Hocko
2016-08-31 17:33 ` Srikar Dronamraju
0 siblings, 2 replies; 9+ messages in thread
From: Mel Gorman @ 2016-08-31 8:49 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
linuxppc-dev, Mahesh Salgaonkar, Hari Bathini
On Wed, Aug 31, 2016 at 11:39:59AM +0530, Srikar Dronamraju wrote:
> This indeed fixes the problem.
> Please add my
> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>
Ok, thanks. Unfortunately we cannot do a wide conversion like this
because some users of populated_zone() really meant to check for
present_pages. In all cases, the expectation was that reserved pages
would be tiny but fadump messes that up. Can you verify this also works
please?
---8<---
mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator
Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
of memory when booting a secondary kernel. Srikar Dronamraju reported that
multiple nodes may have no memory managed by the buddy allocator but still
return true for populated_zone().
Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
was reported to cause kswapd to spin at 100% CPU usage when fadump was
enabled. The old code happened to deal with the situation of a populated
node with zero free pages by coincidence, but the current code tries to
reclaim populated zones without realising that is impossible.
We cannot just convert populated_zone() as many existing users really
need to check for present_pages. This patch introduces a managed_zone()
helper and uses it in the few cases where it is critical that the check
is made for managed pages -- zonelist construction and page reclaim.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
include/linux/mmzone.h | 11 +++++++++--
mm/page_alloc.c | 4 ++--
mm/vmscan.c | 22 +++++++++++-----------
3 files changed, 22 insertions(+), 15 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..69f886b79656 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -828,9 +828,16 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
*/
#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
-static inline int populated_zone(struct zone *zone)
+/* Returns true if a zone has pages managed by the buddy allocator */
+static inline bool managed_zone(struct zone *zone)
{
- return (!!zone->present_pages);
+ return zone->managed_pages;
+}
+
+/* Returns true if a zone has memory */
+static inline bool populated_zone(struct zone *zone)
+{
+ return zone->present_pages;
}
extern int movable_zone;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c09d9f7f692..ea7558149ee5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4405,7 +4405,7 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
do {
zone_type--;
zone = pgdat->node_zones + zone_type;
- if (populated_zone(zone)) {
+ if (managed_zone(zone)) {
zoneref_set_zone(zone,
&zonelist->_zonerefs[nr_zones++]);
check_highest_zone(zone_type);
@@ -4643,7 +4643,7 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
for (j = 0; j < nr_nodes; j++) {
node = node_order[j];
z = &NODE_DATA(node)->node_zones[zone_type];
- if (populated_zone(z)) {
+ if (managed_zone(z)) {
zoneref_set_zone(z,
&zonelist->_zonerefs[pos++]);
check_highest_zone(zone_type);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 98774f45b04a..55943a284082 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1665,7 +1665,7 @@ static bool inactive_reclaimable_pages(struct lruvec *lruvec,
for (zid = sc->reclaim_idx; zid >= 0; zid--) {
zone = &pgdat->node_zones[zid];
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
continue;
if (zone_page_state_snapshot(zone, NR_ZONE_LRU_BASE +
@@ -2036,7 +2036,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
struct zone *zone = &pgdat->node_zones[zid];
unsigned long inactive_zone, active_zone;
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
continue;
inactive_zone = zone_page_state(zone,
@@ -2171,7 +2171,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
for (z = 0; z < MAX_NR_ZONES; z++) {
struct zone *zone = &pgdat->node_zones[z];
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
continue;
total_high_wmark += high_wmark_pages(zone);
@@ -2508,7 +2508,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
/* If compaction would go ahead or the allocation would succeed, stop */
for (z = 0; z <= sc->reclaim_idx; z++) {
struct zone *zone = &pgdat->node_zones[z];
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
continue;
switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
@@ -2835,7 +2835,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
for (i = 0; i <= ZONE_NORMAL; i++) {
zone = &pgdat->node_zones[i];
- if (!populated_zone(zone) ||
+ if (!managed_zone(zone) ||
pgdat_reclaimable_pages(pgdat) == 0)
continue;
@@ -3136,7 +3136,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
for (i = 0; i <= classzone_idx; i++) {
struct zone *zone = pgdat->node_zones + i;
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
continue;
if (!zone_balanced(zone, order, classzone_idx))
@@ -3164,7 +3164,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
sc->nr_to_reclaim = 0;
for (z = 0; z <= sc->reclaim_idx; z++) {
zone = pgdat->node_zones + z;
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
continue;
sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
@@ -3237,7 +3237,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
if (buffer_heads_over_limit) {
for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
zone = pgdat->node_zones + i;
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
continue;
sc.reclaim_idx = i;
@@ -3257,7 +3257,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
*/
for (i = classzone_idx; i >= 0; i--) {
zone = pgdat->node_zones + i;
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
continue;
if (zone_balanced(zone, sc.order, classzone_idx))
@@ -3503,7 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
pg_data_t *pgdat;
int z;
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
return;
if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
@@ -3517,7 +3517,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
/* Only wake kswapd if all zones are unbalanced */
for (z = 0; z <= classzone_idx; z++) {
zone = pgdat->node_zones + z;
- if (!populated_zone(zone))
+ if (!managed_zone(zone))
continue;
if (zone_balanced(zone, order, classzone_idx))
--
Mel Gorman
SUSE Labs
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
2016-08-31 8:49 ` Mel Gorman
@ 2016-08-31 11:09 ` Michal Hocko
2016-08-31 12:46 ` Mel Gorman
2016-08-31 17:33 ` Srikar Dronamraju
1 sibling, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2016-08-31 11:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Srikar Dronamraju, Andrew Morton, Linux-MM, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini
On Wed 31-08-16 09:49:42, Mel Gorman wrote:
> On Wed, Aug 31, 2016 at 11:39:59AM +0530, Srikar Dronamraju wrote:
> > This indeed fixes the problem.
> > Please add my
> > Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> >
>
> Ok, thanks. Unfortunately we cannot do a wide conversion like this
> because some users of populated_zone() really meant to check for
> present_pages. In all cases, the expectation was that reserved pages
> would be tiny but fadump messes that up. Can you verify this also works
> please?
>
> ---8<---
> mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator
>
> Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
> of memory when booting a secondary kernel. Srikar Dronamraju reported that
> multiple nodes may have no memory managed by the buddy allocator but still
> return true for populated_zone().
>
> Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> was reported to cause kswapd to spin at 100% CPU usage when fadump was
> enabled. The old code happened to deal with the situation of a populated
> node with zero free pages by coincidence but the current code tries to
> reclaim populated zones without realising that is impossible.
>
> We cannot just convert populated_zone() as many existing users really
> need to check for present_pages. This patch introduces a managed_zone()
> helper and uses it in the few cases where it is critical that the check
> is made for managed pages -- zonelist constuction and page reclaim.
OK, the patch makes sense to me. I am not happy about two very similar
functions, to be honest though. managed vs. present checks will be quite
subtle and it is not entirely clear when to use which one. I agree that
the reclaim path is the most critical one so the patch seems OK to me.
At least from a quick glance it should help with the reported issue so
feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>
I expect we might want to convert other places as well, but they are far
from critical. I would appreciate some lead there, and would stick in a
clarifying comment
[...]
> -static inline int populated_zone(struct zone *zone)
> +/* Returns true if a zone has pages managed by the buddy allocator */
/*
* Returns true if a zone has pages managed by the buddy allocator.
* All the reclaim decisions have to use this function rather than
* populated_zone(). If the whole zone is reserved then we can easily
* end up with populated_zone() && !managed_zone().
*/
What do you think?
> +static inline bool managed_zone(struct zone *zone)
> {
> - return (!!zone->present_pages);
> + return zone->managed_pages;
> +}
> +
> +/* Returns true if a zone has memory */
> +static inline bool populated_zone(struct zone *zone)
> +{
> + return zone->present_pages;
> }
--
Michal Hocko
SUSE Labs
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
2016-08-31 11:09 ` Michal Hocko
@ 2016-08-31 12:46 ` Mel Gorman
0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2016-08-31 12:46 UTC (permalink / raw)
To: Michal Hocko
Cc: Srikar Dronamraju, Andrew Morton, Linux-MM, Rik van Riel,
Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini
On Wed, Aug 31, 2016 at 01:09:33PM +0200, Michal Hocko wrote:
> > We cannot just convert populated_zone() as many existing users really
> > need to check for present_pages. This patch introduces a managed_zone()
> > helper and uses it in the few cases where it is critical that the check
> > is made for managed pages -- zonelist constuction and page reclaim.
>
> OK, the patch makes sense to me. I am not happy about two very similar
> functions, to be honest though. managed vs. present checks will be quite
> subtle and it is not entirely clear when to use which one.
In the vast majority of cases, the distinction is irrelevant. The patch
only updates the places where it really matters to minimise any
confusion.
> Acked-by: Michal Hocko <mhocko@suse.com>
Thanks.
> /*
> * Returns true if a zone has pages managed by the buddy allocator.
> * All the reclaim decisions have to use this function rather than
> * populated_zone(). If the whole zone is reserved then we can easily
> * end up with populated_zone() && !managed_zone().
> */
>
> What do you think?
>
This makes a lot of sense. I've updated the patch and will await a test
from Srikar before reposting.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
2016-08-31 8:49 ` Mel Gorman
2016-08-31 11:09 ` Michal Hocko
@ 2016-08-31 17:33 ` Srikar Dronamraju
1 sibling, 0 replies; 9+ messages in thread
From: Srikar Dronamraju @ 2016-08-31 17:33 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
linuxppc-dev, Mahesh Salgaonkar, Hari Bathini
> mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator
>
> Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
> of memory when booting a secondary kernel. Srikar Dronamraju reported that
> multiple nodes may have no memory managed by the buddy allocator but still
> return true for populated_zone().
>
> Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> was reported to cause kswapd to spin at 100% CPU usage when fadump was
> enabled. The old code happened to deal with the situation of a populated
> node with zero free pages by coincidence but the current code tries to
> reclaim populated zones without realising that is impossible.
>
> We cannot just convert populated_zone() as many existing users really
> need to check for present_pages. This patch introduces a managed_zone()
> helper and uses it in the few cases where it is critical that the check
> is made for managed pages -- zonelist constuction and page reclaim.
one nit
s/constuction/construction/
>
Verified that it works fine.
--
Thanks and Regards
Srikar Dronamraju
end of thread, other threads:[~2016-08-31 17:33 UTC | newest]
Thread overview: 9+ messages
[not found] <1467970510-21195-1-git-send-email-mgorman@techsingularity.net>
[not found] ` <1467970510-21195-8-git-send-email-mgorman@techsingularity.net>
2016-08-29 9:38 ` [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes Srikar Dronamraju
2016-08-30 12:07 ` Mel Gorman
2016-08-30 14:25 ` Srikar Dronamraju
2016-08-30 15:00 ` Mel Gorman
2016-08-31 6:09 ` Srikar Dronamraju
2016-08-31 8:49 ` Mel Gorman
2016-08-31 11:09 ` Michal Hocko
2016-08-31 12:46 ` Mel Gorman
2016-08-31 17:33 ` Srikar Dronamraju