* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  [not found] ` <1467970510-21195-8-git-send-email-mgorman@techsingularity.net>
@ 2016-08-29  9:38 ` Srikar Dronamraju
  2016-08-30 12:07   ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Srikar Dronamraju @ 2016-08-29  9:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
	linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
> thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
> patch gets rid of many of the node-based versus zone-based decisions.
>
> o A node is considered balanced when any eligible lower zone is balanced.
>   This eliminates one class of age-inversion problem because we avoid
>   reclaiming a newer page just because it's in the wrong zone
> o pgdat_balanced disappears because we now only care about one zone being
>   balanced.
> o Some anomalies related to writeback and congestion tracking being based on
>   zones disappear.
> o kswapd no longer has to take care to reclaim zones in the reverse order
>   that the page allocator uses.
> o Most importantly of all, reclaim from node 0 with multiple zones will
>   have similar aging and reclaiming characteristics as every
>   other node.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

This patch seems to hurt FA_DUMP functionality. This behaviour is not
seen on v4.7 but only after this patch.

When a kernel boots on a multinode machine with memblock_reserve() calls
such that most of the nodes have zero available memory, the kswapd
threads consume 100% CPU time. This is independent of
CONFIG_DEFERRED_STRUCT_PAGE_INIT, i.e. the problem is seen even with
parallel page-struct initialization disabled.
top - 13:48:52 up  1:07,  3 users,  load average: 15.25, 15.32, 21.18
Tasks: 11080 total,  16 running, 11064 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  2.7 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  15929941+total,  8637824 used, 15843563+free,     2304 buffers
KiB Swap: 91898816 total,        0 used, 91898816 free.  1381312 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
 10824 root      20   0       0      0      0 R 100.000 0.000  65:30.76 kswapd2
 10837 root      20   0       0      0      0 R 100.000 0.000  65:31.17 kswapd15
 10823 root      20   0       0      0      0 R  97.059 0.000  65:30.85 kswapd1
 10825 root      20   0       0      0      0 R  97.059 0.000  65:31.10 kswapd3
 10826 root      20   0       0      0      0 R  97.059 0.000  65:31.18 kswapd4
 10827 root      20   0       0      0      0 R  97.059 0.000  65:31.08 kswapd5
 10828 root      20   0       0      0      0 R  97.059 0.000  65:30.91 kswapd6
 10829 root      20   0       0      0      0 R  97.059 0.000  65:31.17 kswapd7
 10830 root      20   0       0      0      0 R  97.059 0.000  65:31.17 kswapd8
 10831 root      20   0       0      0      0 R  97.059 0.000  65:31.18 kswapd9
 10832 root      20   0       0      0      0 R  97.059 0.000  65:31.12 kswapd10
 10833 root      20   0       0      0      0 R  97.059 0.000  65:31.19 kswapd11
 10834 root      20   0       0      0      0 R  97.059 0.000  65:31.13 kswapd12
 10835 root      20   0       0      0      0 R  97.059 0.000  65:31.09 kswapd13
 10836 root      20   0       0      0      0 R  97.059 0.000  65:31.18 kswapd14
277155 srikar    20   0   16960  13760   3264 R  52.941 0.001   0:00.37 top

top - 13:48:55 up  1:07,  3 users,  load average: 15.23, 15.32, 21.15
Tasks: 11080 total,  16 running, 11064 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  15929941+total,  8637824 used, 15843563+free,     2304 buffers
KiB Swap: 91898816 total,        0 used, 91898816 free.  1381312 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S    %CPU  %MEM     TIME+ COMMAND
 10836 root      20   0       0      0      0 R 100.000 0.000  65:33.39 kswapd14
 10823 root      20   0       0      0      0 R 100.000 0.000  65:33.05 kswapd1
 10824 root      20   0       0      0      0 R 100.000 0.000  65:32.96 kswapd2
 10825 root      20   0       0      0      0 R 100.000 0.000  65:33.30 kswapd3
 10826 root      20   0       0      0      0 R 100.000 0.000  65:33.38 kswapd4
 10827 root      20   0       0      0      0 R 100.000 0.000  65:33.28 kswapd5
 10828 root      20   0       0      0      0 R 100.000 0.000  65:33.11 kswapd6
 10829 root      20   0       0      0      0 R 100.000 0.000  65:33.37 kswapd7
 10830 root      20   0       0      0      0 R 100.000 0.000  65:33.37 kswapd8
 10831 root      20   0       0      0      0 R 100.000 0.000  65:33.38 kswapd9
 10832 root      20   0       0      0      0 R 100.000 0.000  65:33.32 kswapd10
 10833 root      20   0       0      0      0 R 100.000 0.000  65:33.39 kswapd11
 10834 root      20   0       0      0      0 R 100.000 0.000  65:33.33 kswapd12
 10835 root      20   0       0      0      0 R 100.000 0.000  65:33.29 kswapd13
 10837 root      20   0       0      0      0 R 100.000 0.000  65:33.37 kswapd15
277155 srikar    20   0   17536  14912   3264 R   9.091 0.001   0:00.57 top
  1092 root      rt   0       0      0      0 S   0.455 0.000   0:00.08 watchdog/178

Note that there is no used swap space, yet 15 kswapd threads, corresponding
to 15 out of the 16 nodes, are running at full throttle. Only node 0 has
usable memory; the other nodes' memory is fully reserved.
I tried a git bisect between v4.7 and v4.8-rc3, filtered to mm/vmscan.c:

# bad: [d7f05528eedb047efe2288cff777676b028747b6] mm, vmscan: account for skipped pages as a partial scan
# good: [b1123ea6d3b3da25af5c8a9d843bd07ab63213f4] mm: balloon: use general non-lru movable page feature
git bisect start 'HEAD' 'b1123ea6' '--' 'mm/vmscan.c'
# bad: [c4a25635b60d08853a3e4eaae3ab34419a36cfa2] mm: move vmscan writes and file write accounting to the node
git bisect bad c4a25635b60d08853a3e4eaae3ab34419a36cfa2
# bad: [38087d9b0360987a6db46c2c2c4ece37cd048abe] mm, vmscan: simplify the logic deciding whether kswapd sleeps
git bisect bad 38087d9b0360987a6db46c2c2c4ece37cd048abe
# good: [b2e18757f2c9d1cdd746a882e9878852fdec9501] mm, vmscan: begin reclaiming pages on a per-node basis
git bisect good b2e18757f2c9d1cdd746a882e9878852fdec9501
# bad: [1d82de618ddde0f1164e640f79af152f01994c18] mm, vmscan: make kswapd reclaim in terms of nodes
git bisect bad 1d82de618ddde0f1164e640f79af152f01994c18
# good: [f7b60926ebc05944f73d93ffaf6690503b796a88] mm, vmscan: have kswapd only scan based on the highest requested zone
git bisect good f7b60926ebc05944f73d93ffaf6690503b796a88
# first bad commit: [1d82de618ddde0f1164e640f79af152f01994c18] mm, vmscan: make kswapd reclaim in terms of nodes

Here is the perf top output on the kernel where kswapd is hogging the CPU:
-   93.50%  0.01%  [kernel]  [k] kswapd
   - kswapd
      - 114.31% shrink_node
         - 111.51% shrink_node_memcg
            - pgdat_reclaimable
               - 95.51% pgdat_reclaimable_pages
                  - 86.34% pgdat_reclaimable_pages
                     - 6.69% _find_next_bit.part.0
                     - 2.47% find_next_bit
               - 14.46% pgdat_reclaimable
                    1.13% _find_next_bit.part.0
                  + 0.30% find_next_bit
         - 2.38% shrink_slab
            - super_cache_count
               - 0
                  - __list_lru_count_one.isra.1
                       _raw_spin_lock
      - 28.04% pgdat_reclaimable
         - 23.97% pgdat_reclaimable_pages
            - 21.66% pgdat_reclaimable_pages
               - 1.69% _find_next_bit.part.0
                    0.63% find_next_bit
         - 3.70% pgdat_reclaimable
              0.29% _find_next_bit.part.0
      - 16.33% zone_balanced
         - zone_watermark_ok_safe
            - 14.86% zone_watermark_ok_safe
                 1.15% _find_next_bit.part.0
                 0.31% find_next_bit
      - 2.72% prepare_kswapd_sleep
         - zone_balanced
            - zone_watermark_ok_safe
                 zone_watermark_ok_safe
-   80.72% 10.51%  [kernel]  [k] pgdat_reclaimable
   - 140.49% pgdat_reclaimable
      - 138.40% pgdat_reclaimable_pages
         - 125.10% pgdat_reclaimable_pages
            - 9.71% _find_next_bit.part.0
            - 3.59% find_next_bit
           1.64% _find_next_bit.part.0
         + 0.44% find_next_bit
   - 21.03% ret_from_kernel_thread
        kthread
      - kswapd
         - 16.75% shrink_node
              shrink_node_memcg
         - 4.28% pgdat_reclaimable
              pgdat_reclaimable
-   69.17% 62.48%  [kernel]  [k] pgdat_reclaimable_pages
   - 145.91% ret_from_kernel_thread
        kthread
   - 15.61% pgdat_reclaimable_pages
      - 11.33% _find_next_bit.part.0
      - 4.19% find_next_bit
-   66.18%  0.01%  [kernel]  [k] shrink_node
   - shrink_node
      - 157.54% shrink_node_memcg
         - pgdat_reclaimable
            - 134.94% pgdat_reclaimable_pages
               - 121.99% pgdat_reclaimable_pages
                  - 9.46% _find_next_bit.part.0
                  - 3.49% find_next_bit
            - 20.44% pgdat_reclaimable
                 1.59% _find_next_bit.part.0
               + 0.42% find_next_bit
      - 3.37% shrink_slab
         - super_cache_count
            - 0
               - __list_lru_count_one.isra.1
                    _raw_spin_lock
-   64.56%  0.03%  [kernel]  [k] shrink_node_memcg
   - shrink_node_memcg
      - pgdat_reclaimable
         - 138.31% pgdat_reclaimable_pages
            - 125.04% pgdat_reclaimable_pages
               - 9.69% _find_next_bit.part.0
               - 3.58% find_next_bit
         - 20.95% pgdat_reclaimable
              1.63% _find_next_bit.part.0
            + 0.43% find_next_bit
    53.73%  0.00%  [kernel]  [k] kthread
    53.73%  0.00%  [kernel]  [k] ret_from_kernel_thread
-   11.04% 10.04%  [kernel]  [k] zone_watermark_ok_safe
   - 146.80% ret_from_kernel_thread
        kthread
      - kswapd
         - 125.81% zone_balanced
              zone_watermark_ok_safe
         - 20.97% prepare_kswapd_sleep
              zone_balanced
              zone_watermark_ok_safe
         - 14.55% zone_watermark_ok_safe
             11.38% _find_next_bit.part.0
              3.06% find_next_bit
-   11.03%  0.00%  [kernel]  [k] zone_balanced
   - zone_balanced
      - zone_watermark_ok_safe
           145.84% zone_watermark_ok_safe
            11.31% _find_next_bit.part.0
             3.04% find_next_bit

--
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-29  9:38 ` [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes Srikar Dronamraju
@ 2016-08-30 12:07   ` Mel Gorman
  2016-08-30 14:25     ` Srikar Dronamraju
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2016-08-30 12:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
	linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Mon, Aug 29, 2016 at 03:08:44PM +0530, Srikar Dronamraju wrote:
> > Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
> > thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
> > patch gets rid of many of the node-based versus zone-based decisions.
> >
> > o A node is considered balanced when any eligible lower zone is balanced.
> >   This eliminates one class of age-inversion problem because we avoid
> >   reclaiming a newer page just because it's in the wrong zone
> > o pgdat_balanced disappears because we now only care about one zone being
> >   balanced.
> > o Some anomalies related to writeback and congestion tracking being based on
> >   zones disappear.
> > o kswapd no longer has to take care to reclaim zones in the reverse order
> >   that the page allocator uses.
> > o Most importantly of all, reclaim from node 0 with multiple zones will
> >   have similar aging and reclaiming characteristics as every
> >   other node.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> This patch seems to hurt FA_DUMP functionality. This behaviour is not
> seen on v4.7 but only after this patch.
>
> So when a kernel on a multinode machine with memblock_reserve() such
> that most of the nodes have zero available memory, kswapd seems to be
> consuming 100% of the time.
Why is FA_DUMP specifically the trigger? If the nodes have zero available
memory then is the zone_populated() check failing when FA_DUMP is enabled?
If so, that would both allow kswapd to wake and stay awake.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-30 12:07   ` Mel Gorman
@ 2016-08-30 14:25     ` Srikar Dronamraju
  2016-08-30 15:00       ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Srikar Dronamraju @ 2016-08-30 14:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
	linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> > This patch seems to hurt FA_DUMP functionality. This behaviour is not
> > seen on v4.7 but only after this patch.
> >
> > So when a kernel on a multinode machine with memblock_reserve() such
> > that most of the nodes have zero available memory, kswapd seems to be
> > consuming 100% of the time.
>
> Why is FA_DUMP specifically the trigger? If the nodes have zero available
> memory then is the zone_populated() check failing when FA_DUMP is enabled?
> If so, that would both allow kswapd to wake and stay awake.

The trigger is memblock_reserve() for the complete node memory, and
this is exactly what FA_DUMP does. Here again the node has memory but
it's all reserved, so there is no free memory in the node.

Did you mean populated_zone() when you said zone_populated, or have I
mistaken? populated_zone() does return 1 since it checks
zone->present_pages.
Here is the relevant log from dmesg at boot:

ppc64_pft_size    = 0x26
phys_mem_size     = 0x1e4600000000
dcache_bsize      = 0x80
icache_bsize      = 0x80
cpu_features      = 0x27fc7aec18500249
  possible        = 0x3fffffff18500649
  always          = 0x0000000018100040
cpu_user_features = 0xdc0065c2 0xef000000
mmu_features      = 0x7c000001
firmware_features = 0x00000003c45bfc57
htab_hash_mask    = 0x7fffffff
-----------------------------------------------------
Node 0 Memory: 0x0-0x1fb50000000
Node 1 Memory: 0x1fb50000000-0x3fa90000000
Node 2 Memory: 0x3fa90000000-0x5f9b0000000
Node 3 Memory: 0x5f9b0000000-0x76850000000
Node 4 Memory: 0x76850000000-0x95020000000
Node 5 Memory: 0x95020000000-0xb37f0000000
Node 6 Memory: 0xb37f0000000-0xd1fc0000000
Node 7 Memory: 0xd1fc0000000-0xf0790000000
Node 8 Memory: 0xf0790000000-0x10ef60000000
Node 9 Memory: 0x10ef60000000-0x12d730000000
Node 10 Memory: 0x12d730000000-0x14bf00000000
Node 11 Memory: 0x14bf00000000-0x16a6d0000000
Node 12 Memory: 0x16a6d0000000-0x188ea0000000
Node 13 Memory: 0x188ea0000000-0x1a7660000000
Node 14 Memory: 0x1a7660000000-0x1c5e30000000
Node 15 Memory: 0x1c5e30000000-0x1e4600000000
numa: Initmem setup node 0 [mem 0x00000000-0x1fb4fffffff]
numa:   NODE_DATA [mem 0x1837fe23680-0x1837fe2d37f]
numa: Initmem setup node 1 [mem 0x1fb50000000-0x3fa8fffffff]
numa:   NODE_DATA [mem 0x1837fa19980-0x1837fa2367f]
numa:     NODE_DATA(1) on node 0
numa: Initmem setup node 2 [mem 0x3fa90000000-0x5f9afffffff]
numa:   NODE_DATA [mem 0x1837f60fc80-0x1837f61997f]
numa:     NODE_DATA(2) on node 0
numa: Initmem setup node 3 [mem 0x5f9b0000000-0x7684fffffff]
numa:   NODE_DATA [mem 0x1837f205f80-0x1837f20fc7f]
numa:     NODE_DATA(3) on node 0
numa: Initmem setup node 4 [mem 0x76850000000-0x9501fffffff]
numa:   NODE_DATA [mem 0x1837ef1c280-0x1837ef25f7f]
numa:     NODE_DATA(4) on node 0
numa: Initmem setup node 5 [mem 0x95020000000-0xb37efffffff]
numa:   NODE_DATA [mem 0x1837eb42580-0x1837eb4c27f]
numa:     NODE_DATA(5) on node 0
numa: Initmem setup node 6 [mem 0xb37f0000000-0xd1fbfffffff]
numa:   NODE_DATA [mem 0x1837e778880-0x1837e78257f]
numa:     NODE_DATA(6) on node 0
numa: Initmem setup node 7 [mem 0xd1fc0000000-0xf078fffffff]
numa:   NODE_DATA [mem 0x1837e39eb80-0x1837e3a887f]
numa:     NODE_DATA(7) on node 0
numa: Initmem setup node 8 [mem 0xf0790000000-0x10ef5fffffff]
numa:   NODE_DATA [mem 0x1837dfc4e80-0x1837dfceb7f]
numa:     NODE_DATA(8) on node 0
numa: Initmem setup node 9 [mem 0x10ef60000000-0x12d72fffffff]
numa:   NODE_DATA [mem 0x1837dbeb180-0x1837dbf4e7f]
numa:     NODE_DATA(9) on node 0
numa: Initmem setup node 10 [mem 0x12d730000000-0x14beffffffff]
numa:   NODE_DATA [mem 0x1837d811480-0x1837d81b17f]
numa:     NODE_DATA(10) on node 0
numa: Initmem setup node 11 [mem 0x14bf00000000-0x16a6cfffffff]
numa:   NODE_DATA [mem 0x1837d437780-0x1837d44147f]
numa:     NODE_DATA(11) on node 0
numa: Initmem setup node 12 [mem 0x16a6d0000000-0x188e9fffffff]
numa:   NODE_DATA [mem 0x1837d05da80-0x1837d06777f]
numa:     NODE_DATA(12) on node 0
numa: Initmem setup node 13 [mem 0x188ea0000000-0x1a765fffffff]
numa:   NODE_DATA [mem 0x1837cc83d80-0x1837cc8da7f]
numa:     NODE_DATA(13) on node 0
numa: Initmem setup node 14 [mem 0x1a7660000000-0x1c5e2fffffff]
numa:   NODE_DATA [mem 0x1837c8aa080-0x1837c8b3d7f]
numa:     NODE_DATA(14) on node 0
numa: Initmem setup node 15 [mem 0x1c5e30000000-0x1e45ffffffff]
numa:   NODE_DATA [mem 0x1837c4d0380-0x1837c4da07f]
numa:     NODE_DATA(15) on node 0
Section 99194 and 99199 (node 0) have a circular dependency on usemap and pgdat allocations
node 1 must be removed before remove section 99193
node 1 must be removed before remove section 99194
node 2 must be removed before remove section 99193
node 4 must be removed before remove section 99193
node 8 must be removed before remove section 99193
node 13 must be removed before remove section 99193
PCI host bridge /pci@800000020000032 ranges:
 MEM 0x00003fd480000000..0x00003fd4feffffff -> 0x0000000080000000
 MEM 0x0000329000000000..0x0000329fffffffff -> 0x0003d29000000000
PCI host bridge /pci@800000020000164 ranges:
 MEM 0x00003fc2e0000000..0x00003fc2efffffff -> 0x00000000e0000000
 MEM 0x0000305800000000..0x0000305bffffffff -> 0x0003d05800000000
PPC64 nvram contains 15360 bytes
Top of RAM: 0x1e4600000000, Total RAM: 0x1e4600000000
Memory hole size: 0MB
Zone ranges:
  DMA      [mem 0x0000000000000000-0x00001e45ffffffff]
  DMA32    empty
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000000000000-0x000001fb4fffffff]
  node   1: [mem 0x000001fb50000000-0x000003fa8fffffff]
  node   2: [mem 0x000003fa90000000-0x000005f9afffffff]
  node   3: [mem 0x000005f9b0000000-0x000007684fffffff]
  node   4: [mem 0x0000076850000000-0x000009501fffffff]
  node   5: [mem 0x0000095020000000-0x00000b37efffffff]
  node   6: [mem 0x00000b37f0000000-0x00000d1fbfffffff]
  node   7: [mem 0x00000d1fc0000000-0x00000f078fffffff]
  node   8: [mem 0x00000f0790000000-0x000010ef5fffffff]
  node   9: [mem 0x000010ef60000000-0x000012d72fffffff]
  node  10: [mem 0x000012d730000000-0x000014beffffffff]
  node  11: [mem 0x000014bf00000000-0x000016a6cfffffff]
  node  12: [mem 0x000016a6d0000000-0x0000188e9fffffff]
  node  13: [mem 0x0000188ea0000000-0x00001a765fffffff]
  node  14: [mem 0x00001a7660000000-0x00001c5e2fffffff]
  node  15: [mem 0x00001c5e30000000-0x00001e45ffffffff]
Initmem setup node 0 [mem 0x0000000000000000-0x000001fb4fffffff]
On node 0 totalpages: 33247232
  DMA zone: 32468 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 33247232 pages, LIFO batch:1
Initmem setup node 1 [mem 0x000001fb50000000-0x000003fa8fffffff]
On node 1 totalpages: 33505280
  DMA zone: 32720 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 33505280 pages, LIFO batch:1
Initmem setup node 2 [mem 0x000003fa90000000-0x000005f9afffffff]
On node 2 totalpages: 33497088
  DMA zone: 32712 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 33497088 pages, LIFO batch:1
Initmem setup node 3 [mem 0x000005f9b0000000-0x000007684fffffff]
On node 3 totalpages: 24027136
  DMA zone: 23464 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 24027136 pages, LIFO batch:1
Initmem setup node 4 [mem 0x0000076850000000-0x000009501fffffff]
On node 4 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 5 [mem 0x0000095020000000-0x00000b37efffffff]
On node 5 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 6 [mem 0x00000b37f0000000-0x00000d1fbfffffff]
On node 6 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 7 [mem 0x00000d1fc0000000-0x00000f078fffffff]
On node 7 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 8 [mem 0x00000f0790000000-0x000010ef5fffffff]
On node 8 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 9 [mem 0x000010ef60000000-0x000012d72fffffff]
On node 9 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 10 [mem 0x000012d730000000-0x000014beffffffff]
On node 10 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 11 [mem 0x000014bf00000000-0x000016a6cfffffff]
On node 11 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 12 [mem 0x000016a6d0000000-0x0000188e9fffffff]
On node 12 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 13 [mem 0x0000188ea0000000-0x00001a765fffffff]
On node 13 totalpages: 31965184
  DMA zone: 31216 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31965184 pages, LIFO batch:1
Initmem setup node 14 [mem 0x00001a7660000000-0x00001c5e2fffffff]
On node 14 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1
Initmem setup node 15 [mem 0x00001c5e30000000-0x00001e45ffffffff]
On node 15 totalpages: 31969280
  DMA zone: 31220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 31969280 pages, LIFO batch:1

--
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-30 14:25     ` Srikar Dronamraju
@ 2016-08-30 15:00       ` Mel Gorman
  2016-08-31  6:09         ` Srikar Dronamraju
  0 siblings, 1 reply; 9+ messages in thread
From: Mel Gorman @ 2016-08-30 15:00 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
	linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Tue, Aug 30, 2016 at 07:55:08PM +0530, Srikar Dronamraju wrote:
> > > This patch seems to hurt FA_DUMP functionality. This behaviour is not
> > > seen on v4.7 but only after this patch.
> > >
> > > So when a kernel on a multinode machine with memblock_reserve() such
> > > that most of the nodes have zero available memory, kswapd seems to be
> > > consuming 100% of the time.
> >
> > Why is FA_DUMP specifically the trigger? If the nodes have zero available
> > memory then is the zone_populated() check failing when FA_DUMP is enabled?
> > If so, that would both allow kswapd to wake and stay awake.
>
> The trigger is memblock_reserve() for the complete node memory. And
> this is exactly what FA_DUMP does. Here again the node has memory but
> its all reserved so there is no free memory in the node.
>
> Did you mean populated_zone() when you said zone_populated or have I
> mistaken? populated_zone() does return 1 since it checks for
> zone->present_pages.

Yes, I meant populated_zone(). Using present pages may have hidden a
long-lived corner case, as it was unexpected that an entire node would
be reserved. The old code happened to survive *probably* because
pgdat_reclaimable would look false and kswapd's checks for pgdat being
balanced would happen to do the right thing in this case.

Can you check if something like this works?
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..cf64a5456cf6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -830,7 +830,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
 
 static inline int populated_zone(struct zone *zone)
 {
-	return (!!zone->present_pages);
+	return (!!zone->managed_pages);
 }
 
 extern int movable_zone;

--
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-30 15:00       ` Mel Gorman
@ 2016-08-31  6:09         ` Srikar Dronamraju
  2016-08-31  8:49           ` Mel Gorman
  0 siblings, 1 reply; 9+ messages in thread
From: Srikar Dronamraju @ 2016-08-31  6:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
	linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> > The trigger is memblock_reserve() for the complete node memory. And
> > this is exactly what FA_DUMP does. Here again the node has memory but
> > its all reserved so there is no free memory in the node.
> >
> > Did you mean populated_zone() when you said zone_populated or have I
> > mistaken? populated_zone() does return 1 since it checks for
> > zone->present_pages.
>
> Yes, I meant populated_zone(). Using present pages may have hidden
> a long-lived corner case as it was unexpected that an entire node
> would be reserved. The old code happened to survive *probably* because
> pgdat_reclaimable would look false and kswapd checks for pgdat being
> balanced would happen to do the right thing in this case.
>
> Can you check if something like this works?
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d572b78b65e1..cf64a5456cf6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -830,7 +830,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
>
>  static inline int populated_zone(struct zone *zone)
>  {
> -	return (!!zone->present_pages);
> +	return (!!zone->managed_pages);
>  }
>
>  extern int movable_zone;

This indeed fixes the problem. Please add my
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-31  6:09         ` Srikar Dronamraju
@ 2016-08-31  8:49           ` Mel Gorman
  2016-08-31 11:09             ` Michal Hocko
  2016-08-31 17:33             ` Srikar Dronamraju
  0 siblings, 2 replies; 9+ messages in thread
From: Mel Gorman @ 2016-08-31  8:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
	linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Wed, Aug 31, 2016 at 11:39:59AM +0530, Srikar Dronamraju wrote:
> This indeed fixes the problem. Please add my
> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Ok, thanks. Unfortunately we cannot do a wide conversion like this
because some users of populated_zone() really meant to check for
present_pages. In all cases, the expectation was that reserved pages
would be tiny, but fadump messes that up. Can you verify this also
works please?

---8<---
mm, vmscan: Only allocate and reclaim from zones with pages managed by the buddy allocator

Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
of memory when booting a secondary kernel. Srikar Dronamraju reported that
multiple nodes may have no memory managed by the buddy allocator but still
return true for populated_zone().

Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
was reported to cause kswapd to spin at 100% CPU usage when fadump was
enabled. The old code happened to deal with the situation of a populated
node with zero free pages by coincidence, but the current code tries to
reclaim populated zones without realising that is impossible.

We cannot just convert populated_zone() as many existing users really
need to check for present_pages. This patch introduces a managed_zone()
helper and uses it in the few cases where it is critical that the check
is made for managed pages -- zonelist construction and page reclaim.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h | 11 +++++++++--
 mm/page_alloc.c        |  4 ++--
 mm/vmscan.c            | 22 +++++++++++-----------
 3 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d572b78b65e1..69f886b79656 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -828,9 +828,16 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
  */
 #define zone_idx(zone)		((zone) - (zone)->zone_pgdat->node_zones)
 
-static inline int populated_zone(struct zone *zone)
+/* Returns true if a zone has pages managed by the buddy allocator */
+static inline bool managed_zone(struct zone *zone)
 {
-	return (!!zone->present_pages);
+	return zone->managed_pages;
+}
+
+/* Returns true if a zone has memory */
+static inline bool populated_zone(struct zone *zone)
+{
+	return zone->present_pages;
 }
 
 extern int movable_zone;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c09d9f7f692..ea7558149ee5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4405,7 +4405,7 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
-		if (populated_zone(zone)) {
+		if (managed_zone(zone)) {
 			zoneref_set_zone(zone,
 				&zonelist->_zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
@@ -4643,7 +4643,7 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 		for (j = 0; j < nr_nodes; j++) {
 			node = node_order[j];
 			z = &NODE_DATA(node)->node_zones[zone_type];
-			if (populated_zone(z)) {
+			if (managed_zone(z)) {
 				zoneref_set_zone(z,
 					&zonelist->_zonerefs[pos++]);
 				check_highest_zone(zone_type);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 98774f45b04a..55943a284082 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1665,7 +1665,7 @@ static bool inactive_reclaimable_pages(struct lruvec *lruvec,
 	for (zid = sc->reclaim_idx; zid >= 0; zid--) {
 		zone = &pgdat->node_zones[zid];
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		if (zone_page_state_snapshot(zone, NR_ZONE_LRU_BASE +
@@ -2036,7 +2036,7 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 		struct zone *zone = &pgdat->node_zones[zid];
 		unsigned long inactive_zone, active_zone;
 
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		inactive_zone = zone_page_state(zone,
@@ -2171,7 +2171,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 		for (z = 0; z < MAX_NR_ZONES; z++) {
 			struct zone *zone = &pgdat->node_zones[z];
-			if (!populated_zone(zone))
+			if (!managed_zone(zone))
 				continue;
 
 			total_high_wmark += high_wmark_pages(zone);
@@ -2508,7 +2508,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	/* If compaction would go ahead or the allocation would succeed, stop */
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		struct zone *zone = &pgdat->node_zones[z];
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
@@ -2835,7 +2835,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!populated_zone(zone) ||
+		if (!managed_zone(zone) ||
 		    pgdat_reclaimable_pages(pgdat) == 0)
 			continue;
 
@@ -3136,7 +3136,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		if (!zone_balanced(zone, order, classzone_idx))
@@ -3164,7 +3164,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 	sc->nr_to_reclaim = 0;
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		zone = pgdat->node_zones + z;
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
@@ -3237,7 +3237,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		if (buffer_heads_over_limit) {
 			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
 				zone = pgdat->node_zones + i;
-				if (!populated_zone(zone))
+				if (!managed_zone(zone))
 					continue;
 
 				sc.reclaim_idx = i;
@@ -3257,7 +3257,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		for (i = classzone_idx; i >= 0; i--) {
 			zone = pgdat->node_zones + i;
-			if (!populated_zone(zone))
+			if (!managed_zone(zone))
 				continue;
 
 			if (zone_balanced(zone, sc.order, classzone_idx))
@@ -3503,7 +3503,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	pg_data_t *pgdat;
 	int z;
 
-	if (!populated_zone(zone))
+	if (!managed_zone(zone))
 		return;
 
 	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
@@ -3517,7 +3517,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
-		if (!populated_zone(zone))
+		if (!managed_zone(zone))
 			continue;
 
 		if (zone_balanced(zone, order, classzone_idx))

--
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-31  8:49             ` Mel Gorman
@ 2016-08-31 11:09               ` Michal Hocko
  2016-08-31 12:46                 ` Mel Gorman
  2016-08-31 17:33               ` Srikar Dronamraju
  1 sibling, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2016-08-31 11:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Srikar Dronamraju, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Wed 31-08-16 09:49:42, Mel Gorman wrote:
> On Wed, Aug 31, 2016 at 11:39:59AM +0530, Srikar Dronamraju wrote:
> > This indeed fixes the problem.
> > Please add my
> > Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > 
> 
> Ok, thanks. Unfortunately we cannot do a wide conversion like this
> because some users of populated_zone() really meant to check for
> present_pages. In all cases, the expectation was that reserved pages
> would be tiny but fadump messes that up. Can you verify this also
> works please?
> 
> ---8<---
> mm, vmscan: Only allocate and reclaim from zones with pages managed by
> the buddy allocator
> 
> Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
> of memory when booting a secondary kernel. Srikar Dronamraju reported
> that multiple nodes may have no memory managed by the buddy allocator
> but still return true for populated_zone().
> 
> Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> was reported to cause kswapd to spin at 100% CPU usage when fadump was
> enabled. The old code happened to deal with the situation of a populated
> node with zero free pages by co-incidence but the current code tries to
> reclaim populated zones without realising that is impossible.
> 
> We cannot just convert populated_zone() as many existing users really
> need to check for present_pages. This patch introduces a managed_zone()
> helper and uses it in the few cases where it is critical that the check
> is made for managed pages -- zonelist constuction and page reclaim.

OK, the patch makes sense to me. I am not happy about two very similar
functions, to be honest though. managed vs. present checks will be quite
subtle and it is not entirely clear when to use which one. I agree that
the reclaim path is the most critical one so the patch seems OK to me.
At least from a quick glance it should help with the reported issue so
feel free to add

Acked-by: Michal Hocko <mhocko@suse.com>

I expect we might want to turn other places as well but they are far
from critical. I would appreciate some lead there and stick a
clarifying comment

[...]
> -static inline int populated_zone(struct zone *zone)
> +/* Returns true if a zone has pages managed by the buddy allocator */

/*
 * Returns true if a zone has pages managed by the buddy allocator.
 * All the reclaim decisions have to use this function rather than
 * populated_zone(). If the whole zone is reserved then we can easily
 * end up with populated_zone() && !managed_zone().
 */

What do you think?

> +static inline bool managed_zone(struct zone *zone)
> {
> -	return (!!zone->present_pages);
> +	return zone->managed_pages;
> +}
> +
> +/* Returns true if a zone has memory */
> +static inline bool populated_zone(struct zone *zone)
> +{
> +	return zone->present_pages;
> }

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-31 11:09               ` Michal Hocko
@ 2016-08-31 12:46                 ` Mel Gorman
  0 siblings, 0 replies; 9+ messages in thread
From: Mel Gorman @ 2016-08-31 12:46 UTC (permalink / raw)
To: Michal Hocko
Cc: Srikar Dronamraju, Andrew Morton, Linux-MM, Rik van Riel,
	Vlastimil Babka, Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML,
	Michael Ellerman, linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

On Wed, Aug 31, 2016 at 01:09:33PM +0200, Michal Hocko wrote:
> > We cannot just convert populated_zone() as many existing users really
> > need to check for present_pages. This patch introduces a managed_zone()
> > helper and uses it in the few cases where it is critical that the check
> > is made for managed pages -- zonelist constuction and page reclaim.
> 
> OK, the patch makes sense to me. I am not happy about two very similar
> functions, to be honest though. managed vs. present checks will be quite
> subtle and it is not entirely clear when to use which one.

In the vast majority of cases, the distinction is irrelevant. The patch
only updates the places where it really matters to minimise any
confusion.

> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks.

> /*
>  * Returns true if a zone has pages managed by the buddy allocator.
>  * All the reclaim decisions have to use this function rather than
>  * populated_zone(). If the whole zone is reserved then we can easily
>  * end up with populated_zone() && !managed_zone().
>  */
> 
> What do you think?

This makes a lot of sense. I've updated the patch and will await a test
from Srikar before reposting.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes
  2016-08-31  8:49             ` Mel Gorman
  2016-08-31 11:09               ` Michal Hocko
@ 2016-08-31 17:33               ` Srikar Dronamraju
  1 sibling, 0 replies; 9+ messages in thread
From: Srikar Dronamraju @ 2016-08-31 17:33 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka,
	Johannes Weiner, Minchan Kim, Joonsoo Kim, LKML, Michael Ellerman,
	linuxppc-dev, Mahesh Salgaonkar, Hari Bathini

> mm, vmscan: Only allocate and reclaim from zones with pages managed by
> the buddy allocator
> 
> Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
> of memory when booting a secondary kernel. Srikar Dronamraju reported
> that multiple nodes may have no memory managed by the buddy allocator
> but still return true for populated_zone().
> 
> Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> was reported to cause kswapd to spin at 100% CPU usage when fadump was
> enabled. The old code happened to deal with the situation of a populated
> node with zero free pages by co-incidence but the current code tries to
> reclaim populated zones without realising that is impossible.
> 
> We cannot just convert populated_zone() as many existing users really
> need to check for present_pages. This patch introduces a managed_zone()
> helper and uses it in the few cases where it is critical that the check
> is made for managed pages -- zonelist constuction and page reclaim.

one nit
s/constuction/construction/

> 

Verified that it works fine.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2016-08-31 17:33 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
[not found] <1467970510-21195-1-git-send-email-mgorman@techsingularity.net>
[not found] ` <1467970510-21195-8-git-send-email-mgorman@techsingularity.net>
2016-08-29  9:38   ` [PATCH 07/34] mm, vmscan: make kswapd reclaim in terms of nodes Srikar Dronamraju
2016-08-30 12:07     ` Mel Gorman
2016-08-30 14:25       ` Srikar Dronamraju
2016-08-30 15:00         ` Mel Gorman
2016-08-31  6:09           ` Srikar Dronamraju
2016-08-31  8:49             ` Mel Gorman
2016-08-31 11:09               ` Michal Hocko
2016-08-31 12:46                 ` Mel Gorman
2016-08-31 17:33               ` Srikar Dronamraju