[RFC 0/4] Outsourcing page fault THP allocations to khugepaged

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC 0/4] Outsourcing page fault THP allocations to khugepaged
@ 2015-05-11 14:35 Vlastimil Babka
  2015-05-11 14:35 ` [RFC 1/4] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Vlastimil Babka @ 2015-05-11 14:35 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Alex Thorlton, David Rientjes, Vlastimil Babka

This series is an updated subset of the "big khugepaged redesign" [1] which
was then discussed at LSF/MM [2]. Following some advice, I split the series
and this is supposedly the less controversial part :)

What it means that the patches don't move the collapse scanning to task_work
context (yet), but focus on reducing the reclaim and compaction done in page
fault context, by shifting this effort towards khugepaged. This is benefical
for two reasons:

- reclaim and compaction in page fault context adds to the page fault latency,
  which might offset any benefits of a THP, especially for short-lived
  allocations, which cannot be distinguished at the time of page fault anyway

- THP allocations in page fault use only asynchronous compaction, which
  reduces the latency, but also the probability of succeeding. Failures do not
  result in deferred compaction. Khugepaged will use the more thorough
  synchronous compaction, won't exit in the middle of the work due to
  need_resched() and will cooperate with the deferred compaction mechanism
  properly.

To achieve this:

* Patch 1 removes the THP preallocation from khugepaged in preparation for
  the next patch. It is restricted to !NUMA configurations and complicates the
  code.

* Patch 2 introduces a thp_avail_nodes nodemask where khugepaged clears bits
  for nodes where it failed to allocate a hugepage during collapse. Before
  scanning for collapse, it tries to allocate a hugepage from each such node
  and set the bit back. If all online nodes are cleared and cannot be re-set,
  it won't scan for collapse at all. In case the THP is going to be collapsed
  on one of the nodes that are cleared, it will skip such PMD ASAP.

* Patch 3 uses the nodemask introduced in Patch 2 also to determine whether
  page faults should skip the attempt to allocate THP. It will also clear the
  node where allocation is attempted and fails. Complementary, freeing of page
  of sufficient order from any context sets the node as THP-available.

* Patch 4 improves the reaction to THP page fault allocation attempts by waking
  khugepaged in case allocation is both failed or skipped due to cleared
  availability bit. The latter ensures that deferred compaction is tracked
  appropriately for each potentially-THP page fault.

For evaluation, the new thpscale benchmark from mmtests was used. This test
fragments memory between anonymous and file mappings and then tries to fault
aligned 2MB blocks in another anonymous mapping, using mincore(2) to determine
if the first fault has brought the whole block and thus it was a THP page
fault. The anonymous mappings should fit in the memory while the file mappings
are expected to be reclaimed during the process, The latency is measured for
the whole sequence of initial fault, mincore syscall, and memset of the whole
block. Latency is reported in microseconds, separately for blocks that were
faulted as THP and base pages. This is repeated with different numbers of
threads doing the faults in parallel.

The results are not particularly stable, but show the difference of this
patchset. This is on 4-core single-node machine:

thpscale Fault Latencies (microseconds)
                                     4.1-rc2               4.1-rc2
                                           0                     4
Min      fault-base-1      1562.00 (  0.00%)     1407.00 (  9.92%)
Min      fault-base-3      1855.00 (  0.00%)     1808.00 (  2.53%)
Min      fault-base-5      2091.00 (  0.00%)     1930.00 (  7.70%)
Min      fault-base-7      2082.00 (  0.00%)     2222.00 ( -6.72%)
Min      fault-base-12     2489.00 (  0.00%)     2292.00 (  7.91%)
Min      fault-base-16     2092.00 (  0.00%)     1928.00 (  7.84%)
Min      fault-huge-1       953.00 (  0.00%)     1282.00 (-34.52%)
Min      fault-huge-3      1319.00 (  0.00%)     1218.00 (  7.66%)
Min      fault-huge-5      1527.00 (  0.00%)     1268.00 ( 16.96%)
Min      fault-huge-7      1277.00 (  0.00%)     1276.00 (  0.08%)
Min      fault-huge-12     2286.00 (  0.00%)     1419.00 ( 37.93%)
Min      fault-huge-16     2395.00 (  0.00%)     2158.00 (  9.90%)
Amean    fault-base-1      3322.97 (  0.00%)     2130.35 ( 35.89%)
Amean    fault-base-3      3372.55 (  0.00%)     3331.46 (  1.22%)
Amean    fault-base-5      7684.34 (  0.00%)     4086.17 ( 46.82%)
Amean    fault-base-7     10010.14 (  0.00%)     5367.27 ( 46.38%)
Amean    fault-base-12    11000.00 (  0.00%)     8529.81 ( 22.46%)
Amean    fault-base-16    15021.71 (  0.00%)    14164.72 (  5.70%)
Amean    fault-huge-1      2534.19 (  0.00%)     2419.83 (  4.51%)
Amean    fault-huge-3      5312.42 (  0.00%)     4783.90 (  9.95%)
Amean    fault-huge-5      8086.82 (  0.00%)     7050.06 ( 12.82%)
Amean    fault-huge-7     11184.91 (  0.00%)     6359.74 ( 43.14%)
Amean    fault-huge-12    17218.58 (  0.00%)     9120.60 ( 47.03%)
Amean    fault-huge-16    18176.03 (  0.00%)    21161.54 (-16.43%)
Stddev   fault-base-1      3652.46 (  0.00%)     3197.59 ( 12.45%)
Stddev   fault-base-3      4960.05 (  0.00%)     5633.47 (-13.58%)
Stddev   fault-base-5      9309.31 (  0.00%)     6587.24 ( 29.24%)
Stddev   fault-base-7     11266.55 (  0.00%)     7629.93 ( 32.28%)
Stddev   fault-base-12    10899.31 (  0.00%)     9803.98 ( 10.05%)
Stddev   fault-base-16    17360.78 (  0.00%)    18654.45 ( -7.45%)
Stddev   fault-huge-1       764.26 (  0.00%)      379.14 ( 50.39%)
Stddev   fault-huge-3      6030.37 (  0.00%)     4231.11 ( 29.84%)
Stddev   fault-huge-5      5953.79 (  0.00%)     7069.40 (-18.74%)
Stddev   fault-huge-7      8557.60 (  0.00%)     5742.90 ( 32.89%)
Stddev   fault-huge-12    12563.23 (  0.00%)     7376.70 ( 41.28%)
Stddev   fault-huge-16    10370.34 (  0.00%)    14153.56 (-36.48%)
CoeffVar fault-base-1       109.92 (  0.00%)      150.10 (-36.56%)
CoeffVar fault-base-3       147.07 (  0.00%)      169.10 (-14.98%)
CoeffVar fault-base-5       121.15 (  0.00%)      161.21 (-33.07%)
CoeffVar fault-base-7       112.55 (  0.00%)      142.16 (-26.30%)
CoeffVar fault-base-12       99.08 (  0.00%)      114.94 (-16.00%)
CoeffVar fault-base-16      115.57 (  0.00%)      131.70 (-13.95%)
CoeffVar fault-huge-1        30.16 (  0.00%)       15.67 ( 48.05%)
CoeffVar fault-huge-3       113.51 (  0.00%)       88.44 ( 22.09%)
CoeffVar fault-huge-5        73.62 (  0.00%)      100.27 (-36.20%)
CoeffVar fault-huge-7        76.51 (  0.00%)       90.30 (-18.02%)
CoeffVar fault-huge-12       72.96 (  0.00%)       80.88 (-10.85%)
CoeffVar fault-huge-16       57.06 (  0.00%)       66.88 (-17.23%)
Max      fault-base-1     47334.00 (  0.00%)    49600.00 ( -4.79%)
Max      fault-base-3     65729.00 (  0.00%)    74554.00 (-13.43%)
Max      fault-base-5     64057.00 (  0.00%)    56862.00 ( 11.23%)
Max      fault-base-7     78693.00 (  0.00%)    63878.00 ( 18.83%)
Max      fault-base-12   129893.00 (  0.00%)    53485.00 ( 58.82%)
Max      fault-base-16   120831.00 (  0.00%)   155015.00 (-28.29%)
Max      fault-huge-1     12520.00 (  0.00%)     8713.00 ( 30.41%)
Max      fault-huge-3     56081.00 (  0.00%)    48753.00 ( 13.07%)
Max      fault-huge-5     37449.00 (  0.00%)    40032.00 ( -6.90%)
Max      fault-huge-7     46929.00 (  0.00%)    32946.00 ( 29.80%)
Max      fault-huge-12    73446.00 (  0.00%)    39423.00 ( 46.32%)
Max      fault-huge-16    51139.00 (  0.00%)    67562.00 (-32.11%)

The Amean lines show mostly reduction in latencies for both successful THP
faults and base page fallbacks, except for the 16-thread cases (on 4-core
machine) where the increased khugepaged activity might be usurping the CPU
time too much.

thpscale Percentage Faults Huge
                                 4.1-rc2               4.1-rc2
                                       0                     4
Percentage huge-1        78.23 (  0.00%)       71.27 ( -8.90%)
Percentage huge-3        11.41 (  0.00%)       35.23 (208.89%)
Percentage huge-5        57.72 (  0.00%)       28.99 (-49.78%)
Percentage huge-7        52.81 (  0.00%)       15.56 (-70.53%)
Percentage huge-12       22.69 (  0.00%)       51.03 (124.86%)
Percentage huge-16        7.65 (  0.00%)       12.50 ( 63.33%)

The THP success rates are too unstable to draw firm conclusions. Keep in mind
that reducing the page fault latency is likely more important than the THP
benefits, which can still be achieved for longer-running processes through
khugepaged collapses.

             4.1-rc2     4.1-rc2
                   0           4
User           15.14       14.93
System         56.75       51.12
Elapsed       199.85      196.71

                               4.1-rc2     4.1-rc2
                                     0           4
Minor Faults                   1721504     1891067
Major Faults                       315         317
Swap Ins                             0           0
Swap Outs                            0           0
Allocation stalls                 3191         691
DMA allocs                           0           0
DMA32 allocs                   7189739     7238693
Normal allocs                  2462965     2373646
Movable allocs                       0           0
Direct pages scanned            910953      619549
Kswapd pages scanned            302034      310422
Kswapd pages reclaimed           57791       89525
Direct pages reclaimed          182170       62029
Kswapd efficiency                  19%         28%
Kswapd velocity               1511.303    1578.069
Direct efficiency                  19%         10%
Direct velocity               4558.184    3149.555
Percentage direct scans            75%         66%
Zone normal velocity          1847.766    1275.426
Zone dma32 velocity           4221.721    3452.199
Zone dma velocity                0.000       0.000
Page writes by reclaim           0.000       0.000
Page writes file                     0           0
Page writes anon                     0           0
Page reclaim immediate              20          11
Sector Reads                   4991812     4991228
Sector Writes                  3246508     3246912
Page rescued immediate               0           0
Slabs scanned                    62448       62080
Direct inode steals                 17          14
Kswapd inode steals                  0           0
Kswapd skipped wait                  0           0
THP fault alloc                  11385       11058
THP collapse alloc                   2         105
THP splits                        9568        9375
THP fault fallback                2937        3269
THP collapse fail                    0           1
Compaction stalls                 7551        1500
Compaction success                1611        1191
Compaction failures               5940         309
Page migrate success            569476      421021
Page migrate failure                 0           0
Compaction pages isolated      1451445      937675
Compaction migrate scanned     1416728      768084
Compaction free scanned        3800385     5859981
Compaction cost                    628         460
NUMA alloc hit                 3833019     3907129
NUMA alloc miss                      0           0
NUMA interleave hit                  0           0
NUMA alloc local               3833019     3907129
NUMA base PTE updates                0           0
NUMA huge PMD updates                0           0
NUMA page range updates              0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA hint local percent            100         100
NUMA pages migrated                  0           0
AutoNUMA cost                       0%          0%

Note that the THP stats are not that useful as they include the preparatory
phases of the benchmark. But notice the much improved compaction success
ratio. It appears that the compaction for THP page faults is already so
crippled in order to reduce latencies, that it's mostly not worth attempting
it at all...

Next, the test was repeated with system configured to not pass GFP_WAIT for
THP page faults by:

echo never > /sys/kernel/mm/transparent_hugepage/defrag

This means no reclaim and compaction in page fault context, while khugepaged
keeps using GFP_WAIT per /sys/kernel/mm/transparent_hugepage/khugepaged/defrag


thpscale Fault Latencies
                                     4.1-rc2               4.1-rc2
                                        0-nd                  4-nd
Min      fault-base-1      1378.00 (  0.00%)     1390.00 ( -0.87%)
Min      fault-base-3      1479.00 (  0.00%)     1623.00 ( -9.74%)
Min      fault-base-5      1440.00 (  0.00%)     1415.00 (  1.74%)
Min      fault-base-7      1379.00 (  0.00%)     1434.00 ( -3.99%)
Min      fault-base-12     1946.00 (  0.00%)     2132.00 ( -9.56%)
Min      fault-base-16     1913.00 (  0.00%)     2007.00 ( -4.91%)
Min      fault-huge-1      1031.00 (  0.00%)      964.00 (  6.50%)
Min      fault-huge-3      1535.00 (  0.00%)     1037.00 ( 32.44%)
Min      fault-huge-5      1261.00 (  0.00%)     1282.00 ( -1.67%)
Min      fault-huge-7      1265.00 (  0.00%)     1464.00 (-15.73%)
Min      fault-huge-12     1275.00 (  0.00%)     1179.00 (  7.53%)
Min      fault-huge-16     1231.00 (  0.00%)     1231.00 (  0.00%)
Amean    fault-base-1      1573.16 (  0.00%)     2095.32 (-33.19%)
Amean    fault-base-3      2544.30 (  0.00%)     3256.53 (-27.99%)
Amean    fault-base-5      3412.16 (  0.00%)     3687.55 ( -8.07%)
Amean    fault-base-7      4633.68 (  0.00%)     5329.99 (-15.03%)
Amean    fault-base-12     7794.71 (  0.00%)     8441.45 ( -8.30%)
Amean    fault-base-16    13747.18 (  0.00%)    11033.65 ( 19.74%)
Amean    fault-huge-1      1279.44 (  0.00%)     1300.09 ( -1.61%)
Amean    fault-huge-3      2300.40 (  0.00%)     2267.17 (  1.44%)
Amean    fault-huge-5      1929.86 (  0.00%)     2899.17 (-50.23%)
Amean    fault-huge-7      1803.33 (  0.00%)     3549.11 (-96.81%)
Amean    fault-huge-12     2714.91 (  0.00%)     6106.21 (-124.91%)
Amean    fault-huge-16     5166.36 (  0.00%)     9565.15 (-85.14%)
Stddev   fault-base-1      1986.46 (  0.00%)     1377.20 ( 30.67%)
Stddev   fault-base-3      5293.92 (  0.00%)     5594.88 ( -5.69%)
Stddev   fault-base-5      5291.19 (  0.00%)     5583.54 ( -5.53%)
Stddev   fault-base-7      5861.45 (  0.00%)     7460.34 (-27.28%)
Stddev   fault-base-12    10754.38 (  0.00%)    11992.12 (-11.51%)
Stddev   fault-base-16    17183.11 (  0.00%)    12995.81 ( 24.37%)
Stddev   fault-huge-1        71.03 (  0.00%)       54.49 ( 23.29%)
Stddev   fault-huge-3       441.09 (  0.00%)      730.62 (-65.64%)
Stddev   fault-huge-5      3291.41 (  0.00%)     4308.06 (-30.89%)
Stddev   fault-huge-7       713.08 (  0.00%)     1226.08 (-71.94%)
Stddev   fault-huge-12     2667.32 (  0.00%)     7780.83 (-191.71%)
Stddev   fault-huge-16     4618.22 (  0.00%)     8364.24 (-81.11%)
CoeffVar fault-base-1       126.27 (  0.00%)       65.73 ( 47.95%)
CoeffVar fault-base-3       208.07 (  0.00%)      171.81 ( 17.43%)
CoeffVar fault-base-5       155.07 (  0.00%)      151.42 (  2.36%)
CoeffVar fault-base-7       126.50 (  0.00%)      139.97 (-10.65%)
CoeffVar fault-base-12      137.97 (  0.00%)      142.06 ( -2.97%)
CoeffVar fault-base-16      124.99 (  0.00%)      117.78 (  5.77%)
CoeffVar fault-huge-1         5.55 (  0.00%)        4.19 ( 24.50%)
CoeffVar fault-huge-3        19.17 (  0.00%)       32.23 (-68.07%)
CoeffVar fault-huge-5       170.55 (  0.00%)      148.60 ( 12.87%)
CoeffVar fault-huge-7        39.54 (  0.00%)       34.55 ( 12.64%)
CoeffVar fault-huge-12       98.25 (  0.00%)      127.42 (-29.70%)
CoeffVar fault-huge-16       89.39 (  0.00%)       87.44 (  2.18%)
Max      fault-base-1     56069.00 (  0.00%)    37361.00 ( 33.37%)
Max      fault-base-3     75921.00 (  0.00%)    74860.00 (  1.40%)
Max      fault-base-5     53708.00 (  0.00%)    60756.00 (-13.12%)
Max      fault-base-7     43282.00 (  0.00%)    58071.00 (-34.17%)
Max      fault-base-12    86499.00 (  0.00%)    95819.00 (-10.77%)
Max      fault-base-16   106264.00 (  0.00%)    81830.00 ( 22.99%)
Max      fault-huge-1      1387.00 (  0.00%)     1365.00 (  1.59%)
Max      fault-huge-3      2831.00 (  0.00%)     3395.00 (-19.92%)
Max      fault-huge-5     19345.00 (  0.00%)    23269.00 (-20.28%)
Max      fault-huge-7      2811.00 (  0.00%)     5935.00 (-111.13%)
Max      fault-huge-12    10869.00 (  0.00%)    36037.00 (-231.56%)
Max      fault-huge-16    13614.00 (  0.00%)    40513.00 (-197.58%)

With no reclaim/compaction from page fault context, there's nothing to improve
here. Indeed it can be only worse due to extra khugepaged activity.

thpscale Percentage Faults Huge
                                 4.1-rc2               4.1-rc2
                                    0-nd                  4-nd
Percentage huge-1         2.28 (  0.00%)        7.09 (211.11%)
Percentage huge-3         0.63 (  0.00%)        8.11 (1180.00%)
Percentage huge-5         3.67 (  0.00%)        4.56 ( 24.14%)
Percentage huge-7         0.38 (  0.00%)        1.15 (200.00%)
Percentage huge-12        1.41 (  0.00%)        3.08 (118.18%)
Percentage huge-16        1.79 (  0.00%)       10.97 (514.29%)

Khugepaged does manage to free some hugepages for page faults, but with the
maximum possible fault frequency the benchmark induces, it can't keep up
obviously.  Could be better in a more realistic scenario.

             4.1-rc2     4.1-rc2
                0-nd        4-nd
User           13.61       14.10
System         50.16       48.65
Elapsed       195.12      194.67

                               4.1-rc2     4.1-rc2
                                  0-nd        4-nd
Minor Faults                   2916846     2738269
Major Faults                       205         203
Swap Ins                             0           0
Swap Outs                            0           0
Allocation stalls                  586         329
DMA allocs                           0           0
DMA32 allocs                   6965325     7256686
Normal allocs                  2577724     2454522
Movable allocs                       0           0
Direct pages scanned            443280      263574
Kswapd pages scanned            314174      233582
Kswapd pages reclaimed          108029       60679
Direct pages reclaimed           27267       40383
Kswapd efficiency                  34%         25%
Kswapd velocity               1610.158    1199.887
Direct efficiency                   6%         15%
Direct velocity               2271.833    1353.953
Percentage direct scans            58%         53%
Zone normal velocity           925.390     757.764
Zone dma32 velocity           2956.601    1796.075
Zone dma velocity                0.000       0.000
Page writes by reclaim           0.000       0.000
Page writes file                     0           0
Page writes anon                     0           0
Page reclaim immediate              13           9
Sector Reads                   4976736     4977540
Sector Writes                  3246536     3246076
Page rescued immediate               0           0
Slabs scanned                    61802       62034
Direct inode steals                  0           0
Kswapd inode steals                 16           0
Kswapd skipped wait                  0           0
THP fault alloc                   9022        9375
THP collapse alloc                   0         377
THP splits                        8939        9150
THP fault fallback                5300        4953
THP collapse fail                    0           2
Compaction stalls                    0         434
Compaction success                   0         291
Compaction failures                  0         143
Page migrate success                 0      287093
Page migrate failure                 0           1
Compaction pages isolated            0      608761
Compaction migrate scanned           0      365724
Compaction free scanned              0     3588885
Compaction cost                      0         312
NUMA alloc hit                 4932019     4727109
NUMA alloc miss                      0           0
NUMA interleave hit                  0           0
NUMA alloc local               4932019     4727109
NUMA base PTE updates                0           0
NUMA huge PMD updates                0           0
NUMA page range updates              0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA hint local percent            100         100
NUMA pages migrated                  0           0
AutoNUMA cost                       0%          0%

Without the patchset, there's no compaction as the benchmark is too short for
the khugepaged collapses scanning to do anything. With the patchset, we wake
up khugepaged for the reclaim/compaction immediately.

To conclude, these results suggest that it's better tradeoff to keep page
faults attempt some light compaction, but the patchset reduces latencies and
improves compaction success rates by preventing these light attempts to
continue once they stop being successful. As much as I would like to see the
page faults to not use GFP_WAIT by default (i.e. echo never/madvise >
.../defrag), that test currently doesn't show much benefit, although I suspect
it's because the benchmark is too unrealistically fault-intensive as it is, so
khugepaged is doing much work and still can't keep up.

It probably also doesn't help that once khugepaged is woken up, it will try
both the THP allocations and then the scanning for collapses work, so that
scanning is done also more frequently than via the controlled sleeps. I'll
think about how to decouple that for the next version. Maybe just skip the
collapse scanning altogether when khugepaged was woken up for THP allocation,
since that is arguably higher priority.

It would be simpler if and more efficient if each node had own khugepaged just
for the THP allocation work, and scanning for collapse would be done in
task_work context. But that's for later. Thoughts?

[1] https://lwn.net/Articles/634384/
[2] https://lwn.net/Articles/636162/

Vlastimil Babka (4):
  mm, thp: stop preallocating hugepages in khugepaged
  mm, thp: khugepaged checks for THP allocability before scanning
  mm, thp: try fault allocations only if we expect them to succeed
  mm, thp: wake up khugepaged when huge page is not available

 mm/huge_memory.c | 216 +++++++++++++++++++++++++++++++------------------------
 mm/internal.h    |  36 ++++++++++
 mm/mempolicy.c   |  37 ++++++----
 mm/page_alloc.c  |   3 +
 4 files changed, 182 insertions(+), 110 deletions(-)

-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC 1/4] mm, thp: stop preallocating hugepages in khugepaged
  2015-05-11 14:35 [RFC 0/4] Outsourcing page fault THP allocations to khugepaged Vlastimil Babka
@ 2015-05-11 14:35 ` Vlastimil Babka
  2015-06-18  0:34   ` David Rientjes
  2015-05-11 14:35 ` [RFC 2/4] mm, thp: khugepaged checks for THP allocability before scanning Vlastimil Babka
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Vlastimil Babka @ 2015-05-11 14:35 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Alex Thorlton, David Rientjes, Vlastimil Babka

Khugepaged tries to preallocate a hugepage before scanning for THP collapse
candidates. If the preallocation fails, scanning is not attempted. This makes
sense, but it is only restricted to !NUMA configurations, where it does not
need to predict on which node to preallocate.

Besides the !NUMA restriction, the preallocated page may also end up being
unused and put back when no collapse candidate is found. I have observed the
thp_collapse_alloc vmstat counter to have 3+ times the value of the counter
of actually collapsed pages in /sys/.../khugepaged/pages_collapsed. On the
other hand, the periodic hugepage allocation attempts involving sync
compaction can be beneficial for the antifragmentation mechanism, but that's
however harder to evaluate.

The following patch will introduce per-node THP availability tracking, which
has more benefits than current preallocation and is applicable to CONFIG_NUMA.
We can therefore remove the preallocation, which also allows a cleanup of the
functions involved in khugepaged allocations. Another small benefit of the
patch is that NUMA configs can now reuse an allocated hugepage for another
collapse attempt, if the previous one was for the same node and failed.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c | 150 ++++++++++++++++++++-----------------------------------
 1 file changed, 53 insertions(+), 97 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 078832c..565864b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -765,9 +765,9 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	return 0;
 }
 
-static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
+static inline gfp_t alloc_hugepage_gfpmask(int defrag)
 {
-	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT));
 }
 
 /* Caller must hold page table lock. */
@@ -825,7 +825,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		return 0;
 	}
-	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
+	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1116,7 +1116,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
-		huge_gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
+		huge_gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
@@ -2318,39 +2318,41 @@ static int khugepaged_find_target_node(void)
 	return target_node;
 }
 
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+static inline struct page *alloc_hugepage_node(gfp_t gfp, int node)
 {
-	if (IS_ERR(*hpage)) {
-		if (!*wait)
-			return false;
-
-		*wait = false;
-		*hpage = NULL;
-		khugepaged_alloc_sleep();
-	} else if (*hpage) {
-		put_page(*hpage);
-		*hpage = NULL;
-	}
-
-	return true;
+	gfp |= __GFP_THISNODE | __GFP_OTHER_NODE;
+	return alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER);
+}
+#else
+static int khugepaged_find_target_node(void)
+{
+	return 0;
 }
 
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
-		       struct vm_area_struct *vma, unsigned long address,
-		       int node)
+static inline struct page *alloc_hugepage_node(gfp_t gfp, int node)
 {
-	VM_BUG_ON_PAGE(*hpage, *hpage);
+	return alloc_pages(gfp, HPAGE_PMD_ORDER);
+}
+#endif
 
+static struct page
+*khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
+{
 	/*
-	 * Before allocating the hugepage, release the mmap_sem read lock.
-	 * The allocation can take potentially a long time if it involves
-	 * sync compaction, and we do not need to hold the mmap_sem during
-	 * that. We will recheck the vma after taking it again in write mode.
+	 * If we allocated a hugepage previously and failed to collapse, reuse
+	 * the page, unless it's on different NUMA node.
 	 */
-	up_read(&mm->mmap_sem);
+	if (!IS_ERR_OR_NULL(*hpage)) {
+		if (IS_ENABLED(CONFIG_NUMA) && page_to_nid(*hpage) != node) {
+			put_page(*hpage);
+			*hpage = NULL;
+		} else {
+			return *hpage;
+		}
+	}
+
+	*hpage = alloc_hugepage_node(gfp, node);
 
-	*hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
@@ -2360,60 +2362,6 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return *hpage;
 }
-#else
-static int khugepaged_find_target_node(void)
-{
-	return 0;
-}
-
-static inline struct page *alloc_hugepage(int defrag)
-{
-	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
-			   HPAGE_PMD_ORDER);
-}
-
-static struct page *khugepaged_alloc_hugepage(bool *wait)
-{
-	struct page *hpage;
-
-	do {
-		hpage = alloc_hugepage(khugepaged_defrag());
-		if (!hpage) {
-			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-			if (!*wait)
-				return NULL;
-
-			*wait = false;
-			khugepaged_alloc_sleep();
-		} else
-			count_vm_event(THP_COLLAPSE_ALLOC);
-	} while (unlikely(!hpage) && likely(khugepaged_enabled()));
-
-	return hpage;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
-	if (!*hpage)
-		*hpage = khugepaged_alloc_hugepage(wait);
-
-	if (unlikely(!*hpage))
-		return false;
-
-	return true;
-}
-
-static struct page *
-khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
-		       struct vm_area_struct *vma, unsigned long address,
-		       int node)
-{
-	up_read(&mm->mmap_sem);
-	VM_BUG_ON(!*hpage);
-
-	return  *hpage;
-}
-#endif
 
 static bool hugepage_vma_check(struct vm_area_struct *vma)
 {
@@ -2449,17 +2397,25 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	/* Only allocate from the target node */
-	gfp = alloc_hugepage_gfpmask(khugepaged_defrag(), __GFP_OTHER_NODE) |
-		__GFP_THISNODE;
+	/* 
+	 * Determine the flags relevant for both hugepage allocation and memcg
+	 * charge. Hugepage allocation may still add __GFP_THISNODE and
+	 * __GFP_OTHER_NODE, which memcg ignores.
+	 */
+	gfp = alloc_hugepage_gfpmask(khugepaged_defrag());
 
-	/* release the mmap_sem read lock. */
-	new_page = khugepaged_alloc_page(hpage, gfp, mm, vma, address, node);
+	/*
+	 * Before allocating the hugepage, release the mmap_sem read lock.
+	 * The allocation can take potentially a long time if it involves
+	 * sync compaction, and we do not need to hold the mmap_sem during
+	 * that. We will recheck the vma after taking it again in write mode.
+	 */
+	up_read(&mm->mmap_sem);
+	new_page = khugepaged_alloc_page(hpage, gfp, node);
 	if (!new_page)
 		return;
 
-	if (unlikely(mem_cgroup_try_charge(new_page, mm,
-					   gfp, &memcg)))
+	if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg)))
 		return;
 
 	/*
@@ -2788,15 +2744,9 @@ static void khugepaged_do_scan(void)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
-	unsigned int pages = khugepaged_pages_to_scan;
-	bool wait = true;
-
-	barrier(); /* write khugepaged_pages_to_scan to local stack */
+	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
 
 	while (progress < pages) {
-		if (!khugepaged_prealloc_page(&hpage, &wait))
-			break;
-
 		cond_resched();
 
 		if (unlikely(kthread_should_stop() || freezing(current)))
@@ -2812,6 +2762,12 @@ static void khugepaged_do_scan(void)
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
+
+		/* THP allocation has failed during collapse */
+		if (IS_ERR(hpage)) {
+			khugepaged_alloc_sleep();
+			break;
+		}
 	}
 
 	if (!IS_ERR_OR_NULL(hpage))
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC 1/4] mm, thp: stop preallocating hugepages in khugepaged
  2015-05-11 14:35 ` [RFC 1/4] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
@ 2015-06-18  0:34   ` David Rientjes
  0 siblings, 0 replies; 10+ messages in thread
From: David Rientjes @ 2015-06-18  0:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Alex Thorlton

On Mon, 11 May 2015, Vlastimil Babka wrote:

> Khugepaged tries to preallocate a hugepage before scanning for THP collapse
> candidates. If the preallocation fails, scanning is not attempted. This makes
> sense, but it is only restricted to !NUMA configurations, where it does not
> need to predict on which node to preallocate.
> 
> Besides the !NUMA restriction, the preallocated page may also end up being
> unused and put back when no collapse candidate is found. I have observed the
> thp_collapse_alloc vmstat counter to have 3+ times the value of the counter
> of actually collapsed pages in /sys/.../khugepaged/pages_collapsed. On the
> other hand, the periodic hugepage allocation attempts involving sync
> compaction can be beneficial for the antifragmentation mechanism, but that's
> however harder to evaluate.
> 
> The following patch will introduce per-node THP availability tracking, which
> has more benefits than current preallocation and is applicable to CONFIG_NUMA.
> We can therefore remove the preallocation, which also allows a cleanup of the
> functions involved in khugepaged allocations. Another small benefit of the
> patch is that NUMA configs can now reuse an allocated hugepage for another
> collapse attempt, if the previous one was for the same node and failed.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I think this is fine if the rest of the series is adopted, and I 
understand how the removal and cleanup is easier when done first before 
the following patches.  I think you can unify alloc_hugepage_node() for 
both NUMA and !NUMA configs and inline it in khugepaged_alloc_page().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC 2/4] mm, thp: khugepaged checks for THP allocability before scanning
  2015-05-11 14:35 [RFC 0/4] Outsourcing page fault THP allocations to khugepaged Vlastimil Babka
  2015-05-11 14:35 ` [RFC 1/4] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
@ 2015-05-11 14:35 ` Vlastimil Babka
  2015-06-18  1:00   ` David Rientjes
  2015-05-11 14:35 ` [RFC 3/4] mm, thp: try fault allocations only if we expect them to succeed Vlastimil Babka
  2015-05-11 14:35 ` [RFC 4/4] mm, thp: wake up khugepaged when huge page is not available Vlastimil Babka
  3 siblings, 1 reply; 10+ messages in thread
From: Vlastimil Babka @ 2015-05-11 14:35 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Alex Thorlton, David Rientjes, Vlastimil Babka

Khugepaged could be scanning for collapse candidates uselessly, if it cannot
allocate a hugepage in the end. The hugepage preallocation mechanism prevented
this, but only for !NUMA configurations. It was removed by the previous patch,
and this patch replaces it with a more generic mechanism.

The patch itroduces a thp_avail_nodes nodemask, which initially assumes that
hugepage can be allocated on any node. Whenever khugepaged fails to allocate
a hugepage, it clears the corresponding node bit. Before scanning for collapse
candidates, it tries to allocate a hugepage on each online node with the bit
cleared, and set it back on success. It tries to hold on to the hugepage if
it doesn't hold any other yet. But the assumption is that even if the hugepage
is freed back, it should be possible to allocate it in near future without
further reclaim and compaction attempts.

During the scaning, khugepaged avoids collapsing on nodes with the bit cleared,
as soon as possible. If no nodes have hugepages available, scanning is skipped
altogether.

During testing, the patch did not show much difference in preventing
thp_collapse_failed events from khugepaged, but this can be attributed to the
sync compaction, which only khugepaged is allowed to use, and which is
heavyweight enough to succeed frequently enough nowadays. The next patch will
however extend the nodemask check to page fault context, where it has much
larger impact. Also, with the future plan to convert THP collapsing to
task_work context, this patch is a preparation to avoid useless scanning or
heavyweight THP allocations in that context.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 63 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 565864b..b86a72a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -102,7 +102,7 @@ struct khugepaged_scan {
 static struct khugepaged_scan khugepaged_scan = {
 	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
 };
-
+static nodemask_t thp_avail_nodes = NODE_MASK_ALL;
 
 static int set_recommended_min_free_kbytes(void)
 {
@@ -2273,6 +2273,14 @@ static bool khugepaged_scan_abort(int nid)
 	int i;
 
 	/*
+	 * If it's clear that we are going to select a node where THP
+	 * allocation is unlikely to succeed, abort
+	 */
+	if (khugepaged_node_load[nid] == (HPAGE_PMD_NR / 2) &&
+				!node_isset(nid, thp_avail_nodes))
+		return true;
+
+	/*
 	 * If zone_reclaim_mode is disabled, then no extra effort is made to
 	 * allocate memory locally.
 	 */
@@ -2356,6 +2364,7 @@ static struct page
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
+		node_clear(node, thp_avail_nodes);
 		return NULL;
 	}
 
@@ -2363,6 +2372,42 @@ static struct page
 	return *hpage;
 }
 
+/* Return true, if THP should be allocatable on at least one node */
+static bool khugepaged_check_nodes(struct page **hpage)
+{
+	bool ret = false;
+	int nid;
+	struct page *newpage = NULL;
+	gfp_t gfp = alloc_hugepage_gfpmask(khugepaged_defrag());
+
+	for_each_online_node(nid) {
+		if (node_isset(nid, thp_avail_nodes)) {
+			ret = true;
+			continue;
+		}
+
+		newpage = alloc_hugepage_node(gfp, nid);
+
+		if (newpage) {
+			node_set(nid, thp_avail_nodes);
+			ret = true;
+			/*
+			 * Heuristic - try to hold on to the page for collapse
+			 * scanning, if we don't hold any yet.
+			 */
+			if (IS_ERR_OR_NULL(*hpage)) {
+				*hpage = newpage;
+				//NIXME: should we count all/no allocations?
+				count_vm_event(THP_COLLAPSE_ALLOC);
+			} else {
+				put_page(newpage);
+			}
+		}
+	}
+
+	return ret;
+}
+
 static bool hugepage_vma_check(struct vm_area_struct *vma)
 {
 	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
@@ -2590,6 +2635,10 @@ out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
 		node = khugepaged_find_target_node();
+		if (!node_isset(node, thp_avail_nodes)) {
+			ret = 0;
+			goto out;
+		}
 		/* collapse_huge_page will return with the mmap_sem released */
 		collapse_huge_page(mm, address, hpage, vma, node);
 	}
@@ -2740,12 +2789,16 @@ static int khugepaged_wait_event(void)
 		kthread_should_stop();
 }
 
-static void khugepaged_do_scan(void)
+/* Return false if THP allocation failed, true otherwise */
+static bool khugepaged_do_scan(void)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
 	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
 
+	if (!khugepaged_check_nodes(&hpage))
+		return false;
+
 	while (progress < pages) {
 		cond_resched();
 
@@ -2764,14 +2817,14 @@ static void khugepaged_do_scan(void)
 		spin_unlock(&khugepaged_mm_lock);
 
 		/* THP allocation has failed during collapse */
-		if (IS_ERR(hpage)) {
-			khugepaged_alloc_sleep();
-			break;
-		}
+		if (IS_ERR(hpage))
+			return false;
 	}
 
 	if (!IS_ERR_OR_NULL(hpage))
 		put_page(hpage);
+
+	return true;
 }
 
 static void khugepaged_wait_work(void)
@@ -2800,8 +2853,10 @@ static int khugepaged(void *none)
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
-		khugepaged_do_scan();
-		khugepaged_wait_work();
+		if (khugepaged_do_scan())
+			khugepaged_wait_work();
+		else
+			khugepaged_alloc_sleep();
 	}
 
 	spin_lock(&khugepaged_mm_lock);
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC 2/4] mm, thp: khugepaged checks for THP allocability before scanning
  2015-05-11 14:35 ` [RFC 2/4] mm, thp: khugepaged checks for THP allocability before scanning Vlastimil Babka
@ 2015-06-18  1:00   ` David Rientjes
  2015-06-23 15:41     ` Vlastimil Babka
  0 siblings, 1 reply; 10+ messages in thread
From: David Rientjes @ 2015-06-18  1:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Alex Thorlton

On Mon, 11 May 2015, Vlastimil Babka wrote:

> Khugepaged could be scanning for collapse candidates uselessly, if it cannot
> allocate a hugepage in the end. The hugepage preallocation mechanism prevented
> this, but only for !NUMA configurations. It was removed by the previous patch,
> and this patch replaces it with a more generic mechanism.
> 
> The patch itroduces a thp_avail_nodes nodemask, which initially assumes that
> hugepage can be allocated on any node. Whenever khugepaged fails to allocate
> a hugepage, it clears the corresponding node bit. Before scanning for collapse
> candidates, it tries to allocate a hugepage on each online node with the bit
> cleared, and set it back on success. It tries to hold on to the hugepage if
> it doesn't hold any other yet. But the assumption is that even if the hugepage
> is freed back, it should be possible to allocate it in near future without
> further reclaim and compaction attempts.
> 
> During the scaning, khugepaged avoids collapsing on nodes with the bit cleared,
> as soon as possible. If no nodes have hugepages available, scanning is skipped
> altogether.
> 

I'm not exactly sure what you mean by avoiding to do something as soon as 
possible.

> During testing, the patch did not show much difference in preventing
> thp_collapse_failed events from khugepaged, but this can be attributed to the
> sync compaction, which only khugepaged is allowed to use, and which is
> heavyweight enough to succeed frequently enough nowadays. The next patch will
> however extend the nodemask check to page fault context, where it has much
> larger impact. Also, with the future plan to convert THP collapsing to
> task_work context, this patch is a preparation to avoid useless scanning or
> heavyweight THP allocations in that context.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/huge_memory.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 63 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 565864b..b86a72a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -102,7 +102,7 @@ struct khugepaged_scan {
>  static struct khugepaged_scan khugepaged_scan = {
>  	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
>  };
> -
> +static nodemask_t thp_avail_nodes = NODE_MASK_ALL;

Seems like it should have khugepaged in its name so it's understood that 
the nodemask doesn't need to be synchronized, even though it will later be 
read outside of khugepaged, or at least a comment to say only khugepaged 
can store to it.

>  
>  static int set_recommended_min_free_kbytes(void)
>  {
> @@ -2273,6 +2273,14 @@ static bool khugepaged_scan_abort(int nid)
>  	int i;
>  
>  	/*
> +	 * If it's clear that we are going to select a node where THP
> +	 * allocation is unlikely to succeed, abort
> +	 */
> +	if (khugepaged_node_load[nid] == (HPAGE_PMD_NR / 2) &&
> +				!node_isset(nid, thp_avail_nodes))
> +		return true;
> +
> +	/*
>  	 * If zone_reclaim_mode is disabled, then no extra effort is made to
>  	 * allocate memory locally.
>  	 */

If khugepaged_node_load for a node doesn't reach HPAGE_PMD_NR / 2, then 
this doesn't cause an abort.  I don't think it's necessary to try to 
optimize and abort the scan early when this is met, I think this should 
only be checked before collapse_huge_page().

> @@ -2356,6 +2364,7 @@ static struct page
>  	if (unlikely(!*hpage)) {
>  		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>  		*hpage = ERR_PTR(-ENOMEM);
> +		node_clear(node, thp_avail_nodes);
>  		return NULL;
>  	}
>  
> @@ -2363,6 +2372,42 @@ static struct page
>  	return *hpage;
>  }
>  
> +/* Return true, if THP should be allocatable on at least one node */
> +static bool khugepaged_check_nodes(struct page **hpage)
> +{
> +	bool ret = false;
> +	int nid;
> +	struct page *newpage = NULL;
> +	gfp_t gfp = alloc_hugepage_gfpmask(khugepaged_defrag());
> +
> +	for_each_online_node(nid) {
> +		if (node_isset(nid, thp_avail_nodes)) {
> +			ret = true;
> +			continue;
> +		}
> +
> +		newpage = alloc_hugepage_node(gfp, nid);
> +
> +		if (newpage) {
> +			node_set(nid, thp_avail_nodes);
> +			ret = true;
> +			/*
> +			 * Heuristic - try to hold on to the page for collapse
> +			 * scanning, if we don't hold any yet.
> +			 */
> +			if (IS_ERR_OR_NULL(*hpage)) {
> +				*hpage = newpage;
> +				//NIXME: should we count all/no allocations?
> +				count_vm_event(THP_COLLAPSE_ALLOC);

Seems like we'd only count the event when the node load has selected a 
target node and the hugepage that is allocated here is used, but if this 
approach is adopted then I think you'll need to introduce a new event to 
track when a hugepage is allocated and subsequently dropped.

> +			} else {
> +				put_page(newpage);
> +			}

Eek, rather than do put_page() why not store a preallocated hugepage for 
every node and let khugepaged_alloc_page() use it?  It would be 
unfortunate that page_to_nid(*hpage) may not equal the target node after 
scanning.

> +		}
> +	}
> +
> +	return ret;
> +}
> +
>  static bool hugepage_vma_check(struct vm_area_struct *vma)
>  {
>  	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
> @@ -2590,6 +2635,10 @@ out_unmap:
>  	pte_unmap_unlock(pte, ptl);
>  	if (ret) {
>  		node = khugepaged_find_target_node();
> +		if (!node_isset(node, thp_avail_nodes)) {
> +			ret = 0;
> +			goto out;
> +		}
>  		/* collapse_huge_page will return with the mmap_sem released */
>  		collapse_huge_page(mm, address, hpage, vma, node);
>  	}
> @@ -2740,12 +2789,16 @@ static int khugepaged_wait_event(void)
>  		kthread_should_stop();
>  }
>  
> -static void khugepaged_do_scan(void)
> +/* Return false if THP allocation failed, true otherwise */
> +static bool khugepaged_do_scan(void)
>  {
>  	struct page *hpage = NULL;
>  	unsigned int progress = 0, pass_through_head = 0;
>  	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
>  
> +	if (!khugepaged_check_nodes(&hpage))
> +		return false;
> +
>  	while (progress < pages) {
>  		cond_resched();
>  
> @@ -2764,14 +2817,14 @@ static void khugepaged_do_scan(void)
>  		spin_unlock(&khugepaged_mm_lock);
>  
>  		/* THP allocation has failed during collapse */
> -		if (IS_ERR(hpage)) {
> -			khugepaged_alloc_sleep();
> -			break;
> -		}
> +		if (IS_ERR(hpage))
> +			return false;
>  	}
>  
>  	if (!IS_ERR_OR_NULL(hpage))
>  		put_page(hpage);
> +
> +	return true;
>  }
>  
>  static void khugepaged_wait_work(void)
> @@ -2800,8 +2853,10 @@ static int khugepaged(void *none)
>  	set_user_nice(current, MAX_NICE);
>  
>  	while (!kthread_should_stop()) {
> -		khugepaged_do_scan();
> -		khugepaged_wait_work();
> +		if (khugepaged_do_scan())
> +			khugepaged_wait_work();
> +		else
> +			khugepaged_alloc_sleep();
>  	}
>  
>  	spin_lock(&khugepaged_mm_lock);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 2/4] mm, thp: khugepaged checks for THP allocability before scanning
  2015-06-18  1:00   ` David Rientjes
@ 2015-06-23 15:41     ` Vlastimil Babka
  0 siblings, 0 replies; 10+ messages in thread
From: Vlastimil Babka @ 2015-06-23 15:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Alex Thorlton

On 06/18/2015 03:00 AM, David Rientjes wrote:
> On Mon, 11 May 2015, Vlastimil Babka wrote:
>
>> Khugepaged could be scanning for collapse candidates uselessly, if it cannot
>> allocate a hugepage in the end. The hugepage preallocation mechanism prevented
>> this, but only for !NUMA configurations. It was removed by the previous patch,
>> and this patch replaces it with a more generic mechanism.
>>
>> The patch itroduces a thp_avail_nodes nodemask, which initially assumes that
>> hugepage can be allocated on any node. Whenever khugepaged fails to allocate
>> a hugepage, it clears the corresponding node bit. Before scanning for collapse
>> candidates, it tries to allocate a hugepage on each online node with the bit
>> cleared, and set it back on success. It tries to hold on to the hugepage if
>> it doesn't hold any other yet. But the assumption is that even if the hugepage
>> is freed back, it should be possible to allocate it in near future without
>> further reclaim and compaction attempts.
>>
>> During the scaning, khugepaged avoids collapsing on nodes with the bit cleared,
>> as soon as possible. If no nodes have hugepages available, scanning is skipped
>> altogether.
>>
>
> I'm not exactly sure what you mean by avoiding to do something as soon as
> possible.

That's referring to the check when node_load is half the pmd size, which 
you want me to remove :)

>> During testing, the patch did not show much difference in preventing
>> thp_collapse_failed events from khugepaged, but this can be attributed to the
>> sync compaction, which only khugepaged is allowed to use, and which is
>> heavyweight enough to succeed frequently enough nowadays. The next patch will
>> however extend the nodemask check to page fault context, where it has much
>> larger impact. Also, with the future plan to convert THP collapsing to
>> task_work context, this patch is a preparation to avoid useless scanning or
>> heavyweight THP allocations in that context.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>   mm/huge_memory.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 63 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 565864b..b86a72a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -102,7 +102,7 @@ struct khugepaged_scan {
>>   static struct khugepaged_scan khugepaged_scan = {
>>   	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
>>   };
>> -
>> +static nodemask_t thp_avail_nodes = NODE_MASK_ALL;
>
> Seems like it should have khugepaged in its name so it's understood that
> the nodemask doesn't need to be synchronized, even though it will later be
> read outside of khugepaged, or at least a comment to say only khugepaged
> can store to it.

After patch 3, bits can be cleared from the mask also outside of 
khugepaged, i.e. when THP allocations fail on page fault.
But, node_set() and node_clear() use the atomic bitmap functions 
set_bit() and clear_bit(), so it is in fact synchronized.

>>
>>   static int set_recommended_min_free_kbytes(void)
>>   {
>> @@ -2273,6 +2273,14 @@ static bool khugepaged_scan_abort(int nid)
>>   	int i;
>>
>>   	/*
>> +	 * If it's clear that we are going to select a node where THP
>> +	 * allocation is unlikely to succeed, abort
>> +	 */
>> +	if (khugepaged_node_load[nid] == (HPAGE_PMD_NR / 2) &&
>> +				!node_isset(nid, thp_avail_nodes))
>> +		return true;
>> +
>> +	/*
>>   	 * If zone_reclaim_mode is disabled, then no extra effort is made to
>>   	 * allocate memory locally.
>>   	 */
>
> If khugepaged_node_load for a node doesn't reach HPAGE_PMD_NR / 2, then
> this doesn't cause an abort.

Yes such situation is also covered.

> I don't think it's necessary to try to
> optimize and abort the scan early when this is met, I think this should
> only be checked before collapse_huge_page().

Avoiding potentially 256 iterations of a loop sounds good to me, no?
The check shouldn't be expensive thanks to short-circuiting the other 
part.:)

>> @@ -2356,6 +2364,7 @@ static struct page
>>   	if (unlikely(!*hpage)) {
>>   		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
>>   		*hpage = ERR_PTR(-ENOMEM);
>> +		node_clear(node, thp_avail_nodes);
>>   		return NULL;
>>   	}
>>
>> @@ -2363,6 +2372,42 @@ static struct page
>>   	return *hpage;
>>   }
>>
>> +/* Return true, if THP should be allocatable on at least one node */
>> +static bool khugepaged_check_nodes(struct page **hpage)
>> +{
>> +	bool ret = false;
>> +	int nid;
>> +	struct page *newpage = NULL;
>> +	gfp_t gfp = alloc_hugepage_gfpmask(khugepaged_defrag());
>> +
>> +	for_each_online_node(nid) {
>> +		if (node_isset(nid, thp_avail_nodes)) {
>> +			ret = true;
>> +			continue;
>> +		}
>> +
>> +		newpage = alloc_hugepage_node(gfp, nid);
>> +
>> +		if (newpage) {
>> +			node_set(nid, thp_avail_nodes);
>> +			ret = true;
>> +			/*
>> +			 * Heuristic - try to hold on to the page for collapse
>> +			 * scanning, if we don't hold any yet.
>> +			 */
>> +			if (IS_ERR_OR_NULL(*hpage)) {
>> +				*hpage = newpage;
>> +				//NIXME: should we count all/no allocations?
>> +				count_vm_event(THP_COLLAPSE_ALLOC);
>
> Seems like we'd only count the event when the node load has selected a
> target node and the hugepage that is allocated here is used, but if this

Yeah even the node preallocation was misleading in this regard (see 
commit log of patch 1).

> approach is adopted then I think you'll need to introduce a new event to
> track when a hugepage is allocated and subsequently dropped.

Alternatively add event for successful collapses (and keep the current 
one for allocations). It is exported now under /sys but having that in 
vmstat would be more consistent.
Then the count of pages subsequently dropped is simply the difference 
between collapse allocations and collapses (with some rather negligible 
amount possibly being held waiting as you suggest below).
I think this approach would be better as we wouldn't change semantic of 
existing THP_COLLAPSE_ALLOC event?

>
>> +			} else {
>> +				put_page(newpage);
>> +			}
>
> Eek, rather than do put_page() why not store a preallocated hugepage for
> every node and let khugepaged_alloc_page() use it?  It would be
> unfortunate that page_to_nid(*hpage) may not equal the target node after
> scanning.

I considered that but were afraid that if those pages' nodes ended up 
not selected, the stored pages would just occupy memory. But maybe I 
could introduce a shrinker for freeing those?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC 3/4] mm, thp: try fault allocations only if we expect them to succeed
  2015-05-11 14:35 [RFC 0/4] Outsourcing page fault THP allocations to khugepaged Vlastimil Babka
  2015-05-11 14:35 ` [RFC 1/4] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
  2015-05-11 14:35 ` [RFC 2/4] mm, thp: khugepaged checks for THP allocability before scanning Vlastimil Babka
@ 2015-05-11 14:35 ` Vlastimil Babka
  2015-06-18  1:20   ` David Rientjes
  2015-05-11 14:35 ` [RFC 4/4] mm, thp: wake up khugepaged when huge page is not available Vlastimil Babka
  3 siblings, 1 reply; 10+ messages in thread
From: Vlastimil Babka @ 2015-05-11 14:35 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Alex Thorlton, David Rientjes, Vlastimil Babka

Since we track THP availability for khugepaged THP collapses, we can use it
also for page fault THP allocations. If khugepaged with its sync compaction
is not able to allocate a hugepage, then it's unlikely that the less involved
attempt on page fault would succeed, and the cost could be higher than THP
benefits. Also clear the THP availability flag if we do attempt and fail to
allocate during page fault, and set the flag if we are freeing a large enough
page from any context. The latter doesn't include merges, as that's a fast
path and unlikely to make much difference.

Also restructure alloc_pages_vma() a bit.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c |  3 ++-
 mm/internal.h    | 39 +++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c   | 37 ++++++++++++++++++++++---------------
 mm/page_alloc.c  |  3 +++
 4 files changed, 66 insertions(+), 16 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b86a72a..d3081a7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -102,7 +102,8 @@ struct khugepaged_scan {
 static struct khugepaged_scan khugepaged_scan = {
 	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
 };
-static nodemask_t thp_avail_nodes = NODE_MASK_ALL;
+
+nodemask_t thp_avail_nodes = NODE_MASK_ALL;
 
 static int set_recommended_min_free_kbytes(void)
 {
diff --git a/mm/internal.h b/mm/internal.h
index a25e359..6d9a711 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -162,6 +162,45 @@ extern bool is_free_buddy_page(struct page *page);
 #endif
 extern int user_min_free_kbytes;
 
+/*
+ * in mm/huge_memory.c
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
+extern nodemask_t thp_avail_nodes;
+
+static inline bool thp_avail_isset(int nid)
+{
+	return node_isset(nid, thp_avail_nodes);
+}
+
+static inline void thp_avail_set(int nid)
+{
+	node_set(nid, thp_avail_nodes);
+}
+
+static inline void thp_avail_clear(int nid)
+{
+	node_clear(nid, thp_avail_nodes);
+}
+
+#else
+
+static inline bool thp_avail_isset(int nid)
+{
+	return true;
+}
+
+static inline void thp_avail_set(int nid)
+{
+}
+
+static inline void thp_avail_clear(int nid)
+{
+}
+
+#endif
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ede2629..41923b0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1963,17 +1963,32 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned long addr, int node, bool hugepage)
 {
 	struct mempolicy *pol;
-	struct page *page;
+	struct page *page = NULL;
 	unsigned int cpuset_mems_cookie;
 	struct zonelist *zl;
 	nodemask_t *nmask;
 
+	/* Help compiler eliminate code */
+	hugepage = IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage;
+
 retry_cpuset:
 	pol = get_vma_policy(vma, addr);
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage &&
-					pol->mode != MPOL_INTERLEAVE)) {
+	if (pol->mode == MPOL_INTERLEAVE) {
+		unsigned nid;
+
+		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
+		mpol_cond_put(pol);
+		if (!hugepage || thp_avail_isset(nid))
+			page = alloc_page_interleave(gfp, order, nid);
+		if (hugepage && !page)
+			thp_avail_clear(nid);
+		goto out;
+	}
+
+	nmask = policy_nodemask(gfp, pol);
+	if (hugepage) {
 		/*
 		 * For hugepage allocation and non-interleave policy which
 		 * allows the current node, we only try to allocate from the
@@ -1983,25 +1998,17 @@ retry_cpuset:
 		 * If the policy is interleave, or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
-		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(node, *nmask)) {
 			mpol_cond_put(pol);
-			page = alloc_pages_exact_node(node,
+			if (thp_avail_isset(node))
+				page = alloc_pages_exact_node(node,
 						gfp | __GFP_THISNODE, order);
+			if (!page)
+				thp_avail_clear(node);
 			goto out;
 		}
 	}
 
-	if (pol->mode == MPOL_INTERLEAVE) {
-		unsigned nid;
-
-		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-		mpol_cond_put(pol);
-		page = alloc_page_interleave(gfp, order, nid);
-		goto out;
-	}
-
-	nmask = policy_nodemask(gfp, pol);
 	zl = policy_zonelist(gfp, pol, node);
 	mpol_cond_put(pol);
 	page = __alloc_pages_nodemask(gfp, order, zl, nmask);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..f7ff90e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -830,6 +830,9 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	set_freepage_migratetype(page, migratetype);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
+			&& order >= HPAGE_PMD_ORDER)
+		thp_avail_set(page_to_nid(page));
 }
 
 void __init __free_pages_bootmem(struct page *page, unsigned int order)
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC 3/4] mm, thp: try fault allocations only if we expect them to succeed
  2015-05-11 14:35 ` [RFC 3/4] mm, thp: try fault allocations only if we expect them to succeed Vlastimil Babka
@ 2015-06-18  1:20   ` David Rientjes
  2015-06-23 16:23     ` Vlastimil Babka
  0 siblings, 1 reply; 10+ messages in thread
From: David Rientjes @ 2015-06-18  1:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Alex Thorlton

On Mon, 11 May 2015, Vlastimil Babka wrote:

> Since we track THP availability for khugepaged THP collapses, we can use it
> also for page fault THP allocations. If khugepaged with its sync compaction
> is not able to allocate a hugepage, then it's unlikely that the less involved
> attempt on page fault would succeed, and the cost could be higher than THP
> benefits. Also clear the THP availability flag if we do attempt and fail to
> allocate during page fault, and set the flag if we are freeing a large enough
> page from any context. The latter doesn't include merges, as that's a fast
> path and unlikely to make much difference.
> 

That depends on how long {scan,alloc}_sleep_millisecs are, so if 
khugepaged fails to allocate a hugepage on all nodes, it sleeps for 
alloc_sleep_millisecs (default 60s), and then there's immediate memory 
freeing, thp page faults don't happen again for 60s.  That's scary to me 
when thp_avail_nodes is clear, a large process terminates, and then 
immediately starts back up.  None of its memory is faulted as thp and 
depending on how large it is, khugepaged may fail to allocate hugepages 
when it wakes back up so it never scans (the only reason why 
thp_avail_nodes was clear before it terminated originally).

I'm not sure that approach can work unless the inference of whether a 
hugepage can be allocated at a given time is a very good indicator of 
whether a hugepage can be allocated alloc_sleep_millisecs later, and I'm 
afraid that's not the case.

I'm very happy that you're looking at thp fault latency and the role that 
khugepaged can play in accepting responsibility for defragmentation, 
though.  It's an area that has caused me some trouble lately and I'd like 
to be able to improve.

We see an immediate benefit when experimenting with doing synchronous 
memory compactions of all memory every 15s.  That's done using a cronjob 
rather than khugepaged, but the idea is the same.

What would your thoughts be about doing something radical like

 - having khugepaged do synchronous memory compaction of all memory at
   regulary intervals,

 - track how many pageblocks are free for thp memory to be allocated,

 - terminate collapsing if free pageblocks are below a threshold,

 - trigger a khugepaged wakeup at page fault when that number of 
   pageblocks falls below a threshold,

 - determine the next full sync memory compaction based on how many
   pageblocks were defragmented on the last wakeup, and

 - avoid memory compaction for all thp page faults.

(I'd ignore what is actually the responsibility of khugepaged and what is 
done in task work at this time.)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 3/4] mm, thp: try fault allocations only if we expect them to succeed
  2015-06-18  1:20   ` David Rientjes
@ 2015-06-23 16:23     ` Vlastimil Babka
  0 siblings, 0 replies; 10+ messages in thread
From: Vlastimil Babka @ 2015-06-23 16:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Alex Thorlton

On 06/18/2015 03:20 AM, David Rientjes wrote:
> On Mon, 11 May 2015, Vlastimil Babka wrote:
>
>> Since we track THP availability for khugepaged THP collapses, we can use it
>> also for page fault THP allocations. If khugepaged with its sync compaction
>> is not able to allocate a hugepage, then it's unlikely that the less involved
>> attempt on page fault would succeed, and the cost could be higher than THP
>> benefits. Also clear the THP availability flag if we do attempt and fail to
>> allocate during page fault, and set the flag if we are freeing a large enough
>> page from any context. The latter doesn't include merges, as that's a fast
>> path and unlikely to make much difference.
>>
>
> That depends on how long {scan,alloc}_sleep_millisecs are, so if
> khugepaged fails to allocate a hugepage on all nodes, it sleeps for
> alloc_sleep_millisecs (default 60s)

Waking up khugepaged earlier is handled in patch 4.

> and then there's immediate memory
> freeing, thp page faults don't happen again for 60s.  That's scary to me
> when thp_avail_nodes is clear, a large process terminates, and then
> immediately starts back up.

The last hunk of this patch makes sure that freeing a >=HPAGE_PMD_ORDER 
page sets the thp availability bit so that scenario should be OK. This 
wouldn't handle merging of free pages to form a large enough page, but 
that should be rare enough to be negligible.

> None of its memory is faulted as thp and
> depending on how large it is, khugepaged may fail to allocate hugepages
> when it wakes back up so it never scans (the only reason why
> thp_avail_nodes was clear before it terminated originally).
>
> I'm not sure that approach can work unless the inference of whether a
> hugepage can be allocated at a given time is a very good indicator of
> whether a hugepage can be allocated alloc_sleep_millisecs later, and I'm
> afraid that's not the case.

So does the explanation above solve the concern?

> I'm very happy that you're looking at thp fault latency and the role that
> khugepaged can play in accepting responsibility for defragmentation,
> though.  It's an area that has caused me some trouble lately and I'd like
> to be able to improve.

Good.

> We see an immediate benefit when experimenting with doing synchronous
> memory compactions of all memory every 15s.  That's done using a cronjob
> rather than khugepaged, but the idea is the same.
>
> What would your thoughts be about doing something radical like
>
>   - having khugepaged do synchronous memory compaction of all memory at
>     regulary intervals,

I'm also thinking towards something like this for some time, yeah. Also 
maybe not khugepaged but per-node "kcompatd" that's handles just the 
compation and not thp collapses.

>   - track how many pageblocks are free for thp memory to be allocated,

That should be easy to determine from free lists already? There are 
per-order counts AFAIK, you just have to sum up over all zones and 
orders between pageblock order and MAX_ORDER (which should be just 1 or 
2 orders).

>   - terminate collapsing if free pageblocks are below a threshold,

Why not.

>   - trigger a khugepaged wakeup at page fault when that number of
>     pageblocks falls below a threshold,
>
>   - determine the next full sync memory compaction based on how many
>     pageblocks were defragmented on the last wakeup, and
>
>   - avoid memory compaction for all thp page faults.

Right. That should also reduce the amount of GFP_TRANSHUGE decisions 
done in the allocator right now...

I think there are more benefits possible when a thread is responsible 
for thorough defragmentation and its activity is tuned appropriately 
(and doesn't depend on the collapse scanning results as it's now the 
case for khugepaged - it won't compact anything on a node if there's 
nothing to collapse there).

- direct compaction can quickly skip a block of memory in migrate 
scanner as soon as it finds a page that cannot be isolated. I had a 
patch for that [1], but dropped it due to longer-term fragmentation 
becoming worse.

- I think that direct compaction could also stop using the current free 
scanner and just get free pages from free lists. In my current testing I 
see that free scanner spends an awful lot of time to find those free 
pages, if we are near the watermarks. I think this approach should work 
better, combined with implementing the previous point:
   - if the free page that came from the free list is within the 
order-aligned block that the migrate scanner is processing, then of 
course we don't use it as migration target. We keep the page aside on a 
list so it can later merge with the pages freed by migration.
   - since getting pages from free lists is done in increasing order 
starting from 0, it would also have some natural antifragmentation 
effects. Right now the free scanner can be easily breaking an order-8 
page to obtain one or few pages as migration targets.

Of course after such modifications direct compaction is no longer truly 
a "compaction", that's why complementing it with the traditional one 
done by a dedicated thread would be needed to avoid regressions in 
long-term fragmentation.

[1] http://www.spinics.net/lists/linux-mm/msg76307.html

> (I'd ignore what is actually the responsibility of khugepaged and what is
> done in task work at this time.)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC 4/4] mm, thp: wake up khugepaged when huge page is not available
  2015-05-11 14:35 [RFC 0/4] Outsourcing page fault THP allocations to khugepaged Vlastimil Babka
                   ` (2 preceding siblings ...)
  2015-05-11 14:35 ` [RFC 3/4] mm, thp: try fault allocations only if we expect them to succeed Vlastimil Babka
@ 2015-05-11 14:35 ` Vlastimil Babka
  3 siblings, 0 replies; 10+ messages in thread
From: Vlastimil Babka @ 2015-05-11 14:35 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Alex Thorlton, David Rientjes, Vlastimil Babka

After previous patch, THP page faults check the thp_avail_nodes nodemask to
determine whether to attempt allocating hugepage or fallback immediately.
The khugepaged task is responsible for attempting reclaim and compaction for
nodes where hugepages are not available, and updating the nodemask as
appropriate.

To get faster reaction on THP allocation failures, we will wake up khugepaged
whenever THP page fault has to fallback. This includes both situations when
hugepage was supposed to be available, but allocation fails, and situations
where hugepage is already marked as unavailable. In the latter case, khugepaged
will not wait according to its alloc_sleep_millisecs parameter under /sys, but
retry allocation immediately. This is done to scale the khugepaged activity
with respect to THP demand, instead of a fixed tunable. Excessive compaction
failures are still being prevented by the self-tuning deferred compaction
mechanism in this case.  For this mechanism to work as intended, the check for
deferred compaction should be done on each THP allocation attempt to bump the
internal counter, and waiting full alloc_sleep_millisecs period could make the
deferred periods excessively long.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c | 22 ++++++++++++++++++----
 mm/internal.h    |  5 +----
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3081a7..b3d08a0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -104,6 +104,15 @@ static struct khugepaged_scan khugepaged_scan = {
 };
 
 nodemask_t thp_avail_nodes = NODE_MASK_ALL;
+static bool khugepaged_thp_requested = false;
+
+void thp_avail_clear(int nid)
+{
+	node_clear(nid, thp_avail_nodes);
+	khugepaged_thp_requested = true;
+	wake_up_interruptible(&khugepaged_wait);
+}
+
 
 static int set_recommended_min_free_kbytes(void)
 {
@@ -2263,7 +2272,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 
 static void khugepaged_alloc_sleep(void)
 {
-	wait_event_freezable_timeout(khugepaged_wait, false,
+	wait_event_freezable_timeout(khugepaged_wait,
+			khugepaged_thp_requested,
 			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 }
 
@@ -2381,6 +2391,8 @@ static bool khugepaged_check_nodes(struct page **hpage)
 	struct page *newpage = NULL;
 	gfp_t gfp = alloc_hugepage_gfpmask(khugepaged_defrag());
 
+	khugepaged_thp_requested = false;
+
 	for_each_online_node(nid) {
 		if (node_isset(nid, thp_avail_nodes)) {
 			ret = true;
@@ -2780,13 +2792,15 @@ breakouterloop_mmap_sem:
 
 static int khugepaged_has_work(void)
 {
-	return !list_empty(&khugepaged_scan.mm_head) &&
+	return (khugepaged_thp_requested ||
+			!list_empty(&khugepaged_scan.mm_head)) &&
 		khugepaged_enabled();
 }
 
 static int khugepaged_wait_event(void)
 {
-	return !list_empty(&khugepaged_scan.mm_head) ||
+	return khugepaged_thp_requested ||
+		!list_empty(&khugepaged_scan.mm_head) ||
 		kthread_should_stop();
 }
 
@@ -2837,7 +2851,7 @@ static void khugepaged_wait_work(void)
 			return;
 
 		wait_event_freezable_timeout(khugepaged_wait,
-					     kthread_should_stop(),
+			khugepaged_thp_requested || kthread_should_stop(),
 			msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
 		return;
 	}
diff --git a/mm/internal.h b/mm/internal.h
index 6d9a711..5c37e4d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -179,10 +179,7 @@ static inline void thp_avail_set(int nid)
 	node_set(nid, thp_avail_nodes);
 }
 
-static inline void thp_avail_clear(int nid)
-{
-	node_clear(nid, thp_avail_nodes);
-}
+extern void thp_avail_clear(int nid);
 
 #else
 
-- 
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-06-23 16:23 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-11 14:35 [RFC 0/4] Outsourcing page fault THP allocations to khugepaged Vlastimil Babka
2015-05-11 14:35 ` [RFC 1/4] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
2015-06-18  0:34   ` David Rientjes
2015-05-11 14:35 ` [RFC 2/4] mm, thp: khugepaged checks for THP allocability before scanning Vlastimil Babka
2015-06-18  1:00   ` David Rientjes
2015-06-23 15:41     ` Vlastimil Babka
2015-05-11 14:35 ` [RFC 3/4] mm, thp: try fault allocations only if we expect them to succeed Vlastimil Babka
2015-06-18  1:20   ` David Rientjes
2015-06-23 16:23     ` Vlastimil Babka
2015-05-11 14:35 ` [RFC 4/4] mm, thp: wake up khugepaged when huge page is not available Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).