* [RFC PATCH 0/4] Configurable fair allocation zone policy
@ 2013-12-12 15:06 Mel Gorman
  2013-12-12 15:06 ` [PATCH 1/4] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Mel Gorman @ 2013-12-12 15:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of how
the page allocator and kswapd interacted on the per-zone LRU lists.

Unfortunately a side-effect missed during review was that it's now very
easy to allocate remote memory on NUMA machines. The problem is that
it is not a simple case of just restoring local allocation policies as
there are genuine reasons why global page aging may be preferable. It's
still a major change to default behaviour so this patch makes the policy
configurable and sets what I think is a sensible default.
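
For anyone unfamiliar with 81c0a2bb, the mechanism is roughly the
following (a condensed sketch of the allocator fast path, not a verbatim
excerpt): each zone is given an allocation batch roughly proportional to
its size and the fast path skips zones whose batch is exhausted, so page
ages stay approximately even across zones.

    for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) {
        /* The fair zone policy only applies in the fast path */
        if (alloc_flags & ALLOC_WMARK_LOW) {
            /* Batch exhausted: spread further allocations to the next zone */
            if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
                continue;
        }
        /* ... watermark checks and the actual allocation follow ... */
    }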

The patches are on top of some NUMA balancing patches currently in -mm.
The first patch in the series is a patch posted by Johannes that must be
taken into account before any of my patches on top. The last patch of the
series is what alters default behaviour and makes the fair zone allocator
policy configurable.

Sniff test results based on following kernels

vanilla		 3.13-rc3 stock
instrument-v5r1  NUMA balancing patches just to rule out any conflicts there
lruslabonly-v1r2 Patch 1 only
local-v1r2	 Full series

kernbench
                          3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                             vanilla       instrument-v5r1      lruslabonly-v1r2            local-v1r2
User    min        1417.32 (  0.00%)     1412.10 (  0.37%)     1408.49 (  0.62%)     1407.41 (  0.70%)
User    mean       1419.10 (  0.00%)     1419.18 ( -0.01%)     1413.36 (  0.40%)     1410.62 (  0.60%)
User    stddev        2.25 (  0.00%)        5.14 (-128.59%)        3.71 (-64.89%)        3.11 (-38.47%)
User    max        1422.92 (  0.00%)     1425.26 ( -0.16%)     1419.83 (  0.22%)     1416.30 (  0.47%)
User    range         5.60 (  0.00%)       13.16 (-135.00%)       11.34 (-102.50%)        8.89 (-58.75%)
System  min         114.83 (  0.00%)      114.69 (  0.12%)      113.76 (  0.93%)      108.45 (  5.56%)
System  mean        115.89 (  0.00%)      115.18 (  0.61%)      114.39 (  1.29%)      108.93 (  6.00%)
System  stddev        0.63 (  0.00%)        0.33 ( 48.39%)        0.65 ( -3.01%)        0.50 ( 21.44%)
System  max         116.81 (  0.00%)      115.65 (  0.99%)      115.55 (  1.08%)      109.76 (  6.04%)
System  range         1.98 (  0.00%)        0.96 ( 51.52%)        1.79 (  9.60%)        1.31 ( 33.84%)
Elapsed min          42.90 (  0.00%)       42.42 (  1.12%)       43.62 ( -1.68%)       42.91 ( -0.02%)
Elapsed mean         43.58 (  0.00%)       43.56 (  0.04%)       44.04 ( -1.05%)       44.30 ( -1.66%)
Elapsed stddev        0.74 (  0.00%)        1.09 (-46.88%)        0.25 ( 66.08%)        1.16 (-56.18%)
Elapsed max          44.52 (  0.00%)       45.36 ( -1.89%)       44.33 (  0.43%)       46.39 ( -4.20%)
Elapsed range         1.62 (  0.00%)        2.94 (-81.48%)        0.71 ( 56.17%)        3.48 (-114.81%)
CPU     min        3451.00 (  0.00%)     3366.00 (  2.46%)     3441.00 (  0.29%)     3269.00 (  5.27%)
CPU     mean       3522.40 (  0.00%)     3523.80 ( -0.04%)     3468.40 (  1.53%)     3431.60 (  2.58%)
CPU     stddev       54.34 (  0.00%)       97.81 (-79.99%)       24.70 ( 54.54%)       89.66 (-64.99%)
CPU     max        3570.00 (  0.00%)     3630.00 ( -1.68%)     3501.00 (  1.93%)     3541.00 (  0.81%)
CPU     range       119.00 (  0.00%)      264.00 (-121.85%)       60.00 ( 49.58%)      272.00 (-128.57%)

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
              vanilla  instrument-v5r1  lruslabonly-v1r2  local-v1r2
User         8540.49     8535.44     8502.52     8490.02
System        706.31      701.29      697.45      664.39
Elapsed       307.58      309.38      311.90      311.43

The kernbench figures themselves are not that compelling but the system CPU
cost is down a lot. It's just such a small percentage of the overall workload
that it doesn't really matter and the processes are short-lived anyway.

                            3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
                        vanilla  instrument-v5r1  lruslabonly-v1r2  local-v1r2
NUMA alloc hit                73783951    73094711    73540917    93365205
NUMA alloc miss               20013534    20280058    19805156           0
NUMA interleave hit                  0           0           0           0
NUMA alloc local              73783935    73094693    73540908    93365198

NUMA miss rate speaks for itself.

vmr-stream
                                3.13.0-rc3                  3.13.0-rc3                  3.13.0-rc3                  3.13.0-rc3
                                   vanilla             instrument-v5r1            lruslabonly-v1r2                  local-v1r2
Add      5M        3809.80 (  0.00%)     3793.23 ( -0.44%)     3808.76 ( -0.03%)     3997.69 (  4.93%)
Copy     5M        3360.75 (  0.00%)     3362.61 (  0.06%)     3367.19 (  0.19%)     3478.45 (  3.50%)
Scale    5M        3160.39 (  0.00%)     3151.84 ( -0.27%)     3159.05 ( -0.04%)     3399.14 (  7.55%)
Triad    5M        3533.04 (  0.00%)     3523.70 ( -0.26%)     3536.32 (  0.09%)     3858.46 (  9.21%)
Add      7M        3789.82 (  0.00%)     3796.51 (  0.18%)     3799.61 (  0.26%)     4029.79 (  6.33%)
Copy     7M        3345.85 (  0.00%)     3358.16 (  0.37%)     3353.81 (  0.24%)     3483.16 (  4.10%)
Scale    7M        3176.00 (  0.00%)     3161.42 ( -0.46%)     3161.61 ( -0.45%)     3403.88 (  7.17%)
Triad    7M        3528.85 (  0.00%)     3530.45 (  0.05%)     3533.71 (  0.14%)     3856.90 (  9.30%)
Add      8M        3801.60 (  0.00%)     3813.84 (  0.32%)     3811.72 (  0.27%)     3976.81 (  4.61%)
Copy     8M        3364.64 (  0.00%)     3365.61 (  0.03%)     3362.38 ( -0.07%)     3473.99 (  3.25%)
Scale    8M        3169.34 (  0.00%)     3173.77 (  0.14%)     3160.40 ( -0.28%)     3396.07 (  7.15%)
Triad    8M        3531.38 (  0.00%)     3539.19 (  0.22%)     3536.68 (  0.15%)     3854.70 (  9.16%)
Add      10M       3807.95 (  0.00%)     3798.47 ( -0.25%)     3788.44 ( -0.51%)     4003.61 (  5.14%)
Copy     10M       3365.64 (  0.00%)     3363.00 ( -0.08%)     3355.89 ( -0.29%)     3477.50 (  3.32%)
Scale    10M       3172.71 (  0.00%)     3177.81 (  0.16%)     3165.05 ( -0.24%)     3397.21 (  7.08%)
Triad    10M       3536.15 (  0.00%)     3534.21 ( -0.05%)     3523.98 ( -0.34%)     3857.77 (  9.10%)
Add      14M       3787.56 (  0.00%)     3797.21 (  0.25%)     3797.02 (  0.25%)     4003.90 (  5.71%)
Copy     14M       3345.19 (  0.00%)     3346.86 (  0.05%)     3355.17 (  0.30%)     3477.81 (  3.96%)
Scale    14M       3154.55 (  0.00%)     3169.49 (  0.47%)     3161.52 (  0.22%)     3397.43 (  7.70%)
Triad    14M       3522.09 (  0.00%)     3533.46 (  0.32%)     3526.82 (  0.13%)     3857.32 (  9.52%)
Add      17M       3806.34 (  0.00%)     3803.44 ( -0.08%)     3786.03 ( -0.53%)     4008.76 (  5.32%)
Copy     17M       3368.39 (  0.00%)     3364.10 ( -0.13%)     3353.70 ( -0.44%)     3482.19 (  3.38%)
Scale    17M       3169.18 (  0.00%)     3170.80 (  0.05%)     3169.06 ( -0.00%)     3401.51 (  7.33%)
Triad    17M       3535.05 (  0.00%)     3536.79 (  0.05%)     3521.98 ( -0.37%)     3863.29 (  9.29%)
Add      21M       3795.31 (  0.00%)     3808.91 (  0.36%)     3797.88 (  0.07%)     3996.53 (  5.30%)
Copy     21M       3353.43 (  0.00%)     3360.01 (  0.20%)     3357.44 (  0.12%)     3477.10 (  3.69%)
Scale    21M       3160.96 (  0.00%)     3164.94 (  0.13%)     3154.44 ( -0.21%)     3400.94 (  7.59%)
Triad    21M       3530.45 (  0.00%)     3540.10 (  0.27%)     3527.00 ( -0.10%)     3858.31 (  9.29%)
Add      28M       3803.11 (  0.00%)     3792.40 ( -0.28%)     3786.70 ( -0.43%)     4003.07 (  5.26%)
Copy     28M       3361.16 (  0.00%)     3363.44 (  0.07%)     3357.54 ( -0.11%)     3475.82 (  3.41%)
Scale    28M       3160.43 (  0.00%)     3148.44 ( -0.38%)     3157.26 ( -0.10%)     3398.14 (  7.52%)
Triad    28M       3533.66 (  0.00%)     3517.45 ( -0.46%)     3525.84 ( -0.22%)     3856.66 (  9.14%)
Add      35M       3792.86 (  0.00%)     3795.61 (  0.07%)     3794.84 (  0.05%)     4009.65 (  5.72%)
Copy     35M       3344.24 (  0.00%)     3356.56 (  0.37%)     3351.18 (  0.21%)     3484.09 (  4.18%)
Scale    35M       3160.14 (  0.00%)     3155.12 ( -0.16%)     3164.29 (  0.13%)     3401.23 (  7.63%)
Triad    35M       3531.94 (  0.00%)     3523.24 ( -0.25%)     3523.71 ( -0.23%)     3861.73 (  9.34%)
Add      42M       3803.39 (  0.00%)     3777.52 ( -0.68%)     3789.36 ( -0.37%)     4014.25 (  5.54%)
Copy     42M       3360.64 (  0.00%)     3351.85 ( -0.26%)     3348.05 ( -0.37%)     3484.10 (  3.67%)
Scale    42M       3158.64 (  0.00%)     3159.51 (  0.03%)     3157.44 ( -0.04%)     3400.86 (  7.67%)
Triad    42M       3529.99 (  0.00%)     3515.82 ( -0.40%)     3527.70 ( -0.06%)     3860.66 (  9.37%)
Add      56M       3778.07 (  0.00%)     3806.79 (  0.76%)     3789.54 (  0.30%)     3984.74 (  5.47%)
Copy     56M       3348.68 (  0.00%)     3361.92 (  0.40%)     3282.70 ( -1.97%)     3473.71 (  3.73%)
Scale    56M       3169.25 (  0.00%)     3160.16 ( -0.29%)     3097.85 ( -2.25%)     3394.91 (  7.12%)
Triad    56M       3517.62 (  0.00%)     3534.72 (  0.49%)     3529.84 (  0.35%)     3853.46 (  9.55%)
Add      71M       3811.71 (  0.00%)     3785.42 ( -0.69%)     3786.40 ( -0.66%)     3975.32 (  4.29%)
Copy     71M       3370.59 (  0.00%)     3350.70 ( -0.59%)     3351.49 ( -0.57%)     3476.33 (  3.14%)
Scale    71M       3168.70 (  0.00%)     3162.75 ( -0.19%)     3172.31 (  0.11%)     3397.66 (  7.23%)
Triad    71M       3536.14 (  0.00%)     3522.81 ( -0.38%)     3525.17 ( -0.31%)     3855.20 (  9.02%)
Add      85M       3805.94 (  0.00%)     3796.04 ( -0.26%)     3793.99 ( -0.31%)     4024.25 (  5.74%)
Copy     85M       3354.76 (  0.00%)     3355.38 (  0.02%)     3364.42 (  0.29%)     3482.99 (  3.82%)
Scale    85M       3162.20 (  0.00%)     3171.71 (  0.30%)     3146.74 ( -0.49%)     3405.10 (  7.68%)
Triad    85M       3538.76 (  0.00%)     3528.62 ( -0.29%)     3524.00 ( -0.42%)     3857.08 (  9.00%)
Add      113M      3803.66 (  0.00%)     3791.42 ( -0.32%)     3802.68 ( -0.03%)     4050.85 (  6.50%)
Copy     113M      3348.32 (  0.00%)     3363.66 (  0.46%)     3355.31 (  0.21%)     3488.14 (  4.18%)
Scale    113M      3177.09 (  0.00%)     3167.40 ( -0.30%)     3160.93 ( -0.51%)     3399.56 (  7.00%)
Triad    113M      3536.06 (  0.00%)     3529.99 ( -0.17%)     3529.75 ( -0.18%)     3860.58 (  9.18%)
Add      142M      3814.65 (  0.00%)     3795.83 ( -0.49%)     3794.67 ( -0.52%)     4001.91 (  4.91%)
Copy     142M      3353.31 (  0.00%)     3357.70 (  0.13%)     3362.35 (  0.27%)     3483.25 (  3.87%)
Scale    142M      3186.05 (  0.00%)     3156.22 ( -0.94%)     3149.30 ( -1.15%)     3403.12 (  6.81%)
Triad    142M      3545.41 (  0.00%)     3526.16 ( -0.54%)     3523.67 ( -0.61%)     3864.64 (  9.00%)
Add      170M      3787.71 (  0.00%)     3788.86 (  0.03%)     3812.66 (  0.66%)     3990.36 (  5.35%)
Copy     170M      3351.50 (  0.00%)     3353.34 (  0.05%)     3368.86 (  0.52%)     3480.04 (  3.84%)
Scale    170M      3158.38 (  0.00%)     3165.12 (  0.21%)     3163.39 (  0.16%)     3399.74 (  7.64%)
Triad    170M      3521.84 (  0.00%)     3527.88 (  0.17%)     3538.46 (  0.47%)     3859.29 (  9.58%)
Add      227M      3794.46 (  0.00%)     3804.21 (  0.26%)     3789.75 ( -0.12%)     3996.34 (  5.32%)
Copy     227M      3368.15 (  0.00%)     3365.69 ( -0.07%)     3353.55 ( -0.43%)     3477.20 (  3.24%)
Scale    227M      3160.18 (  0.00%)     3155.38 ( -0.15%)     3152.46 ( -0.24%)     3408.65 (  7.86%)
Triad    227M      3525.39 (  0.00%)     3532.53 (  0.20%)     3518.85 ( -0.19%)     3857.57 (  9.42%)
Add      284M      3804.29 (  0.00%)     3806.62 (  0.06%)     3798.59 ( -0.15%)     3957.29 (  4.02%)
Copy     284M      3366.21 (  0.00%)     3355.53 ( -0.32%)     3362.62 ( -0.11%)     3469.98 (  3.08%)
Scale    284M      3174.61 (  0.00%)     3161.86 ( -0.40%)     3171.81 ( -0.09%)     3394.82 (  6.94%)
Triad    284M      3538.50 (  0.00%)     3535.29 ( -0.09%)     3532.22 ( -0.18%)     3851.68 (  8.85%)
Add      341M      3805.26 (  0.00%)     3788.76 ( -0.43%)     3787.26 ( -0.47%)     3977.29 (  4.52%)
Copy     341M      3366.98 (  0.00%)     3361.62 ( -0.16%)     3357.70 ( -0.28%)     3471.49 (  3.10%)
Scale    341M      3159.11 (  0.00%)     3157.50 ( -0.05%)     3150.75 ( -0.26%)     3396.89 (  7.53%)
Triad    341M      3530.80 (  0.00%)     3522.61 ( -0.23%)     3518.81 ( -0.34%)     3854.18 (  9.16%)
Add      455M      3791.15 (  0.00%)     3794.25 (  0.08%)     3796.21 (  0.13%)     4023.51 (  6.13%)
Copy     455M      3353.30 (  0.00%)     3356.46 (  0.09%)     3356.24 (  0.09%)     3483.58 (  3.88%)
Scale    455M      3161.21 (  0.00%)     3163.74 (  0.08%)     3156.56 ( -0.15%)     3400.35 (  7.56%)
Triad    455M      3527.90 (  0.00%)     3526.21 ( -0.05%)     3523.52 ( -0.12%)     3859.30 (  9.39%)
Add      568M      3779.79 (  0.00%)     3791.20 (  0.30%)     3794.86 (  0.40%)     4030.14 (  6.62%)
Copy     568M      3349.93 (  0.00%)     3354.29 (  0.13%)     3349.21 ( -0.02%)     3481.71 (  3.93%)
Scale    568M      3163.69 (  0.00%)     3161.94 ( -0.06%)     3168.22 (  0.14%)     3399.29 (  7.45%)
Triad    568M      3518.65 (  0.00%)     3526.50 (  0.22%)     3532.29 (  0.39%)     3857.38 (  9.63%)
Add      682M      3801.06 (  0.00%)     3807.09 (  0.16%)     3803.26 (  0.06%)     3995.04 (  5.10%)
Copy     682M      3363.64 (  0.00%)     3365.88 (  0.07%)     3363.97 (  0.01%)     3475.74 (  3.33%)
Scale    682M      3151.89 (  0.00%)     3169.84 (  0.57%)     3162.50 (  0.34%)     3400.35 (  7.88%)
Triad    682M      3528.97 (  0.00%)     3535.76 (  0.19%)     3530.60 (  0.05%)     3860.42 (  9.39%)
Add      910M      3778.97 (  0.00%)     3784.80 (  0.15%)     3782.46 (  0.09%)     3965.85 (  4.95%)
Copy     910M      3345.09 (  0.00%)     3347.40 (  0.07%)     3354.06 (  0.27%)     3471.09 (  3.77%)
Scale    910M      3164.46 (  0.00%)     3159.83 ( -0.15%)     3147.78 ( -0.53%)     3392.25 (  7.20%)
Triad    910M      3516.19 (  0.00%)     3518.54 (  0.07%)     3516.70 (  0.01%)     3848.91 (  9.46%)
Add      1137M     3812.17 (  0.00%)     3808.22 ( -0.10%)     3794.70 ( -0.46%)     3969.04 (  4.11%)
Copy     1137M     3367.52 (  0.00%)     3380.77 (  0.39%)     3353.99 ( -0.40%)     3473.60 (  3.15%)
Scale    1137M     3158.62 (  0.00%)     3160.72 (  0.07%)     3159.71 (  0.03%)     3397.40 (  7.56%)
Triad    1137M     3536.97 (  0.00%)     3533.26 ( -0.10%)     3522.25 ( -0.42%)     3856.24 (  9.03%)
Add      1365M     3806.51 (  0.00%)     3799.39 ( -0.19%)     3785.71 ( -0.55%)     3965.55 (  4.18%)
Copy     1365M     3360.43 (  0.00%)     3356.22 ( -0.12%)     3346.74 ( -0.41%)     3469.44 (  3.24%)
Scale    1365M     3155.95 (  0.00%)     3160.66 (  0.15%)     3163.20 (  0.23%)     3392.79 (  7.50%)
Triad    1365M     3534.18 (  0.00%)     3538.56 (  0.12%)     3524.20 ( -0.28%)     3849.29 (  8.92%)
Add      1820M     3797.86 (  0.00%)     3801.04 (  0.08%)     3796.84 ( -0.03%)     4014.92 (  5.72%)
Copy     1820M     3362.09 (  0.00%)     3360.66 ( -0.04%)     3352.27 ( -0.29%)     3483.55 (  3.61%)
Scale    1820M     3170.20 (  0.00%)     3159.70 ( -0.33%)     3159.59 ( -0.33%)     3400.90 (  7.28%)
Triad    1820M     3531.00 (  0.00%)     3531.65 (  0.02%)     3528.99 ( -0.06%)     3862.69 (  9.39%)
Add      2275M     3810.31 (  0.00%)     3797.19 ( -0.34%)     3785.76 ( -0.64%)     3913.43 (  2.71%)
Copy     2275M     3373.60 (  0.00%)     3355.79 ( -0.53%)     3340.74 ( -0.97%)     3456.16 (  2.45%)
Scale    2275M     3174.64 (  0.00%)     3157.28 ( -0.55%)     3150.71 ( -0.75%)     3383.35 (  6.57%)
Triad    2275M     3537.57 (  0.00%)     3529.60 ( -0.23%)     3518.91 ( -0.53%)     3837.46 (  8.48%)
Add      2730M     3801.09 (  0.00%)     3796.96 ( -0.11%)     3800.62 ( -0.01%)     4008.15 (  5.45%)
Copy     2730M     3357.18 (  0.00%)     3351.88 ( -0.16%)     3358.55 (  0.04%)     3482.93 (  3.75%)
Scale    2730M     3177.66 (  0.00%)     3159.95 ( -0.56%)     3167.56 ( -0.32%)     3401.39 (  7.04%)
Triad    2730M     3539.59 (  0.00%)     3532.29 ( -0.21%)     3531.57 ( -0.23%)     3863.61 (  9.15%)
Add      3640M     3816.88 (  0.00%)     3809.59 ( -0.19%)     3805.49 ( -0.30%)     3991.09 (  4.56%)
Copy     3640M     3375.91 (  0.00%)     3367.14 ( -0.26%)     3349.94 ( -0.77%)     3477.90 (  3.02%)
Scale    3640M     3167.22 (  0.00%)     3167.15 ( -0.00%)     3166.88 ( -0.01%)     3398.62 (  7.31%)
Triad    3640M     3546.45 (  0.00%)     3536.31 ( -0.29%)     3539.15 ( -0.21%)     3860.10 (  8.84%)
Add      4551M     3799.05 (  0.00%)     3778.41 ( -0.54%)     3784.31 ( -0.39%)     3976.60 (  4.67%)
Copy     4551M     3355.66 (  0.00%)     3351.03 ( -0.14%)     3355.51 ( -0.00%)     3482.15 (  3.77%)
Scale    4551M     3171.91 (  0.00%)     3156.10 ( -0.50%)     3166.90 ( -0.16%)     3401.11 (  7.23%)
Triad    4551M     3531.61 (  0.00%)     3514.39 ( -0.49%)     3516.99 ( -0.41%)     3861.99 (  9.35%)
Add      5461M     3801.60 (  0.00%)     3807.33 (  0.15%)     3810.09 (  0.22%)     3950.47 (  3.92%)
Copy     5461M     3360.29 (  0.00%)     3372.50 (  0.36%)     3357.41 ( -0.09%)     3470.96 (  3.29%)
Scale    5461M     3161.18 (  0.00%)     3159.49 ( -0.05%)     3163.35 (  0.07%)     3394.10 (  7.37%)
Triad    5461M     3532.35 (  0.00%)     3534.62 (  0.06%)     3539.14 (  0.19%)     3852.67 (  9.07%)
Add      7281M     3800.80 (  0.00%)     3805.50 (  0.12%)     3787.10 ( -0.36%)     4042.38 (  6.36%)
Copy     7281M     3359.99 (  0.00%)     3362.34 (  0.07%)     3354.09 ( -0.18%)     3487.91 (  3.81%)
Scale    7281M     3168.68 (  0.00%)     3165.30 ( -0.11%)     3154.04 ( -0.46%)     3400.69 (  7.32%)
Triad    7281M     3533.59 (  0.00%)     3537.71 (  0.12%)     3518.15 ( -0.44%)     3862.47 (  9.31%)
Add      9102M     3790.67 (  0.00%)     3797.98 (  0.19%)     3808.76 (  0.48%)     3995.83 (  5.41%)
Copy     9102M     3345.80 (  0.00%)     3360.87 (  0.45%)     3360.86 (  0.45%)     3477.97 (  3.95%)
Scale    9102M     3174.65 (  0.00%)     3160.05 ( -0.46%)     3164.47 ( -0.32%)     3399.24 (  7.07%)
Triad    9102M     3529.51 (  0.00%)     3533.84 (  0.12%)     3533.79 (  0.12%)     3859.38 (  9.35%)
Add      10922M     3807.96 (  0.00%)     3803.49 ( -0.12%)     3809.65 (  0.04%)     4002.50 (  5.11%)
Copy     10922M     3350.99 (  0.00%)     3352.21 (  0.04%)     3359.79 (  0.26%)     3477.24 (  3.77%)
Scale    10922M     3164.74 (  0.00%)     3170.89 (  0.19%)     3167.50 (  0.09%)     3395.79 (  7.30%)
Triad    10922M     3536.69 (  0.00%)     3532.25 ( -0.13%)     3538.25 (  0.04%)     3856.99 (  9.06%)
Add      14563M     3786.28 (  0.00%)     3770.86 ( -0.41%)     3789.66 (  0.09%)     3988.87 (  5.35%)
Copy     14563M     3352.51 (  0.00%)     3339.40 ( -0.39%)     3351.25 ( -0.04%)     3479.41 (  3.79%)
Scale    14563M     3171.95 (  0.00%)     3151.73 ( -0.64%)     3154.62 ( -0.55%)     3399.56 (  7.18%)
Triad    14563M     3522.50 (  0.00%)     3511.16 ( -0.32%)     3521.94 ( -0.02%)     3858.30 (  9.53%)
Add      18204M     3809.56 (  0.00%)     3799.32 ( -0.27%)     3800.40 ( -0.24%)     3975.64 (  4.36%)
Copy     18204M     3365.06 (  0.00%)     3360.08 ( -0.15%)     3360.78 ( -0.13%)     3478.50 (  3.37%)
Scale    18204M     3171.25 (  0.00%)     3147.35 ( -0.75%)     3160.06 ( -0.35%)     3402.14 (  7.28%)
Triad    18204M     3539.90 (  0.00%)     3526.72 ( -0.37%)     3529.69 ( -0.29%)     3863.45 (  9.14%)
Add      21845M     3798.46 (  0.00%)     3775.06 ( -0.62%)     3800.56 (  0.06%)     3971.45 (  4.55%)
Copy     21845M     3362.14 (  0.00%)     3354.93 ( -0.21%)     3358.40 ( -0.11%)     3468.39 (  3.16%)
Scale    21845M     3170.99 (  0.00%)     3164.52 ( -0.20%)     3167.92 ( -0.10%)     3391.02 (  6.94%)
Triad    21845M     3534.49 (  0.00%)     3511.51 ( -0.65%)     3534.65 (  0.00%)     3847.34 (  8.85%)
Add      29127M     3819.69 (  0.00%)     3809.20 ( -0.27%)     3798.24 ( -0.56%)     4004.57 (  4.84%)
Copy     29127M     3384.67 (  0.00%)     3365.17 ( -0.58%)     3353.97 ( -0.91%)     3478.36 (  2.77%)
Scale    29127M     3158.68 (  0.00%)     3162.35 (  0.12%)     3171.84 (  0.42%)     3396.96 (  7.54%)
Triad    29127M     3538.17 (  0.00%)     3539.05 (  0.02%)     3530.30 ( -0.22%)     3854.82 (  8.95%)
Add      36408M     3806.95 (  0.00%)     3796.64 ( -0.27%)     3802.86 ( -0.11%)     4014.22 (  5.44%)
Copy     36408M     3361.11 (  0.00%)     3358.35 ( -0.08%)     3358.30 ( -0.08%)     3481.66 (  3.59%)
Scale    36408M     3165.87 (  0.00%)     3165.94 (  0.00%)     3176.78 (  0.34%)     3400.27 (  7.40%)
Triad    36408M     3536.86 (  0.00%)     3529.81 ( -0.20%)     3538.19 (  0.04%)     3862.39 (  9.20%)
Add      43690M     3799.39 (  0.00%)     3806.60 (  0.19%)     3803.09 (  0.10%)     3989.60 (  5.01%)
Copy     43690M     3359.26 (  0.00%)     3384.76 (  0.76%)     3359.11 ( -0.00%)     3478.31 (  3.54%)
Scale    43690M     3175.35 (  0.00%)     3164.08 ( -0.36%)     3161.71 ( -0.43%)     3400.39 (  7.09%)
Triad    43690M     3535.26 (  0.00%)     3534.77 ( -0.01%)     3531.62 ( -0.10%)     3861.40 (  9.23%)
Add      58254M     3799.66 (  0.00%)     3809.97 (  0.27%)     3800.36 (  0.02%)     3993.44 (  5.10%)
Copy     58254M     3355.12 (  0.00%)     3367.42 (  0.37%)     3357.51 (  0.07%)     3485.58 (  3.89%)
Scale    58254M     3170.94 (  0.00%)     3165.55 ( -0.17%)     3170.76 ( -0.01%)     3406.36 (  7.42%)
Triad    58254M     3537.26 (  0.00%)     3539.78 (  0.07%)     3528.61 ( -0.24%)     3867.25 (  9.33%)
Add      72817M     3815.26 (  0.00%)     3798.60 ( -0.44%)     3802.47 ( -0.34%)     4017.50 (  5.30%)
Copy     72817M     3362.18 (  0.00%)     3355.17 ( -0.21%)     3356.95 ( -0.16%)     3484.11 (  3.63%)
Scale    72817M     3175.73 (  0.00%)     3155.96 ( -0.62%)     3162.10 ( -0.43%)     3399.64 (  7.05%)
Triad    72817M     3546.44 (  0.00%)     3528.61 ( -0.50%)     3531.39 ( -0.42%)     3860.93 (  8.87%)
Add      87381M     3519.93 (  0.00%)     3511.38 ( -0.24%)     3501.07 ( -0.54%)     3842.46 (  9.16%)
Copy     87381M     3175.29 (  0.00%)     3168.75 ( -0.21%)     3166.12 ( -0.29%)     3271.07 (  3.02%)
Scale    87381M     2848.76 (  0.00%)     2842.46 ( -0.22%)     2840.72 ( -0.28%)     3184.16 ( 11.77%)
Triad    87381M     3465.19 (  0.00%)     3461.85 ( -0.10%)     3451.36 ( -0.40%)     3786.76 (  9.28%)

This is a memory streaming benchmark that makes the remote costs a bit
more visible.

                            3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
                        vanilla  instrument-v5r1  lruslabonly-v1r2  local-v1r2
NUMA alloc hit                 1238820     1347097     1432817     2103498
NUMA alloc miss                 691541      757204      667484           0
NUMA interleave hit                  0           0           0           0
NUMA alloc local               1238815     1347095     1432815     2103493
NUMA page range updates       24916702    24987450    25104153    24929595
NUMA huge PMD updates            48025       48138       48364       48025
NUMA PTE updates                375927      388932      390149      388820
NUMA hint faults                373397       48138       48364       48025
NUMA hint local faults          142051       12653       12667       48025
NUMA hint local percent             38          26          26         100
NUMA pages migrated              83407       68608       86528           0
AutoNUMA cost                     2042         416         419         414

NUMA miss rates speak for themselves. I also included the NUMA balancing
stats; the number of hinting faults that are local and the number of pages
migrated are also interesting.

pft
                        3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                           vanilla       instrument-v5r1      lruslabonly-v1r2            local-v1r2
User       1       0.6980 (  0.00%)       0.7090 ( -1.58%)       0.7210 ( -3.30%)       0.6760 (  3.15%)
User       2       0.7040 (  0.00%)       0.6970 (  0.99%)       0.6640 (  5.68%)       0.6590 (  6.39%)
User       3       0.6910 (  0.00%)       0.7270 ( -5.21%)       0.7450 ( -7.81%)       0.7070 ( -2.32%)
User       4       0.7250 (  0.00%)       0.7160 (  1.24%)       0.7260 ( -0.14%)       0.7530 ( -3.86%)
User       5       0.7590 (  0.00%)       0.7790 ( -2.64%)       0.7960 ( -4.87%)       0.7610 ( -0.26%)
User       6       0.8130 (  0.00%)       0.8180 ( -0.62%)       0.7860 (  3.32%)       0.8030 (  1.23%)
User       7       0.8210 (  0.00%)       0.8240 ( -0.37%)       0.8050 (  1.95%)       0.7690 (  6.33%)
User       8       0.8390 (  0.00%)       0.8410 ( -0.24%)       0.7870 (  6.20%)       0.7780 (  7.27%)
System     1       9.1230 (  0.00%)       9.0640 (  0.65%)       9.6980 ( -6.30%)       8.5410 (  6.38%)
System     2       9.3990 (  0.00%)       9.3630 (  0.38%)       9.6880 ( -3.07%)       8.5570 (  8.96%)
System     3       9.1460 (  0.00%)       9.0930 (  0.58%)       9.3010 ( -1.69%)       8.6700 (  5.20%)
System     4       8.9160 (  0.00%)       8.8630 (  0.59%)       8.9340 ( -0.20%)       8.7370 (  2.01%)
System     5       9.5900 (  0.00%)       9.4450 (  1.51%)       9.5940 ( -0.04%)       8.8960 (  7.24%)
System     6       9.8640 (  0.00%)       9.7130 (  1.53%)       9.9510 ( -0.88%)       9.1420 (  7.32%)
System     7       9.9860 (  0.00%)       9.9050 (  0.81%)       9.9830 (  0.03%)       9.2490 (  7.38%)
System     8       9.8570 (  0.00%)      10.0090 ( -1.54%)      10.0430 ( -1.89%)       9.3030 (  5.62%)
Elapsed    1       9.8240 (  0.00%)       9.7790 (  0.46%)      10.4280 ( -6.15%)       9.2280 (  6.07%)
Elapsed    2       5.0870 (  0.00%)       5.0480 (  0.77%)       5.2190 ( -2.59%)       4.6360 (  8.87%)
Elapsed    3       3.3220 (  0.00%)       3.3040 (  0.54%)       3.3670 ( -1.35%)       3.1430 (  5.39%)
Elapsed    4       2.4440 (  0.00%)       2.4340 (  0.41%)       2.4450 ( -0.04%)       2.4010 (  1.76%)
Elapsed    5       2.1500 (  0.00%)       2.1340 (  0.74%)       2.1590 ( -0.42%)       2.0020 (  6.88%)
Elapsed    6       1.8290 (  0.00%)       1.8110 (  0.98%)       1.8460 ( -0.93%)       1.6910 (  7.55%)
Elapsed    7       1.5760 (  0.00%)       1.5740 (  0.13%)       1.5600 (  1.02%)       1.4570 (  7.55%)
Elapsed    8       1.3660 (  0.00%)       1.3750 ( -0.66%)       1.3840 ( -1.32%)       1.2720 (  6.88%)
Faults/cpu 1  336505.5875 (  0.00%)  338169.9002 (  0.49%)  317186.6996 ( -5.74%)  358456.5721 (  6.52%)
Faults/cpu 2  327139.2186 (  0.00%)  328492.4614 (  0.41%)  319274.5257 ( -2.40%)  358628.2150 (  9.63%)
Faults/cpu 3  336004.1324 (  0.00%)  336567.6552 (  0.17%)  328975.0655 ( -2.09%)  352460.9626 (  4.90%)
Faults/cpu 4  342824.1564 (  0.00%)  345092.8897 (  0.66%)  342110.6189 ( -0.21%)  348245.2828 (  1.58%)
Faults/cpu 5  319553.7707 (  0.00%)  323342.1439 (  1.19%)  318221.0947 ( -0.42%)  342196.2266 (  7.09%)
Faults/cpu 6  309614.5554 (  0.00%)  313909.9679 (  1.39%)  307872.9151 ( -0.56%)  332404.7055 (  7.36%)
Faults/cpu 7  306159.2969 (  0.00%)  308038.1690 (  0.61%)  306307.5499 (  0.05%)  329872.4584 (  7.75%)
Faults/cpu 8  309077.4966 (  0.00%)  304874.1843 ( -1.36%)  305342.7590 ( -1.21%)  328041.6604 (  6.14%)
Faults/sec 1  336364.5575 (  0.00%)  337965.2381 (  0.48%)  316902.8780 ( -5.79%)  358111.9993 (  6.47%)
Faults/sec 2  649713.2290 (  0.00%)  654535.5476 (  0.74%)  633369.6295 ( -2.52%)  712772.6198 (  9.71%)
Faults/sec 3  994812.3119 (  0.00%) 1000190.1734 (  0.54%)  981316.5256 ( -1.36%) 1051712.4141 (  5.72%)
Faults/sec 4 1352137.4832 (  0.00%) 1359242.0027 (  0.53%) 1351401.4285 ( -0.05%) 1376465.2565 (  1.80%)
Faults/sec 5 1538115.0421 (  0.00%) 1550443.5505 (  0.80%) 1530614.5827 ( -0.49%) 1651864.2216 (  7.40%)
Faults/sec 6 1807211.7324 (  0.00%) 1826306.7214 (  1.06%) 1790976.0367 ( -0.90%) 1955407.8574 (  8.20%)
Faults/sec 7 2101840.1872 (  0.00%) 2101627.1857 ( -0.01%) 2117333.0681 (  0.74%) 2269862.7330 (  7.99%)
Faults/sec 8 2421813.7208 (  0.00%) 2407803.4867 ( -0.58%) 2393045.9288 ( -1.19%) 2601789.0136 (  7.43%)

Local allocations help the fault rate microbenchmark.

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
              vanilla  instrument-v5r1  lruslabonly-v1r2  local-v1r2
User           60.57       62.13       61.33       60.06
System        868.16      862.63      881.63      810.32
Elapsed       336.19      336.05      346.70      317.59

System CPU is down; presumably the system CPU usage drops are related to
the zeroing of memory.

                            3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
                        vanilla  instrument-v5r1  lruslabonly-v1r2  local-v1r2
NUMA alloc hit               187243902   187602290   188017672   264999137
NUMA alloc miss               77736695    77400777    76985460           0
NUMA interleave hit                  0           0           0           0
NUMA alloc local             187243902   187602290   188017672   264999135
NUMA page range updates      136246380   135333180   468517162   425261524
NUMA huge PMD updates                0           0           0           0
NUMA PTE updates             136246380   135333180   468517162   425261524
NUMA hint faults                   512           0           0           0
NUMA hint local faults             248           0           0           0
NUMA hint local percent             48         100         100         100
NUMA pages migrated                169           0           0           0
AutoNUMA cost                      956         947        3279        2976

Does not need spelling out. The lack of huge PMD updates is a curiosity worth
looking at some other time.

ebizzy
                       3.13.0-rc3            3.13.0-rc3            3.13.0-rc3            3.13.0-rc3
                          vanilla       instrument-v5r1      lruslabonly-v1r2            local-v1r2
Mean     1      3213.33 (  0.00%)     3204.33 ( -0.28%)     3188.00 ( -0.79%)     3195.33 ( -0.56%)
Mean     2      2291.33 (  0.00%)     2324.67 (  1.45%)     2328.00 (  1.60%)     2350.33 (  2.57%)
Mean     3      2234.67 (  0.00%)     2264.67 (  1.34%)     2255.00 (  0.91%)     2283.00 (  2.16%)
Mean     4      2224.33 (  0.00%)     2253.67 (  1.32%)     2252.33 (  1.26%)     2281.00 (  2.55%)
Mean     5      2256.33 (  0.00%)     2276.00 (  0.87%)     2241.00 ( -0.68%)     2234.33 ( -0.98%)
Mean     6      2233.00 (  0.00%)     2265.67 (  1.46%)     2238.67 (  0.25%)     2258.00 (  1.12%)
Mean     7      2212.33 (  0.00%)     2242.67 (  1.37%)     2251.33 (  1.76%)     2258.67 (  2.09%)
Mean     8      2224.67 (  0.00%)     2241.67 (  0.76%)     2215.33 ( -0.42%)     2240.67 (  0.72%)
Mean     12     2213.33 (  0.00%)     2246.00 (  1.48%)     2257.00 (  1.97%)     2263.67 (  2.27%)
Mean     16     2221.00 (  0.00%)     2263.67 (  1.92%)     2257.67 (  1.65%)     2254.00 (  1.49%)
Mean     20     2215.00 (  0.00%)     2268.67 (  2.42%)     2262.67 (  2.15%)     2282.67 (  3.05%)
Mean     24     2175.00 (  0.00%)     2204.00 (  1.33%)     2232.67 (  2.65%)     2224.00 (  2.25%)
Mean     28     2110.00 (  0.00%)     2142.33 (  1.53%)     2164.00 (  2.56%)     2182.33 (  3.43%)
Mean     32     2077.67 (  0.00%)     2089.33 (  0.56%)     2074.33 ( -0.16%)     2132.67 (  2.65%)
Mean     36     2016.33 (  0.00%)     2025.33 (  0.45%)     2040.33 (  1.19%)     2096.33 (  3.97%)
Mean     40     1984.00 (  0.00%)     1983.67 ( -0.02%)     2002.67 (  0.94%)     2067.00 (  4.18%)
Mean     44     1943.33 (  0.00%)     1960.33 (  0.87%)     1961.33 (  0.93%)     2027.33 (  4.32%)
Mean     48     1925.00 (  0.00%)     1938.33 (  0.69%)     1942.00 (  0.88%)     2024.00 (  5.14%)
Stddev   1        25.42 (  0.00%)       43.52 (-71.21%)       32.54 (-27.99%)       74.42 (-192.77%)
Stddev   2        29.68 (  0.00%)        2.62 ( 91.16%)       22.73 ( 23.41%)       20.53 ( 30.82%)
Stddev   3        18.15 (  0.00%)       11.15 ( 38.60%)        6.16 ( 66.04%)        0.82 ( 95.50%)
Stddev   4        41.28 (  0.00%)        8.73 ( 78.85%)        4.64 ( 88.75%)        6.38 ( 84.55%)
Stddev   5        27.18 (  0.00%)       30.41 (-11.87%)       28.25 ( -3.92%)       31.08 (-14.35%)
Stddev   6        10.80 (  0.00%)       20.24 (-87.36%)       22.10 (-104.57%)       29.70 (-174.95%)
Stddev   7        23.10 (  0.00%)        7.59 ( 67.16%)       21.56 (  6.66%)       19.15 ( 17.08%)
Stddev   8         3.68 (  0.00%)        8.65 (-135.04%)       26.23 (-612.53%)       27.79 (-654.77%)
Stddev   12       23.84 (  0.00%)       10.03 ( 57.91%)        4.97 ( 79.16%)        4.99 ( 79.07%)
Stddev   16       20.22 (  0.00%)       32.11 (-58.83%)        3.09 ( 84.71%)        4.97 ( 75.43%)
Stddev   20        3.74 (  0.00%)       28.94 (-673.47%)       36.02 (-862.72%)       16.94 (-352.68%)
Stddev   24       18.18 (  0.00%)       27.90 (-53.45%)       21.36 (-17.46%)        9.90 ( 45.56%)
Stddev   28       11.78 (  0.00%)        1.70 ( 85.57%)       23.85 (-102.51%)       46.94 (-298.64%)
Stddev   32        9.74 (  0.00%)       20.27 (-108.09%)       10.87 (-11.62%)        8.58 ( 11.96%)
Stddev   36        3.86 (  0.00%)        8.50 (-120.24%)       11.44 (-196.50%)       21.25 (-450.71%)
Stddev   40       14.17 (  0.00%)        2.05 ( 85.49%)        7.04 ( 50.31%)        7.12 ( 49.75%)
Stddev   44        7.54 (  0.00%)        2.87 ( 61.98%)        2.05 ( 72.76%)        2.49 ( 66.93%)
Stddev   48        2.94 (  0.00%)        5.44 (-84.67%)        7.07 (-140.19%)       15.64 (-431.33%)

I ran ebizzy because it doubles up as a page allocation microbenchmark that
hits page faults differently from PFT. It looks like an ok gain but the stddev
is high and would need to be stabilised before drawing a solid conclusion.

          3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
              vanilla  instrument-v5r1  lruslabonly-v1r2  local-v1r2
User          491.24      497.42      497.39      505.94
System        874.62      871.88      872.00      867.24
Elapsed      1082.00     1082.28     1082.24     1082.18

                            3.13.0-rc3  3.13.0-rc3  3.13.0-rc3  3.13.0-rc3
                        vanilla  instrument-v5r1  lruslabonly-v1r2  local-v1r2
NUMA alloc hit               238904205   239454483   240328156   317773920
NUMA alloc miss               71969773    74642250    73502556           0
NUMA interleave hit                  0           0           0           0
NUMA alloc local             238904198   239454483   240328156   317773916
NUMA page range updates         157577      731770      766010      781561
NUMA huge PMD updates               33          34          72          76
NUMA PTE updates                140714      714396      729218      742725
NUMA hint faults                 39395          28          65          70
NUMA hint local faults           17294          12          43          46
NUMA hint local percent             43          42          66          65
NUMA pages migrated               7183        2048        3072        6656
AutoNUMA cost                      198           5           5           5

Similarish observations.

None of these benchmarks do *anything* related to what commit 81c0a2bb was
supposed to fix. I just wanted to get the point across that our current
default behaviour sucks and we should revisit that decision.

My position is that by default we should only round-robin zones local to
the allocating process and that node round-robin is something that should
only be explicitly enabled.

I'm less sure about the round robin treatment of slab but am erring on
the side of historical behaviour until it is proven otherwise.
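
In code terms, the proposed default boils down to the following (a
condensed sketch paraphrasing zone_distribute_age() from the last patch
with only DISTRIBUTE_LOCAL set, not a literal excerpt; returning true
means "skip this zone to spread page ages"):

    /* Only fair-age allocations that will end up on the LRU */
    if (!(gfp_mask & __GFP_MOVABLE))
        return false;        /* e.g. slab: allocate normally */

    /* This zone has used up its batch, move on so zones age evenly */
    if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
        return true;

    /* DISTRIBUTE_LOCAL: never round-robin into a zone on a remote node */
    if (!zone_local(preferred_zone, zone))
        return true;

    return false;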

 Documentation/sysctl/vm.txt |  28 +++++++++++
 include/linux/mmzone.h      |   2 +
 include/linux/swap.h        |   2 +
 kernel/sysctl.c             |   8 ++++
 mm/page_alloc.c             | 111 +++++++++++++++++++++++++++++++++++---------
 5 files changed, 130 insertions(+), 21 deletions(-)

-- 
1.8.4


* [PATCH 1/4] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
  2013-12-12 15:06 [RFC PATCH 0/4] Configurable fair allocation zone policy Mel Gorman
@ 2013-12-12 15:06 ` Mel Gorman
  2013-12-12 15:06 ` [PATCH 2/4] mm: page_alloc: Break out zone page aging distribution into its own helper Mel Gorman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2013-12-12 15:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

From: Johannes Weiner <hannes@cmpxchg.org>

Dave Hansen noted a regression in a microbenchmark that loops around
open() and close() on an 8-node NUMA machine and bisected it down to
81c0a2bb515f ("mm: page_alloc: fair zone allocator policy").  That
change forces the slab allocations of the file descriptor to spread
out to all 8 nodes, causing remote references in the page allocator
and slab.

The round-robin policy is only there to provide fairness among memory
allocations that are reclaimed involuntarily based on pressure in each
zone.  It does not make sense to apply it to unreclaimable kernel
allocations that are freed manually, in this case instantly after the
allocation, and incur the remote reference costs twice for no reason.

Only round-robin allocations that are usually freed through page
reclaim or slab shrinking.
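
For reference, GFP_MOVABLE_MASK is the existing mask from
include/linux/gfp.h (unchanged by this patch), which is why the check
catches page cache and anonymous pages as well as reclaimable slab while
leaving other kernel allocations alone:

    #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)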

Cc: <stable@kernel.org>
Bisected-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f0..f861d02 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1920,7 +1920,8 @@ zonelist_scan:
 		 * back to remote zones that do not partake in the
 		 * fairness round-robin cycle of this zonelist.
 		 */
-		if (alloc_flags & ALLOC_WMARK_LOW) {
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & GFP_MOVABLE_MASK)) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
 			if (zone_reclaim_mode &&
-- 
1.8.4


* [PATCH 2/4] mm: page_alloc: Break out zone page aging distribution into its own helper
  2013-12-12 15:06 [RFC PATCH 0/4] Configurable fair allocation zone policy Mel Gorman
  2013-12-12 15:06 ` [PATCH 1/4] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
@ 2013-12-12 15:06 ` Mel Gorman
  2013-12-12 15:06 ` [PATCH 3/4] mm: page_alloc: Use zone node IDs to approximate locality Mel Gorman
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2013-12-12 15:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

This patch moves the decision on whether to round-robin allocations between
zones and nodes into its own helper function. It'll make some later patches
easier to understand and the helper will be automatically inlined.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 63 ++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 42 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f861d02..64020eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1872,6 +1872,42 @@ static inline void init_zone_allows_reclaim(int nid)
 #endif	/* CONFIG_NUMA */
 
 /*
+ * Distribute pages in proportion to the individual zone size to ensure fair
+ * page aging.  The zone a page was allocated in should have no effect on the
+ * time the page has in memory before being reclaimed.
+ * 
+ * Returns true if this zone should be skipped to spread the page ages to
+ * other zones.
+ */
+static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
+				struct zone *zone, int alloc_flags)
+{
+	/* Only round robin in the allocator fast path */
+	if (!(alloc_flags & ALLOC_WMARK_LOW))
+		return false;
+
+	/* Only round robin pages likely to be LRU or reclaimable slab */
+	if (!(gfp_mask & GFP_MOVABLE_MASK))
+		return false;
+
+	/* Distribute to the next zone if this zone has exhausted its batch */
+	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
+		return true;
+
+	/*
+	 * When zone_reclaim_mode is enabled, try to stay in local zones in the
+	 * fastpath.  If that fails, the slowpath is entered, which will do
+	 * another pass starting with the local zones, but ultimately fall back
+	 * back to remote zones that do not partake in the fairness round-robin
+	 * cycle of this zonelist.
+	 */
+	if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+		return true;
+
+	return false;
+}
+
+/*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
  */
@@ -1907,27 +1943,12 @@ zonelist_scan:
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
 			goto try_this_zone;
-		/*
-		 * Distribute pages in proportion to the individual
-		 * zone size to ensure fair page aging.  The zone a
-		 * page was allocated in should have no effect on the
-		 * time the page has in memory before being reclaimed.
-		 *
-		 * When zone_reclaim_mode is enabled, try to stay in
-		 * local zones in the fastpath.  If that fails, the
-		 * slowpath is entered, which will do another pass
-		 * starting with the local zones, but ultimately fall
-		 * back to remote zones that do not partake in the
-		 * fairness round-robin cycle of this zonelist.
-		 */
-		if ((alloc_flags & ALLOC_WMARK_LOW) &&
-		    (gfp_mask & GFP_MOVABLE_MASK)) {
-			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
-				continue;
-			if (zone_reclaim_mode &&
-			    !zone_local(preferred_zone, zone))
-				continue;
-		}
+
+		/* Distribute pages to ensure fair page aging */
+		if (zone_distribute_age(gfp_mask, preferred_zone, zone,
+					alloc_flags))
+			continue;
+
 		/*
 		 * When allocating a page cache page for writing, we
 		 * want to get it from a zone that is within its dirty
-- 
1.8.4


* [PATCH 3/4] mm: page_alloc: Use zone node IDs to approximate locality
  2013-12-12 15:06 [RFC PATCH 0/4] Configurable fair allocation zone policy Mel Gorman
  2013-12-12 15:06 ` [PATCH 1/4] mm: page_alloc: exclude unreclaimable allocations from zone fairness policy Mel Gorman
  2013-12-12 15:06 ` [PATCH 2/4] mm: page_alloc: Break out zone page aging distribution into its own helper Mel Gorman
@ 2013-12-12 15:06 ` Mel Gorman
  2013-12-12 15:06 ` [PATCH 4/4] mm: page_alloc: Make zone distribution page aging policy configurable Mel Gorman
  2013-12-12 15:34 ` [RFC PATCH 0/4] Configurable fair allocation zone policy Mel Gorman
  4 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2013-12-12 15:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

zone_local is using node_distance which is a more expensive call than
necessary. On x86, it's another function call in the allocator fast path
and increases cache footprint. This patch makes the assumption that zones
on the local node will share the same node ID. The necessary information
should already be cache hot.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64020eb..fd9677e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1816,7 +1816,7 @@ static void zlc_clear_zones_full(struct zonelist *zonelist)
 
 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
-	return node_distance(local_zone->node, zone->node) == LOCAL_DISTANCE;
+	return zone_to_nid(zone) == numa_node_id();
 }
 
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
-- 
1.8.4


* [PATCH 4/4] mm: page_alloc: Make zone distribution page aging policy configurable
  2013-12-12 15:06 [RFC PATCH 0/4] Configurable fair allocation zone policy Mel Gorman
                   ` (2 preceding siblings ...)
  2013-12-12 15:06 ` [PATCH 3/4] mm: page_alloc: Use zone node IDs to approximate locality Mel Gorman
@ 2013-12-12 15:06 ` Mel Gorman
  2013-12-12 15:34 ` [RFC PATCH 0/4] Configurable fair allocation zone policy Mel Gorman
  4 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2013-12-12 15:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML,
	Mel Gorman

Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
bug whereby new pages could be reclaimed before old pages because of
how the page allocator and kswapd interacted on the per-zone LRU lists.
Unfortunately it was missed during review that a consequence is that
we also round-robin between NUMA nodes. This is bad for two reasons

1. It alters the semantics of MPOL_LOCAL without telling anyone
2. It incurs an immediate remote memory performance hit in exchange
   for a potential performance gain when memory needs to be reclaimed
   later

No cookies for the reviewers on this one.

This patch makes the behaviour of the fair zone allocator policy
configurable.  By default it will only distribute pages that are going
to exist on the LRU between zones local to the allocating process. This
preserves the historical semantics of MPOL_LOCAL.

By default, slab pages are not distributed between zones after this patch is
applied. It can be argued that they should get similar treatment but they
have different lifecycles to LRU pages, the shrinkers are not zone-aware
and the interaction between the page allocator and kswapd is different
for slabs. If it turns out to be an almost universal win, we can change
the default.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/vm.txt | 28 ++++++++++++++++++++++++
 include/linux/mmzone.h      |  2 ++
 include/linux/swap.h        |  2 ++
 kernel/sysctl.c             |  8 +++++++
 mm/page_alloc.c             | 53 ++++++++++++++++++++++++++++++++++++++++++---
 5 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 1fbd4eb..cd45b4c 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/vm:
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- zone_distribute_mode
 - zone_reclaim_mode
 
 ==============================================================
@@ -724,6 +725,33 @@ causes the kernel to prefer to reclaim dentries and inodes.
 
 ==============================================================
 
+zone_distribute_mode
+
+Pages allocation and reclaim are managed on a per-zone basis. When the
+system needs to reclaim memory, candidate pages are selected from these
+per-zone lists.  Historically, a potential consequence was that recently
+allocated pages were considered reclaim candidates. From a zone-local
+perspective, page aging was preserved but from a system-wide perspective
+there was an age inversion problem.
+
+A similar problem occurs on a node level where young pages may be reclaimed
+from the local node instead of allocating remote memory. Unfortunately, the
+cost of accessing remote nodes is higher so the system must choose by default
+between favouring page aging or node locality. zone_distribute_mode controls
+how the system will distribute page ages between zones.
+
+The values are ORed together
+
+0	= Never round-robin based on age
+1	= Distribute between zones local to the allocating node
+2	= Distribute between all nodes, effectively alters MPOL_DEFAULT
+4	= Distribute reclaimable slab pages between zones
+
+Note that zone_reclaim_mode overrides "2" above. If zone_reclaim_mode is
+enabled then node-local allocation policies are still enforced.
+
+==============================================================
+
 zone_reclaim_mode:
 
 Zone_reclaim_mode allows someone to set more or less aggressive approaches to
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b835d3f..20a75e3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -897,6 +897,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
+int sysctl_zone_distribute_mode_handler(struct ctl_table *, int,
+			void __user *, size_t *, loff_t *);
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..44329b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,6 +318,8 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
+extern unsigned __bitwise__ zone_distribute_mode;
+
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 34a6047..b75c08f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1349,6 +1349,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
+	{
+		.procname	= "zone_distribute_mode",
+		.data		= &zone_distribute_mode,
+		.maxlen		= sizeof(zone_distribute_mode),
+		.mode		= 0644,
+		.proc_handler	= sysctl_zone_distribute_mode_handler,
+		.extra1		= &zero,
+	},
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd9677e..fef353c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1871,6 +1871,44 @@ static inline void init_zone_allows_reclaim(int nid)
 }
 #endif	/* CONFIG_NUMA */
 
+/* Controls how page ages are distributed across zones automatically */
+unsigned __bitwise__ zone_distribute_mode __read_mostly;
+unsigned __bitwise__ zone_distribute_mask __read_mostly;
+
+/* See zone_distribute_mode documentation in Documentation/sysctl/vm.txt */
+#define DISTRIBUTE_DISABLE	(0)
+#define DISTRIBUTE_LOCAL	(1UL << 0)
+#define DISTRIBUTE_REMOTE	(1UL << 1)
+#define DISTRIBUTE_SLAB		(1UL << 2)
+
+#define DISTRIBUTE_STUPID	(DISTRIBUTE_LOCAL|DISTRIBUTE_REMOTE)
+#define DISTRIBUTE_DEFAULT	(DISTRIBUTE_LOCAL)
+
+int sysctl_zone_distribute_mode_handler(ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	/* If you are an admin reading this comment, what were you thinking? */
+	if (WARN_ON_ONCE((zone_distribute_mode & DISTRIBUTE_STUPID) ==
+							DISTRIBUTE_STUPID))
+		zone_distribute_mode &= ~DISTRIBUTE_REMOTE;
+
+	/* Set the allowed GFP flags for fair allocation policy */
+	zone_distribute_mask = 0;
+	if (zone_distribute_mode) {
+		zone_distribute_mask = __GFP_MOVABLE;
+		if (zone_distribute_mode & DISTRIBUTE_SLAB)
+			zone_distribute_mask |= __GFP_RECLAIMABLE;
+	}
+
+	return 0;
+}
+
 /*
  * Distribute pages in proportion to the individual zone size to ensure fair
  * page aging.  The zone a page was allocated in should have no effect on the
@@ -1882,18 +1920,25 @@ static inline void init_zone_allows_reclaim(int nid)
 static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
 				struct zone *zone, int alloc_flags)
 {
+	bool zone_is_local;
+
 	/* Only round robin in the allocator fast path */
 	if (!(alloc_flags & ALLOC_WMARK_LOW))
 		return false;
 
-	/* Only round robin pages likely to be LRU or reclaimable slab */
-	if (!(gfp_mask & GFP_MOVABLE_MASK))
+	/* Only round robin the requested sort of pages */
+	if (!(gfp_mask & zone_distribute_mask))
 		return false;
 
 	/* Distribute to the next zone if this zone has exhausted its batch */
 	if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 		return true;
 
+	/* Distribute only between zones local to the node if requested */
+	zone_is_local = zone_local(preferred_zone, zone);
+	if ((zone_distribute_mode & DISTRIBUTE_LOCAL) && !zone_is_local)
+		return true;
+
 	/*
 	 * When zone_reclaim_mode is enabled, try to stay in local zones in the
 	 * fastpath.  If that fails, the slowpath is entered, which will do
@@ -1901,7 +1946,8 @@ static bool zone_distribute_age(gfp_t gfp_mask, struct zone *preferred_zone,
 	 * back to remote zones that do not partake in the fairness round-robin
 	 * cycle of this zonelist.
 	 */
-	if (zone_reclaim_mode && !zone_local(preferred_zone, zone))
+	WARN_ON_ONCE(!(zone_distribute_mode & DISTRIBUTE_REMOTE));
+	if (zone_reclaim_mode && !zone_is_local)
 		return true;
 
 	return false;
@@ -3797,6 +3843,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 		__build_all_zonelists(NULL);
 		mminit_verify_zonelist();
 		cpuset_init_current_mems_allowed();
+		zone_distribute_mode = DISTRIBUTE_DEFAULT;
 	} else {
 #ifdef CONFIG_MEMORY_HOTPLUG
 		if (zone)
-- 
1.8.4


* Re: [RFC PATCH 0/4] Configurable fair allocation zone policy
  2013-12-12 15:06 [RFC PATCH 0/4] Configurable fair allocation zone policy Mel Gorman
                   ` (3 preceding siblings ...)
  2013-12-12 15:06 ` [PATCH 4/4] mm: page_alloc: Make zone distribution page aging policy configurable Mel Gorman
@ 2013-12-12 15:34 ` Mel Gorman
  4 siblings, 0 replies; 6+ messages in thread
From: Mel Gorman @ 2013-12-12 15:34 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Dave Hansen, Rik van Riel, Linux-MM, LKML

On Thu, Dec 12, 2013 at 03:06:15PM +0000, Mel Gorman wrote:
> Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") solved a
> bug whereby new pages could be reclaimed before old pages because of how
> the page allocator and kswapd interacted on the per-zone LRU lists.
> 
> Unfortunately a side-effect missed during review was that it's now very
> easy to allocate remote memory on NUMA machines. The problem is that
> it is not a simple case of just restoring local allocation policies as
> there are genuine reasons why global page aging may be preferable. It's
> still a major change to default behaviour so this patch makes the policy
> configurable and sets what I think is a sensible default.
> 
> The patches are on top of some NUMA balancing patches currently in -mm.
> The first patch in the series is a patch posted by Johannes that must be
> taken into account before any of my patches on top. The last patch of the
> series is what alters default behaviour and makes the fair zone allocator
> policy configurable.
> 
> Sniff test results based on following kernels
> 
> vanilla		 3.13-rc3 stock
> instrument-v5r1  NUMA balancing patches just to rule out any conflicts there
> lruslabonly-v1r2 Patch 1 only
> local-v1r2	 Full series
> 

These figures need to be redone. The instrument-v5r1 and later kernels
included a debugging patch that increases migration rates to trigger
another bug. The figures of local-v1r2 relative to instrument-v5r1 are
fine but not relative to 3.13.0-rc3-vanilla.

-- 
Mel Gorman
SUSE Labs

