* [RFC] mm/slub: Reduce memory consumption in extreme scenarios
@ 2023-03-07 8:28 Chen Jun
2023-03-07 10:30 ` kernel test robot
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Chen Jun @ 2023-03-07 8:28 UTC (permalink / raw)
To: linux-kernel, linux-mm, cl, penberg, rientjes, iamjoonsoo.kim,
akpm, vbabka
Cc: xuqiang36, chenjun102
If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
Slub will alloc a slub page which is not belong to A, and put the page
to kmem_cache_node[page_to_nid(page)]. The page can not be reused
at next calling, because NULL will be get from get_partical().
That make kmalloc_node consume more memory.
On qemu with 4 numas and each numa has 1G memory, Write a test ko
to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.
cat /proc/slabinfo shows:
kmalloc-256 4302317 15151808 256 32 2 : tunables..
the total objects is much more then active objects.
After this patch, cat /prac/slubinfo shows:
kmalloc-256 5244950 5245088 256 32 2 : tunables..
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
mm/slub.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 39327e98fce3..c0090a5de54e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
searchnode = numa_mem_id();
object = get_partial_node(s, get_node(s, searchnode), pc);
- if (object || node != NUMA_NO_NODE)
+ if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
return object;
return get_any_partial(s, pc);
@@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
struct slab *slab;
unsigned long flags;
struct partial_context pc;
+ int try_thisndoe = 0;
stat(s, ALLOC_SLOWPATH);
@@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
}
new_objects:
-
pc.flags = gfpflags;
+
+ /* Try to get page from specific node even if __GFP_THISNODE is not set */
+ if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
+ pc.flags |= __GFP_THISNODE;
+
pc.slab = &slab;
pc.orig_size = orig_size;
freelist = get_partial(s, node, &pc);
@@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto check_new_slab;
slub_put_cpu_ptr(s->cpu_slab);
- slab = new_slab(s, gfpflags, node);
+ slab = new_slab(s, pc.flags, node);
c = slub_get_cpu_ptr(s->cpu_slab);
if (unlikely(!slab)) {
+ /* Try to get page from any other node */
+ if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
+ try_thisnode = 0;
+ goto new_objects;
+ }
+
slab_out_of_memory(s, gfpflags, node);
return NULL;
}
--
2.17.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
2023-03-07 8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
@ 2023-03-07 10:30 ` kernel test robot
2023-03-07 10:50 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2023-03-07 10:30 UTC (permalink / raw)
To: Chen Jun; +Cc: oe-kbuild-all
Hi Chen,
[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230307082811.120774-1-chenjun102%40huawei.com
patch subject: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
config: i386-allnoconfig (https://download.01.org/0day-ci/archive/20230307/202303071836.MYJQsNfz-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/d00decb3ccc45b514f25df0d3775153a933ffaa9
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
git checkout d00decb3ccc45b514f25df0d3775153a933ffaa9
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 olddefconfig
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303071836.MYJQsNfz-lkp@intel.com/
All warnings (new ones prefixed by >>):
mm/slub.c: In function '___slab_alloc':
mm/slub.c:3188:69: error: 'try_thisnode' undeclared (first use in this function); did you mean 'try_thisndoe'?
3188 | if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
| ^~~~~~~~~~~~
| try_thisndoe
mm/slub.c:3188:69: note: each undeclared identifier is reported only once for each function it appears in
>> mm/slub.c:3072:13: warning: unused variable 'try_thisndoe' [-Wunused-variable]
3072 | int try_thisndoe = 0;
| ^~~~~~~~~~~~
vim +/try_thisndoe +3072 mm/slub.c
3045
3046 /*
3047 * Slow path. The lockless freelist is empty or we need to perform
3048 * debugging duties.
3049 *
3050 * Processing is still very fast if new objects have been freed to the
3051 * regular freelist. In that case we simply take over the regular freelist
3052 * as the lockless freelist and zap the regular freelist.
3053 *
3054 * If that is not working then we fall back to the partial lists. We take the
3055 * first element of the freelist as the object to allocate now and move the
3056 * rest of the freelist to the lockless freelist.
3057 *
3058 * And if we were unable to get a new slab from the partial slab lists then
3059 * we need to allocate a new slab. This is the slowest path since it involves
3060 * a call to the page allocator and the setup of a new slab.
3061 *
3062 * Version of __slab_alloc to use when we know that preemption is
3063 * already disabled (which is the case for bulk allocation).
3064 */
3065 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
3066 unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
3067 {
3068 void *freelist;
3069 struct slab *slab;
3070 unsigned long flags;
3071 struct partial_context pc;
> 3072 int try_thisndoe = 0;
3073
3074 stat(s, ALLOC_SLOWPATH);
3075
3076 reread_slab:
3077
3078 slab = READ_ONCE(c->slab);
3079 if (!slab) {
3080 /*
3081 * if the node is not online or has no normal memory, just
3082 * ignore the node constraint
3083 */
3084 if (unlikely(node != NUMA_NO_NODE &&
3085 !node_isset(node, slab_nodes)))
3086 node = NUMA_NO_NODE;
3087 goto new_slab;
3088 }
3089 redo:
3090
3091 if (unlikely(!node_match(slab, node))) {
3092 /*
3093 * same as above but node_match() being false already
3094 * implies node != NUMA_NO_NODE
3095 */
3096 if (!node_isset(node, slab_nodes)) {
3097 node = NUMA_NO_NODE;
3098 } else {
3099 stat(s, ALLOC_NODE_MISMATCH);
3100 goto deactivate_slab;
3101 }
3102 }
3103
3104 /*
3105 * By rights, we should be searching for a slab page that was
3106 * PFMEMALLOC but right now, we are losing the pfmemalloc
3107 * information when the page leaves the per-cpu allocator
3108 */
3109 if (unlikely(!pfmemalloc_match(slab, gfpflags)))
3110 goto deactivate_slab;
3111
3112 /* must check again c->slab in case we got preempted and it changed */
3113 local_lock_irqsave(&s->cpu_slab->lock, flags);
3114 if (unlikely(slab != c->slab)) {
3115 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3116 goto reread_slab;
3117 }
3118 freelist = c->freelist;
3119 if (freelist)
3120 goto load_freelist;
3121
3122 freelist = get_freelist(s, slab);
3123
3124 if (!freelist) {
3125 c->slab = NULL;
3126 c->tid = next_tid(c->tid);
3127 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3128 stat(s, DEACTIVATE_BYPASS);
3129 goto new_slab;
3130 }
3131
3132 stat(s, ALLOC_REFILL);
3133
3134 load_freelist:
3135
3136 lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
3137
3138 /*
3139 * freelist is pointing to the list of objects to be used.
3140 * slab is pointing to the slab from which the objects are obtained.
3141 * That slab must be frozen for per cpu allocations to work.
3142 */
3143 VM_BUG_ON(!c->slab->frozen);
3144 c->freelist = get_freepointer(s, freelist);
3145 c->tid = next_tid(c->tid);
3146 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3147 return freelist;
3148
3149 deactivate_slab:
3150
3151 local_lock_irqsave(&s->cpu_slab->lock, flags);
3152 if (slab != c->slab) {
3153 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3154 goto reread_slab;
3155 }
3156 freelist = c->freelist;
3157 c->slab = NULL;
3158 c->freelist = NULL;
3159 c->tid = next_tid(c->tid);
3160 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3161 deactivate_slab(s, slab, freelist);
3162
3163 new_slab:
3164
3165 if (slub_percpu_partial(c)) {
3166 local_lock_irqsave(&s->cpu_slab->lock, flags);
3167 if (unlikely(c->slab)) {
3168 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3169 goto reread_slab;
3170 }
3171 if (unlikely(!slub_percpu_partial(c))) {
3172 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3173 /* we were preempted and partial list got empty */
3174 goto new_objects;
3175 }
3176
3177 slab = c->slab = slub_percpu_partial(c);
3178 slub_set_percpu_partial(c, slab);
3179 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3180 stat(s, CPU_PARTIAL_ALLOC);
3181 goto redo;
3182 }
3183
3184 new_objects:
3185 pc.flags = gfpflags;
3186
3187 /* Try to get page from specific node even if __GFP_THISNODE is not set */
3188 if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
3189 pc.flags |= __GFP_THISNODE;
3190
3191 pc.slab = &slab;
3192 pc.orig_size = orig_size;
3193 freelist = get_partial(s, node, &pc);
3194 if (freelist)
3195 goto check_new_slab;
3196
3197 slub_put_cpu_ptr(s->cpu_slab);
3198 slab = new_slab(s, pc.flags, node);
3199 c = slub_get_cpu_ptr(s->cpu_slab);
3200
3201 if (unlikely(!slab)) {
3202 /* Try to get page from any other node */
3203 if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
3204 try_thisnode = 0;
3205 goto new_objects;
3206 }
3207
3208 slab_out_of_memory(s, gfpflags, node);
3209 return NULL;
3210 }
3211
3212 stat(s, ALLOC_SLAB);
3213
3214 if (kmem_cache_debug(s)) {
3215 freelist = alloc_single_from_new_slab(s, slab, orig_size);
3216
3217 if (unlikely(!freelist))
3218 goto new_objects;
3219
3220 if (s->flags & SLAB_STORE_USER)
3221 set_track(s, freelist, TRACK_ALLOC, addr);
3222
3223 return freelist;
3224 }
3225
3226 /*
3227 * No other reference to the slab yet so we can
3228 * muck around with it freely without cmpxchg
3229 */
3230 freelist = slab->freelist;
3231 slab->freelist = NULL;
3232 slab->inuse = slab->objects;
3233 slab->frozen = 1;
3234
3235 inc_slabs_node(s, slab_nid(slab), slab->objects);
3236
3237 check_new_slab:
3238
3239 if (kmem_cache_debug(s)) {
3240 /*
3241 * For debug caches here we had to go through
3242 * alloc_single_from_partial() so just store the tracking info
3243 * and return the object
3244 */
3245 if (s->flags & SLAB_STORE_USER)
3246 set_track(s, freelist, TRACK_ALLOC, addr);
3247
3248 return freelist;
3249 }
3250
3251 if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
3252 /*
3253 * For !pfmemalloc_match() case we don't load freelist so that
3254 * we don't make further mismatched allocations easier.
3255 */
3256 deactivate_slab(s, slab, get_freepointer(s, freelist));
3257 return freelist;
3258 }
3259
3260 retry_load_slab:
3261
3262 local_lock_irqsave(&s->cpu_slab->lock, flags);
3263 if (unlikely(c->slab)) {
3264 void *flush_freelist = c->freelist;
3265 struct slab *flush_slab = c->slab;
3266
3267 c->slab = NULL;
3268 c->freelist = NULL;
3269 c->tid = next_tid(c->tid);
3270
3271 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3272
3273 deactivate_slab(s, flush_slab, flush_freelist);
3274
3275 stat(s, CPUSLAB_FLUSH);
3276
3277 goto retry_load_slab;
3278 }
3279 c->slab = slab;
3280
3281 goto load_freelist;
3282 }
3283
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
2023-03-07 8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
2023-03-07 10:30 ` kernel test robot
@ 2023-03-07 10:50 ` kernel test robot
2023-03-07 14:20 ` Hyeonggon Yoo
2023-03-08 10:43 ` kernel test robot
3 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2023-03-07 10:50 UTC (permalink / raw)
To: Chen Jun; +Cc: oe-kbuild-all
Hi Chen,
[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230307082811.120774-1-chenjun102%40huawei.com
patch subject: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20230307/202303071843.3sjPyGIX-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/d00decb3ccc45b514f25df0d3775153a933ffaa9
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
git checkout d00decb3ccc45b514f25df0d3775153a933ffaa9
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303071843.3sjPyGIX-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/slub.c: In function '___slab_alloc':
>> mm/slub.c:3188:69: error: 'try_thisnode' undeclared (first use in this function); did you mean 'try_thisndoe'?
3188 | if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
| ^~~~~~~~~~~~
| try_thisndoe
mm/slub.c:3188:69: note: each undeclared identifier is reported only once for each function it appears in
mm/slub.c:3072:13: warning: unused variable 'try_thisndoe' [-Wunused-variable]
3072 | int try_thisndoe = 0;
| ^~~~~~~~~~~~
vim +3188 mm/slub.c
3045
3046 /*
3047 * Slow path. The lockless freelist is empty or we need to perform
3048 * debugging duties.
3049 *
3050 * Processing is still very fast if new objects have been freed to the
3051 * regular freelist. In that case we simply take over the regular freelist
3052 * as the lockless freelist and zap the regular freelist.
3053 *
3054 * If that is not working then we fall back to the partial lists. We take the
3055 * first element of the freelist as the object to allocate now and move the
3056 * rest of the freelist to the lockless freelist.
3057 *
3058 * And if we were unable to get a new slab from the partial slab lists then
3059 * we need to allocate a new slab. This is the slowest path since it involves
3060 * a call to the page allocator and the setup of a new slab.
3061 *
3062 * Version of __slab_alloc to use when we know that preemption is
3063 * already disabled (which is the case for bulk allocation).
3064 */
3065 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
3066 unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
3067 {
3068 void *freelist;
3069 struct slab *slab;
3070 unsigned long flags;
3071 struct partial_context pc;
3072 int try_thisndoe = 0;
3073
3074 stat(s, ALLOC_SLOWPATH);
3075
3076 reread_slab:
3077
3078 slab = READ_ONCE(c->slab);
3079 if (!slab) {
3080 /*
3081 * if the node is not online or has no normal memory, just
3082 * ignore the node constraint
3083 */
3084 if (unlikely(node != NUMA_NO_NODE &&
3085 !node_isset(node, slab_nodes)))
3086 node = NUMA_NO_NODE;
3087 goto new_slab;
3088 }
3089 redo:
3090
3091 if (unlikely(!node_match(slab, node))) {
3092 /*
3093 * same as above but node_match() being false already
3094 * implies node != NUMA_NO_NODE
3095 */
3096 if (!node_isset(node, slab_nodes)) {
3097 node = NUMA_NO_NODE;
3098 } else {
3099 stat(s, ALLOC_NODE_MISMATCH);
3100 goto deactivate_slab;
3101 }
3102 }
3103
3104 /*
3105 * By rights, we should be searching for a slab page that was
3106 * PFMEMALLOC but right now, we are losing the pfmemalloc
3107 * information when the page leaves the per-cpu allocator
3108 */
3109 if (unlikely(!pfmemalloc_match(slab, gfpflags)))
3110 goto deactivate_slab;
3111
3112 /* must check again c->slab in case we got preempted and it changed */
3113 local_lock_irqsave(&s->cpu_slab->lock, flags);
3114 if (unlikely(slab != c->slab)) {
3115 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3116 goto reread_slab;
3117 }
3118 freelist = c->freelist;
3119 if (freelist)
3120 goto load_freelist;
3121
3122 freelist = get_freelist(s, slab);
3123
3124 if (!freelist) {
3125 c->slab = NULL;
3126 c->tid = next_tid(c->tid);
3127 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3128 stat(s, DEACTIVATE_BYPASS);
3129 goto new_slab;
3130 }
3131
3132 stat(s, ALLOC_REFILL);
3133
3134 load_freelist:
3135
3136 lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
3137
3138 /*
3139 * freelist is pointing to the list of objects to be used.
3140 * slab is pointing to the slab from which the objects are obtained.
3141 * That slab must be frozen for per cpu allocations to work.
3142 */
3143 VM_BUG_ON(!c->slab->frozen);
3144 c->freelist = get_freepointer(s, freelist);
3145 c->tid = next_tid(c->tid);
3146 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3147 return freelist;
3148
3149 deactivate_slab:
3150
3151 local_lock_irqsave(&s->cpu_slab->lock, flags);
3152 if (slab != c->slab) {
3153 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3154 goto reread_slab;
3155 }
3156 freelist = c->freelist;
3157 c->slab = NULL;
3158 c->freelist = NULL;
3159 c->tid = next_tid(c->tid);
3160 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3161 deactivate_slab(s, slab, freelist);
3162
3163 new_slab:
3164
3165 if (slub_percpu_partial(c)) {
3166 local_lock_irqsave(&s->cpu_slab->lock, flags);
3167 if (unlikely(c->slab)) {
3168 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3169 goto reread_slab;
3170 }
3171 if (unlikely(!slub_percpu_partial(c))) {
3172 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3173 /* we were preempted and partial list got empty */
3174 goto new_objects;
3175 }
3176
3177 slab = c->slab = slub_percpu_partial(c);
3178 slub_set_percpu_partial(c, slab);
3179 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3180 stat(s, CPU_PARTIAL_ALLOC);
3181 goto redo;
3182 }
3183
3184 new_objects:
3185 pc.flags = gfpflags;
3186
3187 /* Try to get page from specific node even if __GFP_THISNODE is not set */
> 3188 if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
3189 pc.flags |= __GFP_THISNODE;
3190
3191 pc.slab = &slab;
3192 pc.orig_size = orig_size;
3193 freelist = get_partial(s, node, &pc);
3194 if (freelist)
3195 goto check_new_slab;
3196
3197 slub_put_cpu_ptr(s->cpu_slab);
3198 slab = new_slab(s, pc.flags, node);
3199 c = slub_get_cpu_ptr(s->cpu_slab);
3200
3201 if (unlikely(!slab)) {
3202 /* Try to get page from any other node */
3203 if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
3204 try_thisnode = 0;
3205 goto new_objects;
3206 }
3207
3208 slab_out_of_memory(s, gfpflags, node);
3209 return NULL;
3210 }
3211
3212 stat(s, ALLOC_SLAB);
3213
3214 if (kmem_cache_debug(s)) {
3215 freelist = alloc_single_from_new_slab(s, slab, orig_size);
3216
3217 if (unlikely(!freelist))
3218 goto new_objects;
3219
3220 if (s->flags & SLAB_STORE_USER)
3221 set_track(s, freelist, TRACK_ALLOC, addr);
3222
3223 return freelist;
3224 }
3225
3226 /*
3227 * No other reference to the slab yet so we can
3228 * muck around with it freely without cmpxchg
3229 */
3230 freelist = slab->freelist;
3231 slab->freelist = NULL;
3232 slab->inuse = slab->objects;
3233 slab->frozen = 1;
3234
3235 inc_slabs_node(s, slab_nid(slab), slab->objects);
3236
3237 check_new_slab:
3238
3239 if (kmem_cache_debug(s)) {
3240 /*
3241 * For debug caches here we had to go through
3242 * alloc_single_from_partial() so just store the tracking info
3243 * and return the object
3244 */
3245 if (s->flags & SLAB_STORE_USER)
3246 set_track(s, freelist, TRACK_ALLOC, addr);
3247
3248 return freelist;
3249 }
3250
3251 if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
3252 /*
3253 * For !pfmemalloc_match() case we don't load freelist so that
3254 * we don't make further mismatched allocations easier.
3255 */
3256 deactivate_slab(s, slab, get_freepointer(s, freelist));
3257 return freelist;
3258 }
3259
3260 retry_load_slab:
3261
3262 local_lock_irqsave(&s->cpu_slab->lock, flags);
3263 if (unlikely(c->slab)) {
3264 void *flush_freelist = c->freelist;
3265 struct slab *flush_slab = c->slab;
3266
3267 c->slab = NULL;
3268 c->freelist = NULL;
3269 c->tid = next_tid(c->tid);
3270
3271 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3272
3273 deactivate_slab(s, flush_slab, flush_freelist);
3274
3275 stat(s, CPUSLAB_FLUSH);
3276
3277 goto retry_load_slab;
3278 }
3279 c->slab = slab;
3280
3281 goto load_freelist;
3282 }
3283
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
2023-03-07 8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
2023-03-07 10:30 ` kernel test robot
2023-03-07 10:50 ` kernel test robot
@ 2023-03-07 14:20 ` Hyeonggon Yoo
2023-03-08 7:16 ` chenjun (AM)
2023-03-08 10:43 ` kernel test robot
3 siblings, 1 reply; 8+ messages in thread
From: Hyeonggon Yoo @ 2023-03-07 14:20 UTC (permalink / raw)
To: Chen Jun
Cc: linux-kernel, linux-mm, cl, penberg, rientjes, iamjoonsoo.kim,
akpm, vbabka, xuqiang36
On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
> If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
> Slub will alloc a slub page which is not belong to A, and put the page
> to kmem_cache_node[page_to_nid(page)]. The page can not be reused
> at next calling, because NULL will be get from get_partical().
> That make kmalloc_node consume more memory.
Hello,
elaborating a little bit:
"When kmalloc_node() is called without __GFP_THISNODE and the target node
lacks sufficient memory, SLUB allocates a folio from a different node other
than the requested node, instead of taking a partial slab from it.
However, since the allocated folio does not belong to the requested node,
it is deactivated and added to the partial slab list of the node it
belongs to.
This behavior can result in excessive memory usage when the requested
node has insufficient memory, as SLUB will repeatedly allocate folios from
other nodes without reusing the previously allocated ones.
To prevent memory wastage, take a partial slab from a different node when
the requested node has no partial slab and __GFP_THISNODE is not explicitly
specified."
> On qemu with 4 numas and each numa has 1G memory, Write a test ko
> to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.
>
> cat /proc/slabinfo shows:
> kmalloc-256 4302317 15151808 256 32 2 : tunables..
>
> the total objects is much more then active objects.
>
> After this patch, cat /prac/slubinfo shows:
> kmalloc-256 5244950 5245088 256 32 2 : tunables..
>
> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> ---
> mm/slub.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 39327e98fce3..c0090a5de54e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
> searchnode = numa_mem_id();
>
> object = get_partial_node(s, get_node(s, searchnode), pc);
> - if (object || node != NUMA_NO_NODE)
> + if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
> return object;
I think the problem here is to avoid taking a partial slab from
different node than the requested node even if __GFP_THISNODE is not set.
(and then allocating new slab instead)
Thus this hunk makes sense to me,
even if SLUB currently do not implement __GFP_THISNODE semantics.
> return get_any_partial(s, pc);
> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> struct slab *slab;
> unsigned long flags;
> struct partial_context pc;
> + int try_thisndoe = 0;
>
>
> stat(s, ALLOC_SLOWPATH);
>
> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> }
>
> new_objects:
> -
> pc.flags = gfpflags;
> +
> + /* Try to get page from specific node even if __GFP_THISNODE is not set */
> + if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
> + pc.flags |= __GFP_THISNODE;
> +
> pc.slab = &slab;
> pc.orig_size = orig_size;
> freelist = get_partial(s, node, &pc);
> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> goto check_new_slab;
>
> slub_put_cpu_ptr(s->cpu_slab);
> - slab = new_slab(s, gfpflags, node);
> + slab = new_slab(s, pc.flags, node);
> c = slub_get_cpu_ptr(s->cpu_slab);
>
> if (unlikely(!slab)) {
> + /* Try to get page from any other node */
> + if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
> + try_thisnode = 0;
> + goto new_objects;
> + }
> +
> slab_out_of_memory(s, gfpflags, node);
> return NULL;
But these hunks do not make sense to me.
Why force __GFP_THISNODE even when the caller did not specify it?
(Apart from the fact that try_thisnode is defined as try_thisndoe,
and try_thisnode is never set to nonzero value.)
IMHO the first hunk is enough to solve the problem.
Thanks,
Hyeonggon
> }
> --
> 2.17.1
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
2023-03-07 14:20 ` Hyeonggon Yoo
@ 2023-03-08 7:16 ` chenjun (AM)
2023-03-08 13:37 ` Hyeonggon Yoo
0 siblings, 1 reply; 8+ messages in thread
From: chenjun (AM) @ 2023-03-08 7:16 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com,
penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com,
akpm@linux-foundation.org, vbabka@suse.cz, xuqiang (M)
Hi,
Thanks for reply.
在 2023/3/7 22:20, Hyeonggon Yoo 写道:
> On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
>> If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
>> Slub will alloc a slub page which is not belong to A, and put the page
>> to kmem_cache_node[page_to_nid(page)]. The page can not be reused
>> at next calling, because NULL will be get from get_partical().
>> That make kmalloc_node consume more memory.
>
> Hello,
>
> elaborating a little bit:
>
> "When kmalloc_node() is called without __GFP_THISNODE and the target node
> lacks sufficient memory, SLUB allocates a folio from a different node other
> than the requested node, instead of taking a partial slab from it.
>
> However, since the allocated folio does not belong to the requested node,
> it is deactivated and added to the partial slab list of the node it
> belongs to.
>
> This behavior can result in excessive memory usage when the requested
> node has insufficient memory, as SLUB will repeatedly allocate folios from
> other nodes without reusing the previously allocated ones.
>
> To prevent memory wastage, take a partial slab from a different node when
> the requested node has no partial slab and __GFP_THISNODE is not explicitly
> specified."
>
Thanks, This is more clear than what I described.
>> On qemu with 4 numas and each numa has 1G memory, Write a test ko
>> to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.
>>
>> cat /proc/slabinfo shows:
>> kmalloc-256 4302317 15151808 256 32 2 : tunables..
>>
>> the total objects is much more then active objects.
>>
>> After this patch, cat /prac/slubinfo shows:
>> kmalloc-256 5244950 5245088 256 32 2 : tunables..
>>
>> Signed-off-by: Chen Jun <chenjun102@huawei.com>
>> ---
>> mm/slub.c | 17 ++++++++++++++---
>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 39327e98fce3..c0090a5de54e 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
>> searchnode = numa_mem_id();
>>
>> object = get_partial_node(s, get_node(s, searchnode), pc);
>> - if (object || node != NUMA_NO_NODE)
>> + if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
>> return object;
>
> I think the problem here is to avoid taking a partial slab from
> different node than the requested node even if __GFP_THISNODE is not set.
> (and then allocating new slab instead)
>
> Thus this hunk makes sense to me,
> even if SLUB currently do not implement __GFP_THISNODE semantics.
>
>> return get_any_partial(s, pc);
>> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> struct slab *slab;
>> unsigned long flags;
>> struct partial_context pc;
>> + int try_thisndoe = 0;
>>
>>
>> stat(s, ALLOC_SLOWPATH);
>>
>> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> }
>>
>> new_objects:
>> -
>> pc.flags = gfpflags;
>> +
>> + /* Try to get page from specific node even if __GFP_THISNODE is not set */
>> + if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
>> + pc.flags |= __GFP_THISNODE;
>> +
>> pc.slab = &slab;
>> pc.orig_size = orig_size;
>> freelist = get_partial(s, node, &pc);
>> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> goto check_new_slab;
>>
>> slub_put_cpu_ptr(s->cpu_slab);
>> - slab = new_slab(s, gfpflags, node);
>> + slab = new_slab(s, pc.flags, node);
>> c = slub_get_cpu_ptr(s->cpu_slab);
>>
>> if (unlikely(!slab)) {
>> + /* Try to get page from any other node */
>> + if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
>> + try_thisnode = 0;
>> + goto new_objects;
>> + }
>> +
>> slab_out_of_memory(s, gfpflags, node);
>> return NULL;
>
> But these hunks do not make sense to me.
> Why force __GFP_THISNODE even when the caller did not specify it?
>
> (Apart from the fact that try_thisnode is defined as try_thisndoe,
> and try_thisnode is never set to nonzero value.)
My mistake, It should be:
int try_thisnode = 0;
>
> IMHO the first hunk is enough to solve the problem.
I think, we should try to alloc a page on the target node before getting
one from other nodes' partial.
If the caller does not specify __GFP_THISNODE, we add __GFP_THISNODE to
try to get the slab only on the target node. If it fails, use the
original GFP FLAG to allow fallback.
>
> Thanks,
> Hyeonggon
>
>> }
>> --
>> 2.17.1
>>
>>
>
Thanks,
Chen Jun
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
2023-03-07 8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
` (2 preceding siblings ...)
2023-03-07 14:20 ` Hyeonggon Yoo
@ 2023-03-08 10:43 ` kernel test robot
3 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2023-03-08 10:43 UTC (permalink / raw)
To: Chen Jun; +Cc: llvm, oe-kbuild-all
Hi Chen,
[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230307082811.120774-1-chenjun102%40huawei.com
patch subject: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
config: i386-randconfig-a013-20230306 (https://download.01.org/0day-ci/archive/20230308/202303081804.A6mbhXOo-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/d00decb3ccc45b514f25df0d3775153a933ffaa9
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
git checkout d00decb3ccc45b514f25df0d3775153a933ffaa9
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303081804.A6mbhXOo-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/slub.c:3188:62: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
^~~~~~~~~~~~
try_thisndoe
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^
include/linux/compiler.h:58:52: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^
mm/slub.c:3072:6: note: 'try_thisndoe' declared here
int try_thisndoe = 0;
^
>> mm/slub.c:3188:62: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
^~~~~~~~~~~~
try_thisndoe
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^
include/linux/compiler.h:58:61: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^
mm/slub.c:3072:6: note: 'try_thisndoe' declared here
int try_thisndoe = 0;
^
>> mm/slub.c:3188:62: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
^~~~~~~~~~~~
try_thisndoe
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^
include/linux/compiler.h:58:86: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^
include/linux/compiler.h:69:3: note: expanded from macro '__trace_if_value'
(cond) ? \
^
mm/slub.c:3072:6: note: 'try_thisndoe' declared here
int try_thisndoe = 0;
^
mm/slub.c:3203:63: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
^~~~~~~~~~~~
try_thisndoe
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^
include/linux/compiler.h:58:52: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^
mm/slub.c:3072:6: note: 'try_thisndoe' declared here
int try_thisndoe = 0;
^
mm/slub.c:3203:63: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
^~~~~~~~~~~~
try_thisndoe
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^
include/linux/compiler.h:58:61: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^
mm/slub.c:3072:6: note: 'try_thisndoe' declared here
int try_thisndoe = 0;
^
mm/slub.c:3203:63: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
^~~~~~~~~~~~
try_thisndoe
include/linux/compiler.h:56:47: note: expanded from macro 'if'
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
^
include/linux/compiler.h:58:86: note: expanded from macro '__trace_if_var'
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
^
include/linux/compiler.h:69:3: note: expanded from macro '__trace_if_value'
(cond) ? \
^
mm/slub.c:3072:6: note: 'try_thisndoe' declared here
int try_thisndoe = 0;
^
mm/slub.c:3204:4: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
try_thisnode = 0;
^~~~~~~~~~~~
try_thisndoe
mm/slub.c:3072:6: note: 'try_thisndoe' declared here
int try_thisndoe = 0;
^
7 errors generated.
vim +3188 mm/slub.c
3045
3046 /*
3047 * Slow path. The lockless freelist is empty or we need to perform
3048 * debugging duties.
3049 *
3050 * Processing is still very fast if new objects have been freed to the
3051 * regular freelist. In that case we simply take over the regular freelist
3052 * as the lockless freelist and zap the regular freelist.
3053 *
3054 * If that is not working then we fall back to the partial lists. We take the
3055 * first element of the freelist as the object to allocate now and move the
3056 * rest of the freelist to the lockless freelist.
3057 *
3058 * And if we were unable to get a new slab from the partial slab lists then
3059 * we need to allocate a new slab. This is the slowest path since it involves
3060 * a call to the page allocator and the setup of a new slab.
3061 *
3062 * Version of __slab_alloc to use when we know that preemption is
3063 * already disabled (which is the case for bulk allocation).
3064 */
3065 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
3066 unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
3067 {
3068 void *freelist;
3069 struct slab *slab;
3070 unsigned long flags;
3071 struct partial_context pc;
3072 int try_thisndoe = 0;
3073
3074 stat(s, ALLOC_SLOWPATH);
3075
3076 reread_slab:
3077
3078 slab = READ_ONCE(c->slab);
3079 if (!slab) {
3080 /*
3081 * if the node is not online or has no normal memory, just
3082 * ignore the node constraint
3083 */
3084 if (unlikely(node != NUMA_NO_NODE &&
3085 !node_isset(node, slab_nodes)))
3086 node = NUMA_NO_NODE;
3087 goto new_slab;
3088 }
3089 redo:
3090
3091 if (unlikely(!node_match(slab, node))) {
3092 /*
3093 * same as above but node_match() being false already
3094 * implies node != NUMA_NO_NODE
3095 */
3096 if (!node_isset(node, slab_nodes)) {
3097 node = NUMA_NO_NODE;
3098 } else {
3099 stat(s, ALLOC_NODE_MISMATCH);
3100 goto deactivate_slab;
3101 }
3102 }
3103
3104 /*
3105 * By rights, we should be searching for a slab page that was
3106 * PFMEMALLOC but right now, we are losing the pfmemalloc
3107 * information when the page leaves the per-cpu allocator
3108 */
3109 if (unlikely(!pfmemalloc_match(slab, gfpflags)))
3110 goto deactivate_slab;
3111
3112 /* must check again c->slab in case we got preempted and it changed */
3113 local_lock_irqsave(&s->cpu_slab->lock, flags);
3114 if (unlikely(slab != c->slab)) {
3115 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3116 goto reread_slab;
3117 }
3118 freelist = c->freelist;
3119 if (freelist)
3120 goto load_freelist;
3121
3122 freelist = get_freelist(s, slab);
3123
3124 if (!freelist) {
3125 c->slab = NULL;
3126 c->tid = next_tid(c->tid);
3127 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3128 stat(s, DEACTIVATE_BYPASS);
3129 goto new_slab;
3130 }
3131
3132 stat(s, ALLOC_REFILL);
3133
3134 load_freelist:
3135
3136 lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
3137
3138 /*
3139 * freelist is pointing to the list of objects to be used.
3140 * slab is pointing to the slab from which the objects are obtained.
3141 * That slab must be frozen for per cpu allocations to work.
3142 */
3143 VM_BUG_ON(!c->slab->frozen);
3144 c->freelist = get_freepointer(s, freelist);
3145 c->tid = next_tid(c->tid);
3146 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3147 return freelist;
3148
3149 deactivate_slab:
3150
3151 local_lock_irqsave(&s->cpu_slab->lock, flags);
3152 if (slab != c->slab) {
3153 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3154 goto reread_slab;
3155 }
3156 freelist = c->freelist;
3157 c->slab = NULL;
3158 c->freelist = NULL;
3159 c->tid = next_tid(c->tid);
3160 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3161 deactivate_slab(s, slab, freelist);
3162
3163 new_slab:
3164
3165 if (slub_percpu_partial(c)) {
3166 local_lock_irqsave(&s->cpu_slab->lock, flags);
3167 if (unlikely(c->slab)) {
3168 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3169 goto reread_slab;
3170 }
3171 if (unlikely(!slub_percpu_partial(c))) {
3172 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3173 /* we were preempted and partial list got empty */
3174 goto new_objects;
3175 }
3176
3177 slab = c->slab = slub_percpu_partial(c);
3178 slub_set_percpu_partial(c, slab);
3179 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3180 stat(s, CPU_PARTIAL_ALLOC);
3181 goto redo;
3182 }
3183
3184 new_objects:
3185 pc.flags = gfpflags;
3186
3187 /* Try to get page from specific node even if __GFP_THISNODE is not set */
> 3188 if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
3189 pc.flags |= __GFP_THISNODE;
3190
3191 pc.slab = &slab;
3192 pc.orig_size = orig_size;
3193 freelist = get_partial(s, node, &pc);
3194 if (freelist)
3195 goto check_new_slab;
3196
3197 slub_put_cpu_ptr(s->cpu_slab);
3198 slab = new_slab(s, pc.flags, node);
3199 c = slub_get_cpu_ptr(s->cpu_slab);
3200
3201 if (unlikely(!slab)) {
3202 /* Try to get page from any other node */
3203 if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
3204 try_thisnode = 0;
3205 goto new_objects;
3206 }
3207
3208 slab_out_of_memory(s, gfpflags, node);
3209 return NULL;
3210 }
3211
3212 stat(s, ALLOC_SLAB);
3213
3214 if (kmem_cache_debug(s)) {
3215 freelist = alloc_single_from_new_slab(s, slab, orig_size);
3216
3217 if (unlikely(!freelist))
3218 goto new_objects;
3219
3220 if (s->flags & SLAB_STORE_USER)
3221 set_track(s, freelist, TRACK_ALLOC, addr);
3222
3223 return freelist;
3224 }
3225
3226 /*
3227 * No other reference to the slab yet so we can
3228 * muck around with it freely without cmpxchg
3229 */
3230 freelist = slab->freelist;
3231 slab->freelist = NULL;
3232 slab->inuse = slab->objects;
3233 slab->frozen = 1;
3234
3235 inc_slabs_node(s, slab_nid(slab), slab->objects);
3236
3237 check_new_slab:
3238
3239 if (kmem_cache_debug(s)) {
3240 /*
3241 * For debug caches here we had to go through
3242 * alloc_single_from_partial() so just store the tracking info
3243 * and return the object
3244 */
3245 if (s->flags & SLAB_STORE_USER)
3246 set_track(s, freelist, TRACK_ALLOC, addr);
3247
3248 return freelist;
3249 }
3250
3251 if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
3252 /*
3253 * For !pfmemalloc_match() case we don't load freelist so that
3254 * we don't make further mismatched allocations easier.
3255 */
3256 deactivate_slab(s, slab, get_freepointer(s, freelist));
3257 return freelist;
3258 }
3259
3260 retry_load_slab:
3261
3262 local_lock_irqsave(&s->cpu_slab->lock, flags);
3263 if (unlikely(c->slab)) {
3264 void *flush_freelist = c->freelist;
3265 struct slab *flush_slab = c->slab;
3266
3267 c->slab = NULL;
3268 c->freelist = NULL;
3269 c->tid = next_tid(c->tid);
3270
3271 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
3272
3273 deactivate_slab(s, flush_slab, flush_freelist);
3274
3275 stat(s, CPUSLAB_FLUSH);
3276
3277 goto retry_load_slab;
3278 }
3279 c->slab = slab;
3280
3281 goto load_freelist;
3282 }
3283
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
2023-03-08 7:16 ` chenjun (AM)
@ 2023-03-08 13:37 ` Hyeonggon Yoo
2023-03-09 2:15 ` chenjun (AM)
0 siblings, 1 reply; 8+ messages in thread
From: Hyeonggon Yoo @ 2023-03-08 13:37 UTC (permalink / raw)
To: chenjun (AM)
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com,
penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com,
akpm@linux-foundation.org, vbabka@suse.cz, xuqiang (M)
On Wed, Mar 08, 2023 at 07:16:49AM +0000, chenjun (AM) wrote:
> Hi,
>
> Thanks for reply.
>
> 在 2023/3/7 22:20, Hyeonggon Yoo 写道:
> > On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
> >> If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
> >> Slub will alloc a slub page which is not belong to A, and put the page
> >> to kmem_cache_node[page_to_nid(page)]. The page can not be reused
> >> at next calling, because NULL will be get from get_partical().
> >> That make kmalloc_node consume more memory.
> >
> > Hello,
> >
> > elaborating a little bit:
> >
> > "When kmalloc_node() is called without __GFP_THISNODE and the target node
> > lacks sufficient memory, SLUB allocates a folio from a different node other
> > than the requested node, instead of taking a partial slab from it.
> >
> > However, since the allocated folio does not belong to the requested node,
> > it is deactivated and added to the partial slab list of the node it
> > belongs to.
> >
> > This behavior can result in excessive memory usage when the requested
> > node has insufficient memory, as SLUB will repeatedly allocate folios from
> > other nodes without reusing the previously allocated ones.
> >
> > To prevent memory wastage, take a partial slab from a different node when
> > the requested node has no partial slab and __GFP_THISNODE is not explicitly
> > specified."
> >
>
> Thanks, This is more clear than what I described.
>
> >> On qemu with 4 numas and each numa has 1G memory, Write a test ko
> >> to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.
> >>
> >> cat /proc/slabinfo shows:
> >> kmalloc-256 4302317 15151808 256 32 2 : tunables..
> >>
> >> the total objects is much more then active objects.
> >>
> >> After this patch, cat /prac/slubinfo shows:
> >> kmalloc-256 5244950 5245088 256 32 2 : tunables..
> >>
> >> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> >> ---
> >> mm/slub.c | 17 ++++++++++++++---
> >> 1 file changed, 14 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 39327e98fce3..c0090a5de54e 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
> >> searchnode = numa_mem_id();
> >>
> >> object = get_partial_node(s, get_node(s, searchnode), pc);
> >> - if (object || node != NUMA_NO_NODE)
> >> + if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
> >> return object;
> >
> > I think the problem here is to avoid taking a partial slab from
> > different node than the requested node even if __GFP_THISNODE is not set.
> > (and then allocating new slab instead)
> >
> > Thus this hunk makes sense to me,
> > even if SLUB currently do not implement __GFP_THISNODE semantics.
> >
> >> return get_any_partial(s, pc);
> >> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >> struct slab *slab;
> >> unsigned long flags;
> >> struct partial_context pc;
> >> + int try_thisndoe = 0;
> >>
> >>
> >> stat(s, ALLOC_SLOWPATH);
> >>
> >> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >> }
> >>
> >> new_objects:
> >> -
> >> pc.flags = gfpflags;
> >> +
> >> + /* Try to get page from specific node even if __GFP_THISNODE is not set */
> >> + if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
> >> + pc.flags |= __GFP_THISNODE;
> >> +
> >> pc.slab = &slab;
> >> pc.orig_size = orig_size;
> >> freelist = get_partial(s, node, &pc);
> >> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >> goto check_new_slab;
> >>
> >> slub_put_cpu_ptr(s->cpu_slab);
> >> - slab = new_slab(s, gfpflags, node);
> >> + slab = new_slab(s, pc.flags, node);
> >> c = slub_get_cpu_ptr(s->cpu_slab);
> >>
> >> if (unlikely(!slab)) {
> >> + /* Try to get page from any other node */
> >> + if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
> >> + try_thisnode = 0;
> >> + goto new_objects;
> >> + }
> >> +
> >> slab_out_of_memory(s, gfpflags, node);
> >> return NULL;
> >
> > But these hunks do not make sense to me.
> > Why force __GFP_THISNODE even when the caller did not specify it?
> >
> > (Apart from the fact that try_thisnode is defined as try_thisndoe,
> > and try_thisnode is never set to nonzero value.)
>
> My mistake, It should be:
> int try_thisnode = 0;
I think it should be try_thisnode = 1?
Otherwise it won't be executed at all.
Also bool type will be more readable than int.
>
> >
> > IMHO the first hunk is enough to solve the problem.
>
> I think, we should try to alloc a page on the target node before getting
> one from other nodes' partial.
You are right.
Hmm then the new behavior when
(node != NUMA_NO_NODE) && (gfpflags & __GFP_THISNODE) is:
1) try to get a partial slab from target node with __GFP_THISNODE
2) if 1) failed, try to allocate a new slab from target node with __GFP_THISNODE
3) if 2) failed, retry 1) and 2) without __GFP_THISNODE constraint
when node != NUMA_NO_NODE || (gfpflags & __GFP_THISNODE), the behavior
remains unchanged.
It does not look that crazy to me, although it complicates the code
a little bit. (Vlastimil may have some opinions?)
Now that I understand your intention, I think this behavior change also
need to be added to the commit log.
Thanks,
Hyeonggon
> If the caller does not specify __GFP_THISNODE, we add __GFP_THISNODE to
> try to get the slab only on the target node. If it fails, use the
> original GFP FLAG to allow fallback.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
2023-03-08 13:37 ` Hyeonggon Yoo
@ 2023-03-09 2:15 ` chenjun (AM)
0 siblings, 0 replies; 8+ messages in thread
From: chenjun (AM) @ 2023-03-09 2:15 UTC (permalink / raw)
To: Hyeonggon Yoo
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com,
penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com,
akpm@linux-foundation.org, vbabka@suse.cz, xuqiang (M)
在 2023/3/8 21:37, Hyeonggon Yoo 写道:
> On Wed, Mar 08, 2023 at 07:16:49AM +0000, chenjun (AM) wrote:
>> Hi,
>>
>> Thanks for reply.
>>
>> 在 2023/3/7 22:20, Hyeonggon Yoo 写道:
>>> On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
>>>> If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
>>>> Slub will alloc a slub page which is not belong to A, and put the page
>>>> to kmem_cache_node[page_to_nid(page)]. The page can not be reused
>>>> at next calling, because NULL will be get from get_partical().
>>>> That make kmalloc_node consume more memory.
>>>
>>> Hello,
>>>
>>> elaborating a little bit:
>>>
>>> "When kmalloc_node() is called without __GFP_THISNODE and the target node
>>> lacks sufficient memory, SLUB allocates a folio from a different node other
>>> than the requested node, instead of taking a partial slab from it.
>>>
>>> However, since the allocated folio does not belong to the requested node,
>>> it is deactivated and added to the partial slab list of the node it
>>> belongs to.
>>>
>>> This behavior can result in excessive memory usage when the requested
>>> node has insufficient memory, as SLUB will repeatedly allocate folios from
>>> other nodes without reusing the previously allocated ones.
>>>
>>> To prevent memory wastage, take a partial slab from a different node when
>>> the requested node has no partial slab and __GFP_THISNODE is not explicitly
>>> specified."
>>>
>>
>> Thanks, This is more clear than what I described.
>>
>>>> On qemu with 4 numas and each numa has 1G memory, Write a test ko
>>>> to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.
>>>>
>>>> cat /proc/slabinfo shows:
>>>> kmalloc-256 4302317 15151808 256 32 2 : tunables..
>>>>
>>>> the total objects is much more then active objects.
>>>>
>>>> After this patch, cat /prac/slubinfo shows:
>>>> kmalloc-256 5244950 5245088 256 32 2 : tunables..
>>>>
>>>> Signed-off-by: Chen Jun <chenjun102@huawei.com>
>>>> ---
>>>> mm/slub.c | 17 ++++++++++++++---
>>>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/slub.c b/mm/slub.c
>>>> index 39327e98fce3..c0090a5de54e 100644
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
>>>> searchnode = numa_mem_id();
>>>>
>>>> object = get_partial_node(s, get_node(s, searchnode), pc);
>>>> - if (object || node != NUMA_NO_NODE)
>>>> + if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
>>>> return object;
>>>
>>> I think the problem here is to avoid taking a partial slab from
>>> different node than the requested node even if __GFP_THISNODE is not set.
>>> (and then allocating new slab instead)
>>>
>>> Thus this hunk makes sense to me,
>>> even if SLUB currently do not implement __GFP_THISNODE semantics.
>>>
>>>> return get_any_partial(s, pc);
>>>> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>> struct slab *slab;
>>>> unsigned long flags;
>>>> struct partial_context pc;
>>>> + int try_thisndoe = 0;
>>>>
>>>>
>>>> stat(s, ALLOC_SLOWPATH);
>>>>
>>>> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>> }
>>>>
>>>> new_objects:
>>>> -
>>>> pc.flags = gfpflags;
>>>> +
>>>> + /* Try to get page from specific node even if __GFP_THISNODE is not set */
>>>> + if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
>>>> + pc.flags |= __GFP_THISNODE;
>>>> +
Any suggestions to make the change more elegant?
>>>> pc.slab = &slab;
>>>> pc.orig_size = orig_size;
>>>> freelist = get_partial(s, node, &pc);
>>>> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>> goto check_new_slab;
>>>>
>>>> slub_put_cpu_ptr(s->cpu_slab);
>>>> - slab = new_slab(s, gfpflags, node);
>>>> + slab = new_slab(s, pc.flags, node);
>>>> c = slub_get_cpu_ptr(s->cpu_slab);
>>>>
>>>> if (unlikely(!slab)) {
>>>> + /* Try to get page from any other node */
>>>> + if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
>>>> + try_thisnode = 0;
>>>> + goto new_objects;
>>>> + }
>>>> +
>>>> slab_out_of_memory(s, gfpflags, node);
>>>> return NULL;
>>>
>>> But these hunks do not make sense to me.
>>> Why force __GFP_THISNODE even when the caller did not specify it?
>>>
>>> (Apart from the fact that try_thisnode is defined as try_thisndoe,
>>> and try_thisnode is never set to nonzero value.)
>>
>> My mistake, It should be:
>> int try_thisnode = 0;
>
> I think it should be try_thisnode = 1?
> Otherwise it won't be executed at all.
> Also bool type will be more readable than int.
>
>>
>>>
>>> IMHO the first hunk is enough to solve the problem.
>>
>> I think, we should try to alloc a page on the target node before getting
>> one from other nodes' partial.
>
> You are right.
>
> Hmm then the new behavior when
> (node != NUMA_NO_NODE) && (gfpflags & __GFP_THISNODE) is:
>
> 1) try to get a partial slab from target node with __GFP_THISNODE
> 2) if 1) failed, try to allocate a new slab from target node with __GFP_THISNODE
> 3) if 2) failed, retry 1) and 2) without __GFP_THISNODE constraint
>
> when node != NUMA_NO_NODE || (gfpflags & __GFP_THISNODE), the behavior
> remains unchanged.
>
> It does not look that crazy to me, although it complicates the code
> a little bit. (Vlastimil may have some opinions?)
>
> Now that I understand your intention, I think this behavior change also
> need to be added to the commit log.
>
I will add it.
> Thanks,
> Hyeonggon
>
>> If the caller does not specify __GFP_THISNODE, we add __GFP_THISNODE to
>> try to get the slab only on the target node. If it fails, use the
>> original GFP FLAG to allow fallback.
>
If there are no other questions, I will send an official patch.
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2023-03-09 2:16 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-07 8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
2023-03-07 10:30 ` kernel test robot
2023-03-07 10:50 ` kernel test robot
2023-03-07 14:20 ` Hyeonggon Yoo
2023-03-08 7:16 ` chenjun (AM)
2023-03-08 13:37 ` Hyeonggon Yoo
2023-03-09 2:15 ` chenjun (AM)
2023-03-08 10:43 ` kernel test robot
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.