All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] mm/slub: Reduce memory consumption in extreme scenarios
@ 2023-03-07  8:28 Chen Jun
  2023-03-07 10:30 ` kernel test robot
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Chen Jun @ 2023-03-07  8:28 UTC (permalink / raw)
  To: linux-kernel, linux-mm, cl, penberg, rientjes, iamjoonsoo.kim,
	akpm, vbabka
  Cc: xuqiang36, chenjun102

If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
Slub will alloc a slub page which is not belong to A, and put the page
to kmem_cache_node[page_to_nid(page)]. The page can not be reused
at next calling, because NULL will be get from get_partical().
That make kmalloc_node consume more memory.

On qemu with 4 numas and each numa has 1G memory, Write a test ko
to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.

cat /proc/slabinfo shows:
kmalloc-256       4302317 15151808    256   32    2 : tunables..

the total objects is much more then active objects.

After this patch, cat /prac/slubinfo shows:
kmalloc-256       5244950 5245088    256   32    2 : tunables..

Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
 mm/slub.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 39327e98fce3..c0090a5de54e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
 		searchnode = numa_mem_id();
 
 	object = get_partial_node(s, get_node(s, searchnode), pc);
-	if (object || node != NUMA_NO_NODE)
+	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
 		return object;
 
 	return get_any_partial(s, pc);
@@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	struct slab *slab;
 	unsigned long flags;
 	struct partial_context pc;
+	int try_thisndoe = 0;
 
 	stat(s, ALLOC_SLOWPATH);
 
@@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	}
 
 new_objects:
-
 	pc.flags = gfpflags;
+
+	/* Try to get page from specific node even if __GFP_THISNODE is not set */
+	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
+			pc.flags |= __GFP_THISNODE;
+
 	pc.slab = &slab;
 	pc.orig_size = orig_size;
 	freelist = get_partial(s, node, &pc);
@@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto check_new_slab;
 
 	slub_put_cpu_ptr(s->cpu_slab);
-	slab = new_slab(s, gfpflags, node);
+	slab = new_slab(s, pc.flags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);
 
 	if (unlikely(!slab)) {
+		/* Try to get page from any other node */
+		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
+			try_thisnode = 0;
+			goto new_objects;
+		}
+
 		slab_out_of_memory(s, gfpflags, node);
 		return NULL;
 	}
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
  2023-03-07  8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
@ 2023-03-07 10:30 ` kernel test robot
  2023-03-07 10:50 ` kernel test robot
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2023-03-07 10:30 UTC (permalink / raw)
  To: Chen Jun; +Cc: oe-kbuild-all

Hi Chen,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230307082811.120774-1-chenjun102%40huawei.com
patch subject: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
config: i386-allnoconfig (https://download.01.org/0day-ci/archive/20230307/202303071836.MYJQsNfz-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/d00decb3ccc45b514f25df0d3775153a933ffaa9
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
        git checkout d00decb3ccc45b514f25df0d3775153a933ffaa9
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 olddefconfig
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303071836.MYJQsNfz-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/slub.c: In function '___slab_alloc':
   mm/slub.c:3188:69: error: 'try_thisnode' undeclared (first use in this function); did you mean 'try_thisndoe'?
    3188 |         if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
         |                                                                     ^~~~~~~~~~~~
         |                                                                     try_thisndoe
   mm/slub.c:3188:69: note: each undeclared identifier is reported only once for each function it appears in
>> mm/slub.c:3072:13: warning: unused variable 'try_thisndoe' [-Wunused-variable]
    3072 |         int try_thisndoe = 0;
         |             ^~~~~~~~~~~~


vim +/try_thisndoe +3072 mm/slub.c

  3045	
  3046	/*
  3047	 * Slow path. The lockless freelist is empty or we need to perform
  3048	 * debugging duties.
  3049	 *
  3050	 * Processing is still very fast if new objects have been freed to the
  3051	 * regular freelist. In that case we simply take over the regular freelist
  3052	 * as the lockless freelist and zap the regular freelist.
  3053	 *
  3054	 * If that is not working then we fall back to the partial lists. We take the
  3055	 * first element of the freelist as the object to allocate now and move the
  3056	 * rest of the freelist to the lockless freelist.
  3057	 *
  3058	 * And if we were unable to get a new slab from the partial slab lists then
  3059	 * we need to allocate a new slab. This is the slowest path since it involves
  3060	 * a call to the page allocator and the setup of a new slab.
  3061	 *
  3062	 * Version of __slab_alloc to use when we know that preemption is
  3063	 * already disabled (which is the case for bulk allocation).
  3064	 */
  3065	static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
  3066				  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
  3067	{
  3068		void *freelist;
  3069		struct slab *slab;
  3070		unsigned long flags;
  3071		struct partial_context pc;
> 3072		int try_thisndoe = 0;
  3073	
  3074		stat(s, ALLOC_SLOWPATH);
  3075	
  3076	reread_slab:
  3077	
  3078		slab = READ_ONCE(c->slab);
  3079		if (!slab) {
  3080			/*
  3081			 * if the node is not online or has no normal memory, just
  3082			 * ignore the node constraint
  3083			 */
  3084			if (unlikely(node != NUMA_NO_NODE &&
  3085				     !node_isset(node, slab_nodes)))
  3086				node = NUMA_NO_NODE;
  3087			goto new_slab;
  3088		}
  3089	redo:
  3090	
  3091		if (unlikely(!node_match(slab, node))) {
  3092			/*
  3093			 * same as above but node_match() being false already
  3094			 * implies node != NUMA_NO_NODE
  3095			 */
  3096			if (!node_isset(node, slab_nodes)) {
  3097				node = NUMA_NO_NODE;
  3098			} else {
  3099				stat(s, ALLOC_NODE_MISMATCH);
  3100				goto deactivate_slab;
  3101			}
  3102		}
  3103	
  3104		/*
  3105		 * By rights, we should be searching for a slab page that was
  3106		 * PFMEMALLOC but right now, we are losing the pfmemalloc
  3107		 * information when the page leaves the per-cpu allocator
  3108		 */
  3109		if (unlikely(!pfmemalloc_match(slab, gfpflags)))
  3110			goto deactivate_slab;
  3111	
  3112		/* must check again c->slab in case we got preempted and it changed */
  3113		local_lock_irqsave(&s->cpu_slab->lock, flags);
  3114		if (unlikely(slab != c->slab)) {
  3115			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3116			goto reread_slab;
  3117		}
  3118		freelist = c->freelist;
  3119		if (freelist)
  3120			goto load_freelist;
  3121	
  3122		freelist = get_freelist(s, slab);
  3123	
  3124		if (!freelist) {
  3125			c->slab = NULL;
  3126			c->tid = next_tid(c->tid);
  3127			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3128			stat(s, DEACTIVATE_BYPASS);
  3129			goto new_slab;
  3130		}
  3131	
  3132		stat(s, ALLOC_REFILL);
  3133	
  3134	load_freelist:
  3135	
  3136		lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
  3137	
  3138		/*
  3139		 * freelist is pointing to the list of objects to be used.
  3140		 * slab is pointing to the slab from which the objects are obtained.
  3141		 * That slab must be frozen for per cpu allocations to work.
  3142		 */
  3143		VM_BUG_ON(!c->slab->frozen);
  3144		c->freelist = get_freepointer(s, freelist);
  3145		c->tid = next_tid(c->tid);
  3146		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3147		return freelist;
  3148	
  3149	deactivate_slab:
  3150	
  3151		local_lock_irqsave(&s->cpu_slab->lock, flags);
  3152		if (slab != c->slab) {
  3153			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3154			goto reread_slab;
  3155		}
  3156		freelist = c->freelist;
  3157		c->slab = NULL;
  3158		c->freelist = NULL;
  3159		c->tid = next_tid(c->tid);
  3160		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3161		deactivate_slab(s, slab, freelist);
  3162	
  3163	new_slab:
  3164	
  3165		if (slub_percpu_partial(c)) {
  3166			local_lock_irqsave(&s->cpu_slab->lock, flags);
  3167			if (unlikely(c->slab)) {
  3168				local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3169				goto reread_slab;
  3170			}
  3171			if (unlikely(!slub_percpu_partial(c))) {
  3172				local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3173				/* we were preempted and partial list got empty */
  3174				goto new_objects;
  3175			}
  3176	
  3177			slab = c->slab = slub_percpu_partial(c);
  3178			slub_set_percpu_partial(c, slab);
  3179			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3180			stat(s, CPU_PARTIAL_ALLOC);
  3181			goto redo;
  3182		}
  3183	
  3184	new_objects:
  3185		pc.flags = gfpflags;
  3186	
  3187		/* Try to get page from specific node even if __GFP_THISNODE is not set */
  3188		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
  3189				pc.flags |= __GFP_THISNODE;
  3190	
  3191		pc.slab = &slab;
  3192		pc.orig_size = orig_size;
  3193		freelist = get_partial(s, node, &pc);
  3194		if (freelist)
  3195			goto check_new_slab;
  3196	
  3197		slub_put_cpu_ptr(s->cpu_slab);
  3198		slab = new_slab(s, pc.flags, node);
  3199		c = slub_get_cpu_ptr(s->cpu_slab);
  3200	
  3201		if (unlikely(!slab)) {
  3202			/* Try to get page from any other node */
  3203			if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
  3204				try_thisnode = 0;
  3205				goto new_objects;
  3206			}
  3207	
  3208			slab_out_of_memory(s, gfpflags, node);
  3209			return NULL;
  3210		}
  3211	
  3212		stat(s, ALLOC_SLAB);
  3213	
  3214		if (kmem_cache_debug(s)) {
  3215			freelist = alloc_single_from_new_slab(s, slab, orig_size);
  3216	
  3217			if (unlikely(!freelist))
  3218				goto new_objects;
  3219	
  3220			if (s->flags & SLAB_STORE_USER)
  3221				set_track(s, freelist, TRACK_ALLOC, addr);
  3222	
  3223			return freelist;
  3224		}
  3225	
  3226		/*
  3227		 * No other reference to the slab yet so we can
  3228		 * muck around with it freely without cmpxchg
  3229		 */
  3230		freelist = slab->freelist;
  3231		slab->freelist = NULL;
  3232		slab->inuse = slab->objects;
  3233		slab->frozen = 1;
  3234	
  3235		inc_slabs_node(s, slab_nid(slab), slab->objects);
  3236	
  3237	check_new_slab:
  3238	
  3239		if (kmem_cache_debug(s)) {
  3240			/*
  3241			 * For debug caches here we had to go through
  3242			 * alloc_single_from_partial() so just store the tracking info
  3243			 * and return the object
  3244			 */
  3245			if (s->flags & SLAB_STORE_USER)
  3246				set_track(s, freelist, TRACK_ALLOC, addr);
  3247	
  3248			return freelist;
  3249		}
  3250	
  3251		if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
  3252			/*
  3253			 * For !pfmemalloc_match() case we don't load freelist so that
  3254			 * we don't make further mismatched allocations easier.
  3255			 */
  3256			deactivate_slab(s, slab, get_freepointer(s, freelist));
  3257			return freelist;
  3258		}
  3259	
  3260	retry_load_slab:
  3261	
  3262		local_lock_irqsave(&s->cpu_slab->lock, flags);
  3263		if (unlikely(c->slab)) {
  3264			void *flush_freelist = c->freelist;
  3265			struct slab *flush_slab = c->slab;
  3266	
  3267			c->slab = NULL;
  3268			c->freelist = NULL;
  3269			c->tid = next_tid(c->tid);
  3270	
  3271			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3272	
  3273			deactivate_slab(s, flush_slab, flush_freelist);
  3274	
  3275			stat(s, CPUSLAB_FLUSH);
  3276	
  3277			goto retry_load_slab;
  3278		}
  3279		c->slab = slab;
  3280	
  3281		goto load_freelist;
  3282	}
  3283	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
  2023-03-07  8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
  2023-03-07 10:30 ` kernel test robot
@ 2023-03-07 10:50 ` kernel test robot
  2023-03-07 14:20 ` Hyeonggon Yoo
  2023-03-08 10:43 ` kernel test robot
  3 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2023-03-07 10:50 UTC (permalink / raw)
  To: Chen Jun; +Cc: oe-kbuild-all

Hi Chen,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230307082811.120774-1-chenjun102%40huawei.com
patch subject: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20230307/202303071843.3sjPyGIX-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/d00decb3ccc45b514f25df0d3775153a933ffaa9
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
        git checkout d00decb3ccc45b514f25df0d3775153a933ffaa9
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=alpha SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303071843.3sjPyGIX-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/slub.c: In function '___slab_alloc':
>> mm/slub.c:3188:69: error: 'try_thisnode' undeclared (first use in this function); did you mean 'try_thisndoe'?
    3188 |         if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
         |                                                                     ^~~~~~~~~~~~
         |                                                                     try_thisndoe
   mm/slub.c:3188:69: note: each undeclared identifier is reported only once for each function it appears in
   mm/slub.c:3072:13: warning: unused variable 'try_thisndoe' [-Wunused-variable]
    3072 |         int try_thisndoe = 0;
         |             ^~~~~~~~~~~~


vim +3188 mm/slub.c

  3045	
  3046	/*
  3047	 * Slow path. The lockless freelist is empty or we need to perform
  3048	 * debugging duties.
  3049	 *
  3050	 * Processing is still very fast if new objects have been freed to the
  3051	 * regular freelist. In that case we simply take over the regular freelist
  3052	 * as the lockless freelist and zap the regular freelist.
  3053	 *
  3054	 * If that is not working then we fall back to the partial lists. We take the
  3055	 * first element of the freelist as the object to allocate now and move the
  3056	 * rest of the freelist to the lockless freelist.
  3057	 *
  3058	 * And if we were unable to get a new slab from the partial slab lists then
  3059	 * we need to allocate a new slab. This is the slowest path since it involves
  3060	 * a call to the page allocator and the setup of a new slab.
  3061	 *
  3062	 * Version of __slab_alloc to use when we know that preemption is
  3063	 * already disabled (which is the case for bulk allocation).
  3064	 */
  3065	static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
  3066				  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
  3067	{
  3068		void *freelist;
  3069		struct slab *slab;
  3070		unsigned long flags;
  3071		struct partial_context pc;
  3072		int try_thisndoe = 0;
  3073	
  3074		stat(s, ALLOC_SLOWPATH);
  3075	
  3076	reread_slab:
  3077	
  3078		slab = READ_ONCE(c->slab);
  3079		if (!slab) {
  3080			/*
  3081			 * if the node is not online or has no normal memory, just
  3082			 * ignore the node constraint
  3083			 */
  3084			if (unlikely(node != NUMA_NO_NODE &&
  3085				     !node_isset(node, slab_nodes)))
  3086				node = NUMA_NO_NODE;
  3087			goto new_slab;
  3088		}
  3089	redo:
  3090	
  3091		if (unlikely(!node_match(slab, node))) {
  3092			/*
  3093			 * same as above but node_match() being false already
  3094			 * implies node != NUMA_NO_NODE
  3095			 */
  3096			if (!node_isset(node, slab_nodes)) {
  3097				node = NUMA_NO_NODE;
  3098			} else {
  3099				stat(s, ALLOC_NODE_MISMATCH);
  3100				goto deactivate_slab;
  3101			}
  3102		}
  3103	
  3104		/*
  3105		 * By rights, we should be searching for a slab page that was
  3106		 * PFMEMALLOC but right now, we are losing the pfmemalloc
  3107		 * information when the page leaves the per-cpu allocator
  3108		 */
  3109		if (unlikely(!pfmemalloc_match(slab, gfpflags)))
  3110			goto deactivate_slab;
  3111	
  3112		/* must check again c->slab in case we got preempted and it changed */
  3113		local_lock_irqsave(&s->cpu_slab->lock, flags);
  3114		if (unlikely(slab != c->slab)) {
  3115			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3116			goto reread_slab;
  3117		}
  3118		freelist = c->freelist;
  3119		if (freelist)
  3120			goto load_freelist;
  3121	
  3122		freelist = get_freelist(s, slab);
  3123	
  3124		if (!freelist) {
  3125			c->slab = NULL;
  3126			c->tid = next_tid(c->tid);
  3127			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3128			stat(s, DEACTIVATE_BYPASS);
  3129			goto new_slab;
  3130		}
  3131	
  3132		stat(s, ALLOC_REFILL);
  3133	
  3134	load_freelist:
  3135	
  3136		lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
  3137	
  3138		/*
  3139		 * freelist is pointing to the list of objects to be used.
  3140		 * slab is pointing to the slab from which the objects are obtained.
  3141		 * That slab must be frozen for per cpu allocations to work.
  3142		 */
  3143		VM_BUG_ON(!c->slab->frozen);
  3144		c->freelist = get_freepointer(s, freelist);
  3145		c->tid = next_tid(c->tid);
  3146		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3147		return freelist;
  3148	
  3149	deactivate_slab:
  3150	
  3151		local_lock_irqsave(&s->cpu_slab->lock, flags);
  3152		if (slab != c->slab) {
  3153			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3154			goto reread_slab;
  3155		}
  3156		freelist = c->freelist;
  3157		c->slab = NULL;
  3158		c->freelist = NULL;
  3159		c->tid = next_tid(c->tid);
  3160		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3161		deactivate_slab(s, slab, freelist);
  3162	
  3163	new_slab:
  3164	
  3165		if (slub_percpu_partial(c)) {
  3166			local_lock_irqsave(&s->cpu_slab->lock, flags);
  3167			if (unlikely(c->slab)) {
  3168				local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3169				goto reread_slab;
  3170			}
  3171			if (unlikely(!slub_percpu_partial(c))) {
  3172				local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3173				/* we were preempted and partial list got empty */
  3174				goto new_objects;
  3175			}
  3176	
  3177			slab = c->slab = slub_percpu_partial(c);
  3178			slub_set_percpu_partial(c, slab);
  3179			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3180			stat(s, CPU_PARTIAL_ALLOC);
  3181			goto redo;
  3182		}
  3183	
  3184	new_objects:
  3185		pc.flags = gfpflags;
  3186	
  3187		/* Try to get page from specific node even if __GFP_THISNODE is not set */
> 3188		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
  3189				pc.flags |= __GFP_THISNODE;
  3190	
  3191		pc.slab = &slab;
  3192		pc.orig_size = orig_size;
  3193		freelist = get_partial(s, node, &pc);
  3194		if (freelist)
  3195			goto check_new_slab;
  3196	
  3197		slub_put_cpu_ptr(s->cpu_slab);
  3198		slab = new_slab(s, pc.flags, node);
  3199		c = slub_get_cpu_ptr(s->cpu_slab);
  3200	
  3201		if (unlikely(!slab)) {
  3202			/* Try to get page from any other node */
  3203			if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
  3204				try_thisnode = 0;
  3205				goto new_objects;
  3206			}
  3207	
  3208			slab_out_of_memory(s, gfpflags, node);
  3209			return NULL;
  3210		}
  3211	
  3212		stat(s, ALLOC_SLAB);
  3213	
  3214		if (kmem_cache_debug(s)) {
  3215			freelist = alloc_single_from_new_slab(s, slab, orig_size);
  3216	
  3217			if (unlikely(!freelist))
  3218				goto new_objects;
  3219	
  3220			if (s->flags & SLAB_STORE_USER)
  3221				set_track(s, freelist, TRACK_ALLOC, addr);
  3222	
  3223			return freelist;
  3224		}
  3225	
  3226		/*
  3227		 * No other reference to the slab yet so we can
  3228		 * muck around with it freely without cmpxchg
  3229		 */
  3230		freelist = slab->freelist;
  3231		slab->freelist = NULL;
  3232		slab->inuse = slab->objects;
  3233		slab->frozen = 1;
  3234	
  3235		inc_slabs_node(s, slab_nid(slab), slab->objects);
  3236	
  3237	check_new_slab:
  3238	
  3239		if (kmem_cache_debug(s)) {
  3240			/*
  3241			 * For debug caches here we had to go through
  3242			 * alloc_single_from_partial() so just store the tracking info
  3243			 * and return the object
  3244			 */
  3245			if (s->flags & SLAB_STORE_USER)
  3246				set_track(s, freelist, TRACK_ALLOC, addr);
  3247	
  3248			return freelist;
  3249		}
  3250	
  3251		if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
  3252			/*
  3253			 * For !pfmemalloc_match() case we don't load freelist so that
  3254			 * we don't make further mismatched allocations easier.
  3255			 */
  3256			deactivate_slab(s, slab, get_freepointer(s, freelist));
  3257			return freelist;
  3258		}
  3259	
  3260	retry_load_slab:
  3261	
  3262		local_lock_irqsave(&s->cpu_slab->lock, flags);
  3263		if (unlikely(c->slab)) {
  3264			void *flush_freelist = c->freelist;
  3265			struct slab *flush_slab = c->slab;
  3266	
  3267			c->slab = NULL;
  3268			c->freelist = NULL;
  3269			c->tid = next_tid(c->tid);
  3270	
  3271			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3272	
  3273			deactivate_slab(s, flush_slab, flush_freelist);
  3274	
  3275			stat(s, CPUSLAB_FLUSH);
  3276	
  3277			goto retry_load_slab;
  3278		}
  3279		c->slab = slab;
  3280	
  3281		goto load_freelist;
  3282	}
  3283	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
  2023-03-07  8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
  2023-03-07 10:30 ` kernel test robot
  2023-03-07 10:50 ` kernel test robot
@ 2023-03-07 14:20 ` Hyeonggon Yoo
  2023-03-08  7:16   ` chenjun (AM)
  2023-03-08 10:43 ` kernel test robot
  3 siblings, 1 reply; 8+ messages in thread
From: Hyeonggon Yoo @ 2023-03-07 14:20 UTC (permalink / raw)
  To: Chen Jun
  Cc: linux-kernel, linux-mm, cl, penberg, rientjes, iamjoonsoo.kim,
	akpm, vbabka, xuqiang36

On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
> If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
> Slub will alloc a slub page which is not belong to A, and put the page
> to kmem_cache_node[page_to_nid(page)]. The page can not be reused
> at next calling, because NULL will be get from get_partical().
> That make kmalloc_node consume more memory.

Hello,

elaborating a little bit:

"When kmalloc_node() is called without __GFP_THISNODE and the target node
lacks sufficient memory, SLUB allocates a folio from a different node other
than the requested node, instead of taking a partial slab from it.

However, since the allocated folio does not belong to the requested node,
it is deactivated and added to the partial slab list of the node it
belongs to.

This behavior can result in excessive memory usage when the requested
node has insufficient memory, as SLUB will repeatedly allocate folios from
other nodes without reusing the previously allocated ones.

To prevent memory wastage, take a partial slab from a different node when
the requested node has no partial slab and __GFP_THISNODE is not explicitly
specified."

> On qemu with 4 numas and each numa has 1G memory, Write a test ko
> to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.
> 
> cat /proc/slabinfo shows:
> kmalloc-256       4302317 15151808    256   32    2 : tunables..
> 
> the total objects is much more then active objects.
> 
> After this patch, cat /prac/slubinfo shows:
> kmalloc-256       5244950 5245088    256   32    2 : tunables..
>
> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> ---
>  mm/slub.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 39327e98fce3..c0090a5de54e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
>  		searchnode = numa_mem_id();
>  
>  	object = get_partial_node(s, get_node(s, searchnode), pc);
> -	if (object || node != NUMA_NO_NODE)
> +	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
>  		return object;

I think the problem here is to avoid taking a partial slab from
different node than the requested node even if __GFP_THISNODE is not set.
(and then allocating new slab instead)

Thus this hunk makes sense to me,
even if SLUB currently do not implement __GFP_THISNODE semantics.

>  	return get_any_partial(s, pc);
> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	struct slab *slab;
>  	unsigned long flags;
>  	struct partial_context pc;
> +	int try_thisndoe = 0;
>
>  
>  	stat(s, ALLOC_SLOWPATH);
>  
> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	}
>  
>  new_objects:
> -
>  	pc.flags = gfpflags;
> +
> +	/* Try to get page from specific node even if __GFP_THISNODE is not set */
> +	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
> +			pc.flags |= __GFP_THISNODE;
> +
>  	pc.slab = &slab;
>  	pc.orig_size = orig_size;
>  	freelist = get_partial(s, node, &pc);
> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  		goto check_new_slab;
>  
>  	slub_put_cpu_ptr(s->cpu_slab);
> -	slab = new_slab(s, gfpflags, node);
> +	slab = new_slab(s, pc.flags, node);
>  	c = slub_get_cpu_ptr(s->cpu_slab);
>  
>  	if (unlikely(!slab)) {
> +		/* Try to get page from any other node */
> +		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
> +			try_thisnode = 0;
> +			goto new_objects;
> +		}
> +
>  		slab_out_of_memory(s, gfpflags, node);
>  		return NULL;

But these hunks do not make sense to me.
Why force __GFP_THISNODE even when the caller did not specify it?

(Apart from the fact that try_thisnode is defined as try_thisndoe,
 and try_thisnode is never set to nonzero value.)

IMHO the first hunk is enough to solve the problem.

Thanks,
Hyeonggon

>  	}
> -- 
> 2.17.1
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
  2023-03-07 14:20 ` Hyeonggon Yoo
@ 2023-03-08  7:16   ` chenjun (AM)
  2023-03-08 13:37     ` Hyeonggon Yoo
  0 siblings, 1 reply; 8+ messages in thread
From: chenjun (AM) @ 2023-03-08  7:16 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com,
	penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com,
	akpm@linux-foundation.org, vbabka@suse.cz, xuqiang (M)

Hi,

Thanks for reply.

在 2023/3/7 22:20, Hyeonggon Yoo 写道:
> On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
>> If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
>> Slub will alloc a slub page which is not belong to A, and put the page
>> to kmem_cache_node[page_to_nid(page)]. The page can not be reused
>> at next calling, because NULL will be get from get_partical().
>> That make kmalloc_node consume more memory.
> 
> Hello,
> 
> elaborating a little bit:
> 
> "When kmalloc_node() is called without __GFP_THISNODE and the target node
> lacks sufficient memory, SLUB allocates a folio from a different node other
> than the requested node, instead of taking a partial slab from it.
> 
> However, since the allocated folio does not belong to the requested node,
> it is deactivated and added to the partial slab list of the node it
> belongs to.
> 
> This behavior can result in excessive memory usage when the requested
> node has insufficient memory, as SLUB will repeatedly allocate folios from
> other nodes without reusing the previously allocated ones.
> 
> To prevent memory wastage, take a partial slab from a different node when
> the requested node has no partial slab and __GFP_THISNODE is not explicitly
> specified."
> 

Thanks, This is more clear than what I described.

>> On qemu with 4 numas and each numa has 1G memory, Write a test ko
>> to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.
>>
>> cat /proc/slabinfo shows:
>> kmalloc-256       4302317 15151808    256   32    2 : tunables..
>>
>> the total objects is much more then active objects.
>>
>> After this patch, cat /prac/slubinfo shows:
>> kmalloc-256       5244950 5245088    256   32    2 : tunables..
>>
>> Signed-off-by: Chen Jun <chenjun102@huawei.com>
>> ---
>>   mm/slub.c | 17 ++++++++++++++---
>>   1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 39327e98fce3..c0090a5de54e 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
>>   		searchnode = numa_mem_id();
>>   
>>   	object = get_partial_node(s, get_node(s, searchnode), pc);
>> -	if (object || node != NUMA_NO_NODE)
>> +	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
>>   		return object;
> 
> I think the problem here is to avoid taking a partial slab from
> different node than the requested node even if __GFP_THISNODE is not set.
> (and then allocating new slab instead)
> 
> Thus this hunk makes sense to me,
> even if SLUB currently do not implement __GFP_THISNODE semantics.
> 
>>   	return get_any_partial(s, pc);
>> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>   	struct slab *slab;
>>   	unsigned long flags;
>>   	struct partial_context pc;
>> +	int try_thisndoe = 0;
>>
>>   
>>   	stat(s, ALLOC_SLOWPATH);
>>   
>> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>   	}
>>   
>>   new_objects:
>> -
>>   	pc.flags = gfpflags;
>> +
>> +	/* Try to get page from specific node even if __GFP_THISNODE is not set */
>> +	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
>> +			pc.flags |= __GFP_THISNODE;
>> +
>>   	pc.slab = &slab;
>>   	pc.orig_size = orig_size;
>>   	freelist = get_partial(s, node, &pc);
>> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>   		goto check_new_slab;
>>   
>>   	slub_put_cpu_ptr(s->cpu_slab);
>> -	slab = new_slab(s, gfpflags, node);
>> +	slab = new_slab(s, pc.flags, node);
>>   	c = slub_get_cpu_ptr(s->cpu_slab);
>>   
>>   	if (unlikely(!slab)) {
>> +		/* Try to get page from any other node */
>> +		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
>> +			try_thisnode = 0;
>> +			goto new_objects;
>> +		}
>> +
>>   		slab_out_of_memory(s, gfpflags, node);
>>   		return NULL;
> 
> But these hunks do not make sense to me.
> Why force __GFP_THISNODE even when the caller did not specify it?
> 
> (Apart from the fact that try_thisnode is defined as try_thisndoe,
>   and try_thisnode is never set to nonzero value.)

My mistake, It should be:
int try_thisnode = 0;

> 
> IMHO the first hunk is enough to solve the problem.

I think, we should try to alloc a page on the target node before getting 
one from other nodes' partial.

If the caller does not specify __GFP_THISNODE, we add __GFP_THISNODE to 
try to get the slab only on the target node. If it fails, use the 
original GFP FLAG to allow fallback.

> 
> Thanks,
> Hyeonggon
> 
>>   	}
>> -- 
>> 2.17.1
>>
>>
> 

Thanks,
Chen Jun


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
  2023-03-07  8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
                   ` (2 preceding siblings ...)
  2023-03-07 14:20 ` Hyeonggon Yoo
@ 2023-03-08 10:43 ` kernel test robot
  3 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2023-03-08 10:43 UTC (permalink / raw)
  To: Chen Jun; +Cc: llvm, oe-kbuild-all

Hi Chen,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230307082811.120774-1-chenjun102%40huawei.com
patch subject: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
config: i386-randconfig-a013-20230306 (https://download.01.org/0day-ci/archive/20230308/202303081804.A6mbhXOo-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/d00decb3ccc45b514f25df0d3775153a933ffaa9
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chen-Jun/mm-slub-Reduce-memory-consumption-in-extreme-scenarios/20230307-163216
        git checkout d00decb3ccc45b514f25df0d3775153a933ffaa9
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303081804.A6mbhXOo-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/slub.c:3188:62: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
           if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
                                                                       ^~~~~~~~~~~~
                                                                       try_thisndoe
   include/linux/compiler.h:56:47: note: expanded from macro 'if'
   #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                                                 ^
   include/linux/compiler.h:58:52: note: expanded from macro '__trace_if_var'
   #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                      ^
   mm/slub.c:3072:6: note: 'try_thisndoe' declared here
           int try_thisndoe = 0;
               ^
>> mm/slub.c:3188:62: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
           if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
                                                                       ^~~~~~~~~~~~
                                                                       try_thisndoe
   include/linux/compiler.h:56:47: note: expanded from macro 'if'
   #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                                                 ^
   include/linux/compiler.h:58:61: note: expanded from macro '__trace_if_var'
   #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                               ^
   mm/slub.c:3072:6: note: 'try_thisndoe' declared here
           int try_thisndoe = 0;
               ^
>> mm/slub.c:3188:62: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
           if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
                                                                       ^~~~~~~~~~~~
                                                                       try_thisndoe
   include/linux/compiler.h:56:47: note: expanded from macro 'if'
   #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                                                 ^
   include/linux/compiler.h:58:86: note: expanded from macro '__trace_if_var'
   #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                                                        ^
   include/linux/compiler.h:69:3: note: expanded from macro '__trace_if_value'
           (cond) ?                                        \
            ^
   mm/slub.c:3072:6: note: 'try_thisndoe' declared here
           int try_thisndoe = 0;
               ^
   mm/slub.c:3203:63: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
                   if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
                                                                               ^~~~~~~~~~~~
                                                                               try_thisndoe
   include/linux/compiler.h:56:47: note: expanded from macro 'if'
   #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                                                 ^
   include/linux/compiler.h:58:52: note: expanded from macro '__trace_if_var'
   #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                      ^
   mm/slub.c:3072:6: note: 'try_thisndoe' declared here
           int try_thisndoe = 0;
               ^
   mm/slub.c:3203:63: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
                   if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
                                                                               ^~~~~~~~~~~~
                                                                               try_thisndoe
   include/linux/compiler.h:56:47: note: expanded from macro 'if'
   #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                                                 ^
   include/linux/compiler.h:58:61: note: expanded from macro '__trace_if_var'
   #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                               ^
   mm/slub.c:3072:6: note: 'try_thisndoe' declared here
           int try_thisndoe = 0;
               ^
   mm/slub.c:3203:63: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
                   if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
                                                                               ^~~~~~~~~~~~
                                                                               try_thisndoe
   include/linux/compiler.h:56:47: note: expanded from macro 'if'
   #define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
                                                 ^
   include/linux/compiler.h:58:86: note: expanded from macro '__trace_if_var'
   #define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
                                                                                        ^
   include/linux/compiler.h:69:3: note: expanded from macro '__trace_if_value'
           (cond) ?                                        \
            ^
   mm/slub.c:3072:6: note: 'try_thisndoe' declared here
           int try_thisndoe = 0;
               ^
   mm/slub.c:3204:4: error: use of undeclared identifier 'try_thisnode'; did you mean 'try_thisndoe'?
                           try_thisnode = 0;
                           ^~~~~~~~~~~~
                           try_thisndoe
   mm/slub.c:3072:6: note: 'try_thisndoe' declared here
           int try_thisndoe = 0;
               ^
   7 errors generated.


vim +3188 mm/slub.c

  3045	
  3046	/*
  3047	 * Slow path. The lockless freelist is empty or we need to perform
  3048	 * debugging duties.
  3049	 *
  3050	 * Processing is still very fast if new objects have been freed to the
  3051	 * regular freelist. In that case we simply take over the regular freelist
  3052	 * as the lockless freelist and zap the regular freelist.
  3053	 *
  3054	 * If that is not working then we fall back to the partial lists. We take the
  3055	 * first element of the freelist as the object to allocate now and move the
  3056	 * rest of the freelist to the lockless freelist.
  3057	 *
  3058	 * And if we were unable to get a new slab from the partial slab lists then
  3059	 * we need to allocate a new slab. This is the slowest path since it involves
  3060	 * a call to the page allocator and the setup of a new slab.
  3061	 *
  3062	 * Version of __slab_alloc to use when we know that preemption is
  3063	 * already disabled (which is the case for bulk allocation).
  3064	 */
  3065	static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
  3066				  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
  3067	{
  3068		void *freelist;
  3069		struct slab *slab;
  3070		unsigned long flags;
  3071		struct partial_context pc;
  3072		int try_thisndoe = 0;
  3073	
  3074		stat(s, ALLOC_SLOWPATH);
  3075	
  3076	reread_slab:
  3077	
  3078		slab = READ_ONCE(c->slab);
  3079		if (!slab) {
  3080			/*
  3081			 * if the node is not online or has no normal memory, just
  3082			 * ignore the node constraint
  3083			 */
  3084			if (unlikely(node != NUMA_NO_NODE &&
  3085				     !node_isset(node, slab_nodes)))
  3086				node = NUMA_NO_NODE;
  3087			goto new_slab;
  3088		}
  3089	redo:
  3090	
  3091		if (unlikely(!node_match(slab, node))) {
  3092			/*
  3093			 * same as above but node_match() being false already
  3094			 * implies node != NUMA_NO_NODE
  3095			 */
  3096			if (!node_isset(node, slab_nodes)) {
  3097				node = NUMA_NO_NODE;
  3098			} else {
  3099				stat(s, ALLOC_NODE_MISMATCH);
  3100				goto deactivate_slab;
  3101			}
  3102		}
  3103	
  3104		/*
  3105		 * By rights, we should be searching for a slab page that was
  3106		 * PFMEMALLOC but right now, we are losing the pfmemalloc
  3107		 * information when the page leaves the per-cpu allocator
  3108		 */
  3109		if (unlikely(!pfmemalloc_match(slab, gfpflags)))
  3110			goto deactivate_slab;
  3111	
  3112		/* must check again c->slab in case we got preempted and it changed */
  3113		local_lock_irqsave(&s->cpu_slab->lock, flags);
  3114		if (unlikely(slab != c->slab)) {
  3115			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3116			goto reread_slab;
  3117		}
  3118		freelist = c->freelist;
  3119		if (freelist)
  3120			goto load_freelist;
  3121	
  3122		freelist = get_freelist(s, slab);
  3123	
  3124		if (!freelist) {
  3125			c->slab = NULL;
  3126			c->tid = next_tid(c->tid);
  3127			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3128			stat(s, DEACTIVATE_BYPASS);
  3129			goto new_slab;
  3130		}
  3131	
  3132		stat(s, ALLOC_REFILL);
  3133	
  3134	load_freelist:
  3135	
  3136		lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
  3137	
  3138		/*
  3139		 * freelist is pointing to the list of objects to be used.
  3140		 * slab is pointing to the slab from which the objects are obtained.
  3141		 * That slab must be frozen for per cpu allocations to work.
  3142		 */
  3143		VM_BUG_ON(!c->slab->frozen);
  3144		c->freelist = get_freepointer(s, freelist);
  3145		c->tid = next_tid(c->tid);
  3146		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3147		return freelist;
  3148	
  3149	deactivate_slab:
  3150	
  3151		local_lock_irqsave(&s->cpu_slab->lock, flags);
  3152		if (slab != c->slab) {
  3153			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3154			goto reread_slab;
  3155		}
  3156		freelist = c->freelist;
  3157		c->slab = NULL;
  3158		c->freelist = NULL;
  3159		c->tid = next_tid(c->tid);
  3160		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3161		deactivate_slab(s, slab, freelist);
  3162	
  3163	new_slab:
  3164	
  3165		if (slub_percpu_partial(c)) {
  3166			local_lock_irqsave(&s->cpu_slab->lock, flags);
  3167			if (unlikely(c->slab)) {
  3168				local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3169				goto reread_slab;
  3170			}
  3171			if (unlikely(!slub_percpu_partial(c))) {
  3172				local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3173				/* we were preempted and partial list got empty */
  3174				goto new_objects;
  3175			}
  3176	
  3177			slab = c->slab = slub_percpu_partial(c);
  3178			slub_set_percpu_partial(c, slab);
  3179			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3180			stat(s, CPU_PARTIAL_ALLOC);
  3181			goto redo;
  3182		}
  3183	
  3184	new_objects:
  3185		pc.flags = gfpflags;
  3186	
  3187		/* Try to get page from specific node even if __GFP_THISNODE is not set */
> 3188		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
  3189				pc.flags |= __GFP_THISNODE;
  3190	
  3191		pc.slab = &slab;
  3192		pc.orig_size = orig_size;
  3193		freelist = get_partial(s, node, &pc);
  3194		if (freelist)
  3195			goto check_new_slab;
  3196	
  3197		slub_put_cpu_ptr(s->cpu_slab);
  3198		slab = new_slab(s, pc.flags, node);
  3199		c = slub_get_cpu_ptr(s->cpu_slab);
  3200	
  3201		if (unlikely(!slab)) {
  3202			/* Try to get page from any other node */
  3203			if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
  3204				try_thisnode = 0;
  3205				goto new_objects;
  3206			}
  3207	
  3208			slab_out_of_memory(s, gfpflags, node);
  3209			return NULL;
  3210		}
  3211	
  3212		stat(s, ALLOC_SLAB);
  3213	
  3214		if (kmem_cache_debug(s)) {
  3215			freelist = alloc_single_from_new_slab(s, slab, orig_size);
  3216	
  3217			if (unlikely(!freelist))
  3218				goto new_objects;
  3219	
  3220			if (s->flags & SLAB_STORE_USER)
  3221				set_track(s, freelist, TRACK_ALLOC, addr);
  3222	
  3223			return freelist;
  3224		}
  3225	
  3226		/*
  3227		 * No other reference to the slab yet so we can
  3228		 * muck around with it freely without cmpxchg
  3229		 */
  3230		freelist = slab->freelist;
  3231		slab->freelist = NULL;
  3232		slab->inuse = slab->objects;
  3233		slab->frozen = 1;
  3234	
  3235		inc_slabs_node(s, slab_nid(slab), slab->objects);
  3236	
  3237	check_new_slab:
  3238	
  3239		if (kmem_cache_debug(s)) {
  3240			/*
  3241			 * For debug caches here we had to go through
  3242			 * alloc_single_from_partial() so just store the tracking info
  3243			 * and return the object
  3244			 */
  3245			if (s->flags & SLAB_STORE_USER)
  3246				set_track(s, freelist, TRACK_ALLOC, addr);
  3247	
  3248			return freelist;
  3249		}
  3250	
  3251		if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
  3252			/*
  3253			 * For !pfmemalloc_match() case we don't load freelist so that
  3254			 * we don't make further mismatched allocations easier.
  3255			 */
  3256			deactivate_slab(s, slab, get_freepointer(s, freelist));
  3257			return freelist;
  3258		}
  3259	
  3260	retry_load_slab:
  3261	
  3262		local_lock_irqsave(&s->cpu_slab->lock, flags);
  3263		if (unlikely(c->slab)) {
  3264			void *flush_freelist = c->freelist;
  3265			struct slab *flush_slab = c->slab;
  3266	
  3267			c->slab = NULL;
  3268			c->freelist = NULL;
  3269			c->tid = next_tid(c->tid);
  3270	
  3271			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  3272	
  3273			deactivate_slab(s, flush_slab, flush_freelist);
  3274	
  3275			stat(s, CPUSLAB_FLUSH);
  3276	
  3277			goto retry_load_slab;
  3278		}
  3279		c->slab = slab;
  3280	
  3281		goto load_freelist;
  3282	}
  3283	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
  2023-03-08  7:16   ` chenjun (AM)
@ 2023-03-08 13:37     ` Hyeonggon Yoo
  2023-03-09  2:15       ` chenjun (AM)
  0 siblings, 1 reply; 8+ messages in thread
From: Hyeonggon Yoo @ 2023-03-08 13:37 UTC (permalink / raw)
  To: chenjun (AM)
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com,
	penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com,
	akpm@linux-foundation.org, vbabka@suse.cz, xuqiang (M)

On Wed, Mar 08, 2023 at 07:16:49AM +0000, chenjun (AM) wrote:
> Hi,
> 
> Thanks for reply.
> 
> 在 2023/3/7 22:20, Hyeonggon Yoo 写道:
> > On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
> >> If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
> >> Slub will alloc a slub page which is not belong to A, and put the page
> >> to kmem_cache_node[page_to_nid(page)]. The page can not be reused
> >> at next calling, because NULL will be get from get_partical().
> >> That make kmalloc_node consume more memory.
> > 
> > Hello,
> > 
> > elaborating a little bit:
> > 
> > "When kmalloc_node() is called without __GFP_THISNODE and the target node
> > lacks sufficient memory, SLUB allocates a folio from a different node other
> > than the requested node, instead of taking a partial slab from it.
> > 
> > However, since the allocated folio does not belong to the requested node,
> > it is deactivated and added to the partial slab list of the node it
> > belongs to.
> > 
> > This behavior can result in excessive memory usage when the requested
> > node has insufficient memory, as SLUB will repeatedly allocate folios from
> > other nodes without reusing the previously allocated ones.
> > 
> > To prevent memory wastage, take a partial slab from a different node when
> > the requested node has no partial slab and __GFP_THISNODE is not explicitly
> > specified."
> > 
> 
> Thanks, This is more clear than what I described.
> 
> >> On qemu with 4 numas and each numa has 1G memory, Write a test ko
> >> to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.
> >>
> >> cat /proc/slabinfo shows:
> >> kmalloc-256       4302317 15151808    256   32    2 : tunables..
> >>
> >> the total objects is much more then active objects.
> >>
> >> After this patch, cat /prac/slubinfo shows:
> >> kmalloc-256       5244950 5245088    256   32    2 : tunables..
> >>
> >> Signed-off-by: Chen Jun <chenjun102@huawei.com>
> >> ---
> >>   mm/slub.c | 17 ++++++++++++++---
> >>   1 file changed, 14 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 39327e98fce3..c0090a5de54e 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
> >>   		searchnode = numa_mem_id();
> >>   
> >>   	object = get_partial_node(s, get_node(s, searchnode), pc);
> >> -	if (object || node != NUMA_NO_NODE)
> >> +	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
> >>   		return object;
> > 
> > I think the problem here is to avoid taking a partial slab from
> > different node than the requested node even if __GFP_THISNODE is not set.
> > (and then allocating new slab instead)
> > 
> > Thus this hunk makes sense to me,
> > even if SLUB currently do not implement __GFP_THISNODE semantics.
> > 
> >>   	return get_any_partial(s, pc);
> >> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >>   	struct slab *slab;
> >>   	unsigned long flags;
> >>   	struct partial_context pc;
> >> +	int try_thisndoe = 0;
> >>
> >>   
> >>   	stat(s, ALLOC_SLOWPATH);
> >>   
> >> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >>   	}
> >>   
> >>   new_objects:
> >> -
> >>   	pc.flags = gfpflags;
> >> +
> >> +	/* Try to get page from specific node even if __GFP_THISNODE is not set */
> >> +	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
> >> +			pc.flags |= __GFP_THISNODE;
> >> +
> >>   	pc.slab = &slab;
> >>   	pc.orig_size = orig_size;
> >>   	freelist = get_partial(s, node, &pc);
> >> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >>   		goto check_new_slab;
> >>   
> >>   	slub_put_cpu_ptr(s->cpu_slab);
> >> -	slab = new_slab(s, gfpflags, node);
> >> +	slab = new_slab(s, pc.flags, node);
> >>   	c = slub_get_cpu_ptr(s->cpu_slab);
> >>   
> >>   	if (unlikely(!slab)) {
> >> +		/* Try to get page from any other node */
> >> +		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
> >> +			try_thisnode = 0;
> >> +			goto new_objects;
> >> +		}
> >> +
> >>   		slab_out_of_memory(s, gfpflags, node);
> >>   		return NULL;
> > 
> > But these hunks do not make sense to me.
> > Why force __GFP_THISNODE even when the caller did not specify it?
> > 
> > (Apart from the fact that try_thisnode is defined as try_thisndoe,
> >   and try_thisnode is never set to nonzero value.)
> 
> My mistake, It should be:
> int try_thisnode = 0;

I think it should be try_thisnode = 1?
Otherwise it won't be executed at all.
Also bool type will be more readable than int.

> 
> > 
> > IMHO the first hunk is enough to solve the problem.
> 
> I think, we should try to alloc a page on the target node before getting 
> one from other nodes' partial.

You are right.

Hmm then the new behavior when 
(node != NUMA_NO_NODE) && (gfpflags & __GFP_THISNODE) is:

1) try to get a partial slab from target node with __GFP_THISNODE
2) if 1) failed, try to allocate a new slab from target node with __GFP_THISNODE
3) if 2) failed, retry 1) and 2) without __GFP_THISNODE constraint

when node != NUMA_NO_NODE || (gfpflags & __GFP_THISNODE), the behavior
remains unchanged.

It does not look that crazy to me, although it complicates the code
a little bit. (Vlastimil may have some opinions?)

Now that I understand your intention, I think this behavior change also
need to be added to the commit log.

Thanks,
Hyeonggon

> If the caller does not specify __GFP_THISNODE, we add __GFP_THISNODE to 
> try to get the slab only on the target node. If it fails, use the 
> original GFP FLAG to allow fallback.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
  2023-03-08 13:37     ` Hyeonggon Yoo
@ 2023-03-09  2:15       ` chenjun (AM)
  0 siblings, 0 replies; 8+ messages in thread
From: chenjun (AM) @ 2023-03-09  2:15 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com,
	penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com,
	akpm@linux-foundation.org, vbabka@suse.cz, xuqiang (M)

在 2023/3/8 21:37, Hyeonggon Yoo 写道:
> On Wed, Mar 08, 2023 at 07:16:49AM +0000, chenjun (AM) wrote:
>> Hi,
>>
>> Thanks for reply.
>>
>> 在 2023/3/7 22:20, Hyeonggon Yoo 写道:
>>> On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
>>>> If call kmalloc_node with NO __GFP_THISNODE and node[A] with no memory.
>>>> Slub will alloc a slub page which is not belong to A, and put the page
>>>> to kmem_cache_node[page_to_nid(page)]. The page can not be reused
>>>> at next calling, because NULL will be get from get_partical().
>>>> That make kmalloc_node consume more memory.
>>>
>>> Hello,
>>>
>>> elaborating a little bit:
>>>
>>> "When kmalloc_node() is called without __GFP_THISNODE and the target node
>>> lacks sufficient memory, SLUB allocates a folio from a different node other
>>> than the requested node, instead of taking a partial slab from it.
>>>
>>> However, since the allocated folio does not belong to the requested node,
>>> it is deactivated and added to the partial slab list of the node it
>>> belongs to.
>>>
>>> This behavior can result in excessive memory usage when the requested
>>> node has insufficient memory, as SLUB will repeatedly allocate folios from
>>> other nodes without reusing the previously allocated ones.
>>>
>>> To prevent memory wastage, take a partial slab from a different node when
>>> the requested node has no partial slab and __GFP_THISNODE is not explicitly
>>> specified."
>>>
>>
>> Thanks, This is more clear than what I described.
>>
>>>> On qemu with 4 numas and each numa has 1G memory, Write a test ko
>>>> to call kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times.
>>>>
>>>> cat /proc/slabinfo shows:
>>>> kmalloc-256       4302317 15151808    256   32    2 : tunables..
>>>>
>>>> the total objects is much more then active objects.
>>>>
>>>> After this patch, cat /prac/slubinfo shows:
>>>> kmalloc-256       5244950 5245088    256   32    2 : tunables..
>>>>
>>>> Signed-off-by: Chen Jun <chenjun102@huawei.com>
>>>> ---
>>>>    mm/slub.c | 17 ++++++++++++++---
>>>>    1 file changed, 14 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/slub.c b/mm/slub.c
>>>> index 39327e98fce3..c0090a5de54e 100644
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
>>>>    		searchnode = numa_mem_id();
>>>>    
>>>>    	object = get_partial_node(s, get_node(s, searchnode), pc);
>>>> -	if (object || node != NUMA_NO_NODE)
>>>> +	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
>>>>    		return object;
>>>
>>> I think the problem here is to avoid taking a partial slab from
>>> different node than the requested node even if __GFP_THISNODE is not set.
>>> (and then allocating new slab instead)
>>>
>>> Thus this hunk makes sense to me,
>>> even if SLUB currently do not implement __GFP_THISNODE semantics.
>>>
>>>>    	return get_any_partial(s, pc);
>>>> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>>    	struct slab *slab;
>>>>    	unsigned long flags;
>>>>    	struct partial_context pc;
>>>> +	int try_thisndoe = 0;
>>>>
>>>>    
>>>>    	stat(s, ALLOC_SLOWPATH);
>>>>    
>>>> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>>    	}
>>>>    
>>>>    new_objects:
>>>> -
>>>>    	pc.flags = gfpflags;
>>>> +
>>>> +	/* Try to get page from specific node even if __GFP_THISNODE is not set */
>>>> +	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
>>>> +			pc.flags |= __GFP_THISNODE;
>>>> +

Any suggestions to make the change more elegant?

>>>>    	pc.slab = &slab;
>>>>    	pc.orig_size = orig_size;
>>>>    	freelist = get_partial(s, node, &pc);
>>>> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>>    		goto check_new_slab;
>>>>    
>>>>    	slub_put_cpu_ptr(s->cpu_slab);
>>>> -	slab = new_slab(s, gfpflags, node);
>>>> +	slab = new_slab(s, pc.flags, node);
>>>>    	c = slub_get_cpu_ptr(s->cpu_slab);
>>>>    
>>>>    	if (unlikely(!slab)) {
>>>> +		/* Try to get page from any other node */
>>>> +		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
>>>> +			try_thisnode = 0;
>>>> +			goto new_objects;
>>>> +		}
>>>> +
>>>>    		slab_out_of_memory(s, gfpflags, node);
>>>>    		return NULL;
>>>
>>> But these hunks do not make sense to me.
>>> Why force __GFP_THISNODE even when the caller did not specify it?
>>>
>>> (Apart from the fact that try_thisnode is defined as try_thisndoe,
>>>    and try_thisnode is never set to nonzero value.)
>>
>> My mistake, It should be:
>> int try_thisnode = 0;
> 
> I think it should be try_thisnode = 1?
> Otherwise it won't be executed at all.
> Also bool type will be more readable than int.
> 
>>
>>>
>>> IMHO the first hunk is enough to solve the problem.
>>
>> I think, we should try to alloc a page on the target node before getting
>> one from other nodes' partial.
> 
> You are right.
> 
> Hmm then the new behavior when
> (node != NUMA_NO_NODE) && (gfpflags & __GFP_THISNODE) is:
> 
> 1) try to get a partial slab from target node with __GFP_THISNODE
> 2) if 1) failed, try to allocate a new slab from target node with __GFP_THISNODE
> 3) if 2) failed, retry 1) and 2) without __GFP_THISNODE constraint
> 
> when node != NUMA_NO_NODE || (gfpflags & __GFP_THISNODE), the behavior
> remains unchanged.
> 
> It does not look that crazy to me, although it complicates the code
> a little bit. (Vlastimil may have some opinions?)
> 
> Now that I understand your intention, I think this behavior change also
> need to be added to the commit log.
> 

I will add it.

> Thanks,
> Hyeonggon
> 
>> If the caller does not specify __GFP_THISNODE, we add __GFP_THISNODE to
>> try to get the slab only on the target node. If it fails, use the
>> original GFP FLAG to allow fallback.
> 

If there are no other questions, I will send an official patch.



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-03-09  2:16 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-07  8:28 [RFC] mm/slub: Reduce memory consumption in extreme scenarios Chen Jun
2023-03-07 10:30 ` kernel test robot
2023-03-07 10:50 ` kernel test robot
2023-03-07 14:20 ` Hyeonggon Yoo
2023-03-08  7:16   ` chenjun (AM)
2023-03-08 13:37     ` Hyeonggon Yoo
2023-03-09  2:15       ` chenjun (AM)
2023-03-08 10:43 ` kernel test robot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.