[PATCH v5 0/9] slub: Delay freezing of CPU partial slabs

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs
@ 2023-11-02  3:23 chengming.zhou
  2023-11-02  3:23 ` [PATCH v5 1/9] slub: Reflow ___slab_alloc() chengming.zhou
                   ` (9 more replies)
  0 siblings, 10 replies; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

Changes in v5:
 - Drop "RFC".
 - Retest to update performance numbers (little difference with RFC v1).
 - Add Reviewed-by and Tested-by tags. Many thanks!
 - Change to better function name: __put_partials().
 - Some minor improvements of comments and changelog.
 - RFC v4: https://lore.kernel.org/all/20231031140741.79387-1-chengming.zhou@linux.dev/

Changes in RFC v4:
 - Reorder patches to put the two cleanup patches to the front.
 - Move slab_node_partial flag functions to mm/slub.c.
 - Fix freeze_slab() by using slab_update_freelist().
 - Fix build error when !CONFIG_SLUB_CPU_PARTIAL.
 - Add a patch to rename all *unfreeze_partials* functions.
 - Add a patch to update inconsistent documentations in the source.
 - Some comments and changelog improvements.
 - Add Reviewed-by and Suggested-by tags. Many thanks!
 - RFC v3: https://lore.kernel.org/all/20231024093345.3676493-1-chengming.zhou@linux.dev/

Changes in RFC v3:
 - Directly use __set_bit() and __clear_bit() for the slab_node_partial
   flag operations to avoid exporting non-atomic "workingset" interfaces.
 - Change get_partial() related functions to return a slab instead of
   returning the freelist or single object.
 - Don't freeze any slab under the node list_lock to further reduce
   list_lock holding times, as suggested by Vlastimil Babka.
 - Introduce freeze_slab() to do the delay freezing and return freelist.
 - Reorder patches.
 - RFC v2: https://lore.kernel.org/all/20231021144317.3400916-1-chengming.zhou@linux.dev/

Changes in RFC v2:
 - Reuse PG_workingset bit to keep track of whether slub is on the
   per-node partial list, as suggested by Matthew Wilcox.
 - Fix OOM problem on kernel without CONFIG_SLUB_CPU_PARTIAL, which
   is caused by leak of partial slabs when get_partial_node().
 - Add a patch to simplify acquire_slab().
 - Reorder patches a little.
 - RFC v1: https://lore.kernel.org/all/20231017154439.3036608-1-chengming.zhou@linux.dev/

1. Problem
==========
Now we have to freeze the slab when get from the node partial list, and
unfreeze the slab when put to the node partial list. Because we need to
rely on the node list_lock to synchronize the "frozen" bit changes.

This implementation has some drawbacks:

 - Alloc path: twice cmpxchg_double.
   It has to get some partial slabs from node when the allocator has used
   up the CPU partial slabs. So it freeze the slab (one cmpxchg_double)
   with node list_lock held, put those frozen slabs on its CPU partial
   list. Later ___slab_alloc() will cmpxchg_double try-loop again if that
   slab is picked to use.

 - Alloc path: amplified contention on node list_lock.
   Since we have to synchronize the "frozen" bit changes under the node
   list_lock, the contention of slab (struct page) can be transferred
   to the node list_lock. On machine with many CPUs in one node, the
   contention of list_lock will be amplified by all CPUs' alloc path.

   The current code has to workaround this problem by avoiding using
   cmpxchg_double try-loop, which will just break and return when
   contention of page encountered and the first cmpxchg_double failed.
   But this workaround has its own problem. For more context, see
   9b1ea29bc0d7 ("Revert "mm, slub: consider rest of partial list if
   acquire_slab() fails"").

 - Free path: redundant unfreeze.
   __slab_free() will freeze and cache some slabs on its partial list,
   and flush them to the node partial list when exceed, which has to
   unfreeze those slabs again under the node list_lock. Actually we
   don't need to freeze slab on CPU partial list, in which case we
   can save the unfreeze cmpxchg_double operations in flush path.

2. Solution
===========
We solve these problems by leaving slabs unfrozen when moving out of
the node partial list and on CPU partial list, so "frozen" bit is 0.

These partial slabs won't be manipulate concurrently by alloc path,
the only racer is free path, which may manipulate its list when !inuse.
So we need to introduce another synchronization way to avoid it, we
reuse PG_workingset to keep track of whether the slab is on node partial
list or not, only in that case we can manipulate the slab list.

The slab will be delay frozen when it's picked to actively use by the
CPU, it becomes full at the same time, in which case we still need to
rely on "frozen" bit to avoid manipulating its list. So the slab will
be frozen only when activate use and be unfrozen only when deactivate.

The current updated scheme (which this series implemented) is:
 - node partial slabs: PG_Workingset && !frozen
 - cpu partial slabs: !PG_Workingset && !frozen
 - cpu slabs: !PG_Workingset && frozen
 - full slabs: !PG_Workingset && !frozen

The most important change is that "frozen" bit is not set for the cpu
partial slabs anymore, __slab_free() will grab node list_lock then
check by !PG_Workingset that it's not on a node partial list.

And the "frozen" bit is still kept for the cpu slabs for performance,
since we don't need to grab node list_lock to check whether PG_Workingset
is set or not if the "frozen" bit is set in the __slab_free().

3. Testing
==========
We did some simple testing on a server with 128 CPUs (2 nodes) to compare
performance.

 - perf bench sched messaging -g 5 -t -l 100000
   baseline	v5
   7.042s	6.934s
   7.022s	6.865s
   7.054s	7.009s

 - stress-ng --rawpkt 128 --rawpkt-ops 100000000
   baseline	v5
   2.42s	2.18s
   2.45s	2.16s
   2.44s	2.17s

It shows above there is about 10% improvement on stress-ng rawpkt
testcase, although no much improvement on perf sched bench testcase.

Thanks for any comment and code review!

Chengming Zhou (9):
  slub: Reflow ___slab_alloc()
  slub: Change get_partial() interfaces to return slab
  slub: Keep track of whether slub is on the per-node partial list
  slub: Prepare __slab_free() for unfrozen partial slab out of node
    partial list
  slub: Introduce freeze_slab()
  slub: Delay freezing of partial slabs
  slub: Optimize deactivate_slab()
  slub: Rename all *unfreeze_partials* functions to *put_partials*
  slub: Update frozen slabs documentations in the source

 mm/slub.c | 384 +++++++++++++++++++++++++-----------------------------
 1 file changed, 180 insertions(+), 204 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 1/9] slub: Reflow ___slab_alloc()
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
@ 2023-11-02  3:23 ` chengming.zhou
  2023-11-22  0:26   ` Hyeonggon Yoo
  2023-11-02  3:23 ` [PATCH v5 2/9] slub: Change get_partial() interfaces to return slab chengming.zhou
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

The get_partial() interface used in ___slab_alloc() may return a single
object in the "kmem_cache_debug(s)" case, in which we will just return
the "freelist" object.

Move this handling up to prepare for later changes.

And the "pfmemalloc_match()" part is not needed for node partial slab,
since we already check this in the get_partial_node().

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 63d281dfacdb..0b0fdc8c189f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3216,8 +3216,21 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	pc.slab = &slab;
 	pc.orig_size = orig_size;
 	freelist = get_partial(s, node, &pc);
-	if (freelist)
-		goto check_new_slab;
+	if (freelist) {
+		if (kmem_cache_debug(s)) {
+			/*
+			 * For debug caches here we had to go through
+			 * alloc_single_from_partial() so just store the
+			 * tracking info and return the object.
+			 */
+			if (s->flags & SLAB_STORE_USER)
+				set_track(s, freelist, TRACK_ALLOC, addr);
+
+			return freelist;
+		}
+
+		goto retry_load_slab;
+	}
 
 	slub_put_cpu_ptr(s->cpu_slab);
 	slab = new_slab(s, gfpflags, node);
@@ -3253,20 +3266,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 	inc_slabs_node(s, slab_nid(slab), slab->objects);
 
-check_new_slab:
-
-	if (kmem_cache_debug(s)) {
-		/*
-		 * For debug caches here we had to go through
-		 * alloc_single_from_partial() so just store the tracking info
-		 * and return the object
-		 */
-		if (s->flags & SLAB_STORE_USER)
-			set_track(s, freelist, TRACK_ALLOC, addr);
-
-		return freelist;
-	}
-
 	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
 		/*
 		 * For !pfmemalloc_match() case we don't load freelist so that
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 1/9] slub: Reflow ___slab_alloc()
  2023-11-02  3:23 ` [PATCH v5 1/9] slub: Reflow ___slab_alloc() chengming.zhou
@ 2023-11-22  0:26   ` Hyeonggon Yoo
  0 siblings, 0 replies; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-11-22  0:26 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Thu, Nov 2, 2023 at 12:24 PM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> The get_partial() interface used in ___slab_alloc() may return a single
> object in the "kmem_cache_debug(s)" case, in which we will just return
> the "freelist" object.
>
> Move this handling up to prepare for later changes.
>
> And the "pfmemalloc_match()" part is not needed for node partial slab,
> since we already check this in the get_partial_node().
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 31 +++++++++++++++----------------
>  1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 63d281dfacdb..0b0fdc8c189f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3216,8 +3216,21 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>         pc.slab = &slab;
>         pc.orig_size = orig_size;
>         freelist = get_partial(s, node, &pc);
> -       if (freelist)
> -               goto check_new_slab;
> +       if (freelist) {
> +               if (kmem_cache_debug(s)) {
> +                       /*
> +                        * For debug caches here we had to go through
> +                        * alloc_single_from_partial() so just store the
> +                        * tracking info and return the object.
> +                        */
> +                       if (s->flags & SLAB_STORE_USER)
> +                               set_track(s, freelist, TRACK_ALLOC, addr);
> +
> +                       return freelist;
> +               }
> +
> +               goto retry_load_slab;
> +       }
>
>         slub_put_cpu_ptr(s->cpu_slab);
>         slab = new_slab(s, gfpflags, node);
> @@ -3253,20 +3266,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
>         inc_slabs_node(s, slab_nid(slab), slab->objects);
>
> -check_new_slab:
> -
> -       if (kmem_cache_debug(s)) {
> -               /*
> -                * For debug caches here we had to go through
> -                * alloc_single_from_partial() so just store the tracking info
> -                * and return the object
> -                */
> -               if (s->flags & SLAB_STORE_USER)
> -                       set_track(s, freelist, TRACK_ALLOC, addr);
> -
> -               return freelist;
> -       }
> -
>         if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
>                 /*
>                  * For !pfmemalloc_match() case we don't load freelist so that

Looks good to me,
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 2/9] slub: Change get_partial() interfaces to return slab
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
  2023-11-02  3:23 ` [PATCH v5 1/9] slub: Reflow ___slab_alloc() chengming.zhou
@ 2023-11-02  3:23 ` chengming.zhou
  2023-11-22  1:09   ` Hyeonggon Yoo
  2023-11-02  3:23 ` [PATCH v5 3/9] slub: Keep track of whether slub is on the per-node partial list chengming.zhou
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

We need all get_partial() related interfaces to return a slab, instead
of returning the freelist (or object).

Use the partial_context.object to return back freelist or object for
now. This patch shouldn't have any functional changes.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 63 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 33 insertions(+), 30 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 0b0fdc8c189f..03384cd965c5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -204,9 +204,9 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
 
 /* Structure holding parameters for get_partial() call chain */
 struct partial_context {
-	struct slab **slab;
 	gfp_t flags;
 	unsigned int orig_size;
+	void *object;
 };
 
 static inline bool kmem_cache_debug(struct kmem_cache *s)
@@ -2269,10 +2269,11 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
 /*
  * Try to allocate a partial slab from a specific node.
  */
-static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
-			      struct partial_context *pc)
+static struct slab *get_partial_node(struct kmem_cache *s,
+				     struct kmem_cache_node *n,
+				     struct partial_context *pc)
 {
-	struct slab *slab, *slab2;
+	struct slab *slab, *slab2, *partial = NULL;
 	void *object = NULL;
 	unsigned long flags;
 	unsigned int partial_slabs = 0;
@@ -2288,27 +2289,28 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 
 	spin_lock_irqsave(&n->list_lock, flags);
 	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
-		void *t;
-
 		if (!pfmemalloc_match(slab, pc->flags))
 			continue;
 
 		if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
 			object = alloc_single_from_partial(s, n, slab,
 							pc->orig_size);
-			if (object)
+			if (object) {
+				partial = slab;
+				pc->object = object;
 				break;
+			}
 			continue;
 		}
 
-		t = acquire_slab(s, n, slab, object == NULL);
-		if (!t)
+		object = acquire_slab(s, n, slab, object == NULL);
+		if (!object)
 			break;
 
-		if (!object) {
-			*pc->slab = slab;
+		if (!partial) {
+			partial = slab;
+			pc->object = object;
 			stat(s, ALLOC_FROM_PARTIAL);
-			object = t;
 		} else {
 			put_cpu_partial(s, slab, 0);
 			stat(s, CPU_PARTIAL_NODE);
@@ -2324,20 +2326,21 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 
 	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
-	return object;
+	return partial;
 }
 
 /*
  * Get a slab from somewhere. Search in increasing NUMA distances.
  */
-static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
+static struct slab *get_any_partial(struct kmem_cache *s,
+				    struct partial_context *pc)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
 	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type highest_zoneidx = gfp_zone(pc->flags);
-	void *object;
+	struct slab *slab;
 	unsigned int cpuset_mems_cookie;
 
 	/*
@@ -2372,8 +2375,8 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 
 			if (n && cpuset_zone_allowed(zone, pc->flags) &&
 					n->nr_partial > s->min_partial) {
-				object = get_partial_node(s, n, pc);
-				if (object) {
+				slab = get_partial_node(s, n, pc);
+				if (slab) {
 					/*
 					 * Don't check read_mems_allowed_retry()
 					 * here - if mems_allowed was updated in
@@ -2381,7 +2384,7 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 					 * between allocation and the cpuset
 					 * update
 					 */
-					return object;
+					return slab;
 				}
 			}
 		}
@@ -2393,17 +2396,18 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
 /*
  * Get a partial slab, lock it and return it.
  */
-static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
+static struct slab *get_partial(struct kmem_cache *s, int node,
+				struct partial_context *pc)
 {
-	void *object;
+	struct slab *slab;
 	int searchnode = node;
 
 	if (node == NUMA_NO_NODE)
 		searchnode = numa_mem_id();
 
-	object = get_partial_node(s, get_node(s, searchnode), pc);
-	if (object || node != NUMA_NO_NODE)
-		return object;
+	slab = get_partial_node(s, get_node(s, searchnode), pc);
+	if (slab || node != NUMA_NO_NODE)
+		return slab;
 
 	return get_any_partial(s, pc);
 }
@@ -3213,10 +3217,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 new_objects:
 
 	pc.flags = gfpflags;
-	pc.slab = &slab;
 	pc.orig_size = orig_size;
-	freelist = get_partial(s, node, &pc);
-	if (freelist) {
+	slab = get_partial(s, node, &pc);
+	if (slab) {
+		freelist = pc.object;
 		if (kmem_cache_debug(s)) {
 			/*
 			 * For debug caches here we had to go through
@@ -3408,12 +3412,11 @@ static void *__slab_alloc_node(struct kmem_cache *s,
 	void *object;
 
 	pc.flags = gfpflags;
-	pc.slab = &slab;
 	pc.orig_size = orig_size;
-	object = get_partial(s, node, &pc);
+	slab = get_partial(s, node, &pc);
 
-	if (object)
-		return object;
+	if (slab)
+		return pc.object;
 
 	slab = new_slab(s, gfpflags, node);
 	if (unlikely(!slab)) {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 2/9] slub: Change get_partial() interfaces to return slab
  2023-11-02  3:23 ` [PATCH v5 2/9] slub: Change get_partial() interfaces to return slab chengming.zhou
@ 2023-11-22  1:09   ` Hyeonggon Yoo
  0 siblings, 0 replies; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-11-22  1:09 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Thu, Nov 2, 2023 at 12:24 PM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> We need all get_partial() related interfaces to return a slab, instead
> of returning the freelist (or object).
>
> Use the partial_context.object to return back freelist or object for
> now. This patch shouldn't have any functional changes.
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 63 +++++++++++++++++++++++++++++--------------------------
>  1 file changed, 33 insertions(+), 30 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 0b0fdc8c189f..03384cd965c5 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -204,9 +204,9 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
>
>  /* Structure holding parameters for get_partial() call chain */
>  struct partial_context {
> -       struct slab **slab;
>         gfp_t flags;
>         unsigned int orig_size;
> +       void *object;
>  };
>
>  static inline bool kmem_cache_debug(struct kmem_cache *s)
> @@ -2269,10 +2269,11 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
>  /*
>   * Try to allocate a partial slab from a specific node.
>   */
> -static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
> -                             struct partial_context *pc)
> +static struct slab *get_partial_node(struct kmem_cache *s,
> +                                    struct kmem_cache_node *n,
> +                                    struct partial_context *pc)
>  {
> -       struct slab *slab, *slab2;
> +       struct slab *slab, *slab2, *partial = NULL;
>         void *object = NULL;
>         unsigned long flags;
>         unsigned int partial_slabs = 0;
> @@ -2288,27 +2289,28 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
>
>         spin_lock_irqsave(&n->list_lock, flags);
>         list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> -               void *t;
> -
>                 if (!pfmemalloc_match(slab, pc->flags))
>                         continue;
>
>                 if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
>                         object = alloc_single_from_partial(s, n, slab,
>                                                         pc->orig_size);
> -                       if (object)
> +                       if (object) {
> +                               partial = slab;
> +                               pc->object = object;
>                                 break;
> +                       }
>                         continue;
>                 }
>
> -               t = acquire_slab(s, n, slab, object == NULL);
> -               if (!t)
> +               object = acquire_slab(s, n, slab, object == NULL);
> +               if (!object)
>                         break;
>
> -               if (!object) {
> -                       *pc->slab = slab;
> +               if (!partial) {
> +                       partial = slab;
> +                       pc->object = object;
>                         stat(s, ALLOC_FROM_PARTIAL);
> -                       object = t;
>                 } else {
>                         put_cpu_partial(s, slab, 0);
>                         stat(s, CPU_PARTIAL_NODE);
> @@ -2324,20 +2326,21 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
>
>         }
>         spin_unlock_irqrestore(&n->list_lock, flags);
> -       return object;
> +       return partial;
>  }
>
>  /*
>   * Get a slab from somewhere. Search in increasing NUMA distances.
>   */
> -static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
> +static struct slab *get_any_partial(struct kmem_cache *s,
> +                                   struct partial_context *pc)
>  {
>  #ifdef CONFIG_NUMA
>         struct zonelist *zonelist;
>         struct zoneref *z;
>         struct zone *zone;
>         enum zone_type highest_zoneidx = gfp_zone(pc->flags);
> -       void *object;
> +       struct slab *slab;
>         unsigned int cpuset_mems_cookie;
>
>         /*
> @@ -2372,8 +2375,8 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
>
>                         if (n && cpuset_zone_allowed(zone, pc->flags) &&
>                                         n->nr_partial > s->min_partial) {
> -                               object = get_partial_node(s, n, pc);
> -                               if (object) {
> +                               slab = get_partial_node(s, n, pc);
> +                               if (slab) {
>                                         /*
>                                          * Don't check read_mems_allowed_retry()
>                                          * here - if mems_allowed was updated in
> @@ -2381,7 +2384,7 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
>                                          * between allocation and the cpuset
>                                          * update
>                                          */
> -                                       return object;
> +                                       return slab;
>                                 }
>                         }
>                 }
> @@ -2393,17 +2396,18 @@ static void *get_any_partial(struct kmem_cache *s, struct partial_context *pc)
>  /*
>   * Get a partial slab, lock it and return it.
>   */
> -static void *get_partial(struct kmem_cache *s, int node, struct partial_context *pc)
> +static struct slab *get_partial(struct kmem_cache *s, int node,
> +                               struct partial_context *pc)
>  {
> -       void *object;
> +       struct slab *slab;
>         int searchnode = node;
>
>         if (node == NUMA_NO_NODE)
>                 searchnode = numa_mem_id();
>
> -       object = get_partial_node(s, get_node(s, searchnode), pc);
> -       if (object || node != NUMA_NO_NODE)
> -               return object;
> +       slab = get_partial_node(s, get_node(s, searchnode), pc);
> +       if (slab || node != NUMA_NO_NODE)
> +               return slab;
>
>         return get_any_partial(s, pc);
>  }
> @@ -3213,10 +3217,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  new_objects:
>
>         pc.flags = gfpflags;
> -       pc.slab = &slab;
>         pc.orig_size = orig_size;
> -       freelist = get_partial(s, node, &pc);
> -       if (freelist) {
> +       slab = get_partial(s, node, &pc);
> +       if (slab) {
> +               freelist = pc.object;
>                 if (kmem_cache_debug(s)) {
>                         /*
>                          * For debug caches here we had to go through
> @@ -3408,12 +3412,11 @@ static void *__slab_alloc_node(struct kmem_cache *s,
>         void *object;
>
>         pc.flags = gfpflags;
> -       pc.slab = &slab;
>         pc.orig_size = orig_size;
> -       object = get_partial(s, node, &pc);
> +       slab = get_partial(s, node, &pc);
>
> -       if (object)
> -               return object;
> +       if (slab)
> +               return pc.object;
>
>         slab = new_slab(s, gfpflags, node);
>         if (unlikely(!slab)) {

Looks good to me,
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 3/9] slub: Keep track of whether slub is on the per-node partial list
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
  2023-11-02  3:23 ` [PATCH v5 1/9] slub: Reflow ___slab_alloc() chengming.zhou
  2023-11-02  3:23 ` [PATCH v5 2/9] slub: Change get_partial() interfaces to return slab chengming.zhou
@ 2023-11-02  3:23 ` chengming.zhou
  2023-11-22  1:21   ` Hyeonggon Yoo
  2023-11-02  3:23 ` [PATCH v5 4/9] slub: Prepare __slab_free() for unfrozen partial slab out of node " chengming.zhou
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou,
	Matthew Wilcox

From: Chengming Zhou <zhouchengming@bytedance.com>

Now we rely on the "frozen" bit to see if we should manipulate the
slab->slab_list, which will be changed in the following patch.

Instead we introduce another way to keep track of whether slub is on
the per-node partial list, here we reuse the PG_workingset bit.

We use __set_bit and __clear_bit directly instead of the atomic version
for better performance and it's safe since it's protected by the slub
node list_lock.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 03384cd965c5..eed8ae0dbaf9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2116,6 +2116,25 @@ static void discard_slab(struct kmem_cache *s, struct slab *slab)
 	free_slab(s, slab);
 }
 
+/*
+ * SLUB reuses PG_workingset bit to keep track of whether it's on
+ * the per-node partial list.
+ */
+static inline bool slab_test_node_partial(const struct slab *slab)
+{
+	return folio_test_workingset((struct folio *)slab_folio(slab));
+}
+
+static inline void slab_set_node_partial(struct slab *slab)
+{
+	__set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+}
+
+static inline void slab_clear_node_partial(struct slab *slab)
+{
+	__clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+}
+
 /*
  * Management of partially allocated slabs.
  */
@@ -2127,6 +2146,7 @@ __add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
 		list_add_tail(&slab->slab_list, &n->partial);
 	else
 		list_add(&slab->slab_list, &n->partial);
+	slab_set_node_partial(slab);
 }
 
 static inline void add_partial(struct kmem_cache_node *n,
@@ -2141,6 +2161,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
 {
 	lockdep_assert_held(&n->list_lock);
 	list_del(&slab->slab_list);
+	slab_clear_node_partial(slab);
 	n->nr_partial--;
 }
 
@@ -4833,6 +4854,7 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
 
 			if (free == slab->objects) {
 				list_move(&slab->slab_list, &discard);
+				slab_clear_node_partial(slab);
 				n->nr_partial--;
 				dec_slabs_node(s, node, slab->objects);
 			} else if (free <= SHRINK_PROMOTE_MAX)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 3/9] slub: Keep track of whether slub is on the per-node partial list
  2023-11-02  3:23 ` [PATCH v5 3/9] slub: Keep track of whether slub is on the per-node partial list chengming.zhou
@ 2023-11-22  1:21   ` Hyeonggon Yoo
  0 siblings, 0 replies; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-11-22  1:21 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou,
	Matthew Wilcox

On Thu, Nov 2, 2023 at 12:24 PM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> Now we rely on the "frozen" bit to see if we should manipulate the
> slab->slab_list, which will be changed in the following patch.
>
> Instead we introduce another way to keep track of whether slub is on
> the per-node partial list, here we reuse the PG_workingset bit.
>
> We use __set_bit and __clear_bit directly instead of the atomic version
> for better performance and it's safe since it's protected by the slub
> node list_lock.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 03384cd965c5..eed8ae0dbaf9 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2116,6 +2116,25 @@ static void discard_slab(struct kmem_cache *s, struct slab *slab)
>         free_slab(s, slab);
>  }
>
> +/*
> + * SLUB reuses PG_workingset bit to keep track of whether it's on
> + * the per-node partial list.
> + */
> +static inline bool slab_test_node_partial(const struct slab *slab)
> +{
> +       return folio_test_workingset((struct folio *)slab_folio(slab));
> +}
> +
> +static inline void slab_set_node_partial(struct slab *slab)
> +{
> +       __set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
> +}
> +
> +static inline void slab_clear_node_partial(struct slab *slab)
> +{
> +       __clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
> +}
> +
>  /*
>   * Management of partially allocated slabs.
>   */
> @@ -2127,6 +2146,7 @@ __add_partial(struct kmem_cache_node *n, struct slab *slab, int tail)
>                 list_add_tail(&slab->slab_list, &n->partial);
>         else
>                 list_add(&slab->slab_list, &n->partial);
> +       slab_set_node_partial(slab);
>  }
>
>  static inline void add_partial(struct kmem_cache_node *n,
> @@ -2141,6 +2161,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
>  {
>         lockdep_assert_held(&n->list_lock);
>         list_del(&slab->slab_list);
> +       slab_clear_node_partial(slab);
>         n->nr_partial--;
>  }
>
> @@ -4833,6 +4854,7 @@ static int __kmem_cache_do_shrink(struct kmem_cache *s)
>
>                         if (free == slab->objects) {
>                                 list_move(&slab->slab_list, &discard);
> +                               slab_clear_node_partial(slab);
>                                 n->nr_partial--;
>                                 dec_slabs_node(s, node, slab->objects);
>                         } else if (free <= SHRINK_PROMOTE_MAX)
> --

Looks good to me,
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 4/9] slub: Prepare __slab_free() for unfrozen partial slab out of node partial list
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (2 preceding siblings ...)
  2023-11-02  3:23 ` [PATCH v5 3/9] slub: Keep track of whether slub is on the per-node partial list chengming.zhou
@ 2023-11-02  3:23 ` chengming.zhou
  2023-12-03  6:01   ` Hyeonggon Yoo
  2023-11-02  3:23 ` [PATCH v5 5/9] slub: Introduce freeze_slab() chengming.zhou
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

Now the partially empty slub will be frozen when taken out of node partial
list, so the __slab_free() will know from "was_frozen" that the partially
empty slab is not on node partial list and is a cpu or cpu partial slab
of some cpu.

But we will change this, make partial slabs leave the node partial list
with unfrozen state, so we need to change __slab_free() to use the new
slab_test_node_partial() we just introduced.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index eed8ae0dbaf9..1880b483350e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3631,6 +3631,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	unsigned long counters;
 	struct kmem_cache_node *n = NULL;
 	unsigned long flags;
+	bool on_node_partial;
 
 	stat(s, FREE_SLOWPATH);
 
@@ -3678,6 +3679,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 				 */
 				spin_lock_irqsave(&n->list_lock, flags);
 
+				on_node_partial = slab_test_node_partial(slab);
 			}
 		}
 
@@ -3706,6 +3708,15 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 		return;
 	}
 
+	/*
+	 * This slab was partially empty but not on the per-node partial list,
+	 * in which case we shouldn't manipulate its list, just return.
+	 */
+	if (prior && !on_node_partial) {
+		spin_unlock_irqrestore(&n->list_lock, flags);
+		return;
+	}
+
 	if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
 		goto slab_empty;
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 4/9] slub: Prepare __slab_free() for unfrozen partial slab out of node partial list
  2023-11-02  3:23 ` [PATCH v5 4/9] slub: Prepare __slab_free() for unfrozen partial slab out of node " chengming.zhou
@ 2023-12-03  6:01   ` Hyeonggon Yoo
  0 siblings, 0 replies; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-12-03  6:01 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Thu, Nov 2, 2023 at 12:24 PM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> Now the partially empty slub will be frozen when taken out of node partial
> list, so the __slab_free() will know from "was_frozen" that the partially
> empty slab is not on node partial list and is a cpu or cpu partial slab
> of some cpu.
>
> But we will change this, make partial slabs leave the node partial list
> with unfrozen state, so we need to change __slab_free() to use the new
> slab_test_node_partial() we just introduced.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index eed8ae0dbaf9..1880b483350e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3631,6 +3631,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>         unsigned long counters;
>         struct kmem_cache_node *n = NULL;
>         unsigned long flags;
> +       bool on_node_partial;
>
>         stat(s, FREE_SLOWPATH);
>
> @@ -3678,6 +3679,7 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>                                  */
>                                 spin_lock_irqsave(&n->list_lock, flags);
>
> +                               on_node_partial = slab_test_node_partial(slab);
>                         }
>                 }
>
> @@ -3706,6 +3708,15 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>                 return;
>         }
>
> +       /*
> +        * This slab was partially empty but not on the per-node partial list,
> +        * in which case we shouldn't manipulate its list, just return.
> +        */
> +       if (prior && !on_node_partial) {
> +               spin_unlock_irqrestore(&n->list_lock, flags);
> +               return;
> +       }
> +
>         if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
>                 goto slab_empty;
>

Looks good to me,
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 5/9] slub: Introduce freeze_slab()
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (3 preceding siblings ...)
  2023-11-02  3:23 ` [PATCH v5 4/9] slub: Prepare __slab_free() for unfrozen partial slab out of node " chengming.zhou
@ 2023-11-02  3:23 ` chengming.zhou
  2023-11-02  3:23 ` [PATCH v5 6/9] slub: Delay freezing of partial slabs chengming.zhou
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

We will have unfrozen slabs out of the node partial list later, so we
need a freeze_slab() function to freeze the partial slab and get its
freelist.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 1880b483350e..edf567971679 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3098,6 +3098,33 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
 	return freelist;
 }
 
+/*
+ * Freeze the partial slab and return the pointer to the freelist.
+ */
+static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
+{
+	struct slab new;
+	unsigned long counters;
+	void *freelist;
+
+	do {
+		freelist = slab->freelist;
+		counters = slab->counters;
+
+		new.counters = counters;
+		VM_BUG_ON(new.frozen);
+
+		new.inuse = slab->objects;
+		new.frozen = 1;
+
+	} while (!slab_update_freelist(s, slab,
+		freelist, counters,
+		NULL, new.counters,
+		"freeze_slab"));
+
+	return freelist;
+}
+
 /*
  * Slow path. The lockless freelist is empty or we need to perform
  * debugging duties.
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (4 preceding siblings ...)
  2023-11-02  3:23 ` [PATCH v5 5/9] slub: Introduce freeze_slab() chengming.zhou
@ 2023-11-02  3:23 ` chengming.zhou
  2023-11-14  5:44   ` kernel test robot
                     ` (2 more replies)
  2023-11-02  3:23 ` [PATCH v5 7/9] slub: Optimize deactivate_slab() chengming.zhou
                   ` (3 subsequent siblings)
  9 siblings, 3 replies; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

Now we will freeze slabs when moving them out of node partial list to
cpu partial list, this method needs two cmpxchg_double operations:

1. freeze slab (acquire_slab()) under the node list_lock
2. get_freelist() when pick used in ___slab_alloc()

Actually we don't need to freeze when moving slabs out of node partial
list, we can delay freezing to when use slab freelist in ___slab_alloc(),
so we can save one cmpxchg_double().

And there are other good points:
 - The moving of slabs between node partial list and cpu partial list
   becomes simpler, since we don't need to freeze or unfreeze at all.

 - The node list_lock contention would be less, since we don't need to
   freeze any slab under the node list_lock.

We can achieve this because there is no concurrent path would manipulate
the partial slab list except the __slab_free() path, which is now
serialized by slab_test_node_partial() under the list_lock.

Since the slab returned by get_partial() interfaces is not frozen anymore
and no freelist is returned in the partial_context, so we need to use the
introduced freeze_slab() to freeze it and get its freelist.

Similarly, the slabs on the CPU partial list are not frozen anymore,
we need to freeze_slab() on it before use.

We can now delete acquire_slab() as it became unused.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 113 +++++++++++-------------------------------------------
 1 file changed, 23 insertions(+), 90 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index edf567971679..bcb5b2c4e213 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2234,51 +2234,6 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s,
 	return object;
 }
 
-/*
- * Remove slab from the partial list, freeze it and
- * return the pointer to the freelist.
- *
- * Returns a list of objects or NULL if it fails.
- */
-static inline void *acquire_slab(struct kmem_cache *s,
-		struct kmem_cache_node *n, struct slab *slab,
-		int mode)
-{
-	void *freelist;
-	unsigned long counters;
-	struct slab new;
-
-	lockdep_assert_held(&n->list_lock);
-
-	/*
-	 * Zap the freelist and set the frozen bit.
-	 * The old freelist is the list of objects for the
-	 * per cpu allocation list.
-	 */
-	freelist = slab->freelist;
-	counters = slab->counters;
-	new.counters = counters;
-	if (mode) {
-		new.inuse = slab->objects;
-		new.freelist = NULL;
-	} else {
-		new.freelist = freelist;
-	}
-
-	VM_BUG_ON(new.frozen);
-	new.frozen = 1;
-
-	if (!__slab_update_freelist(s, slab,
-			freelist, counters,
-			new.freelist, new.counters,
-			"acquire_slab"))
-		return NULL;
-
-	remove_partial(n, slab);
-	WARN_ON(!freelist);
-	return freelist;
-}
-
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
 #else
@@ -2295,7 +2250,6 @@ static struct slab *get_partial_node(struct kmem_cache *s,
 				     struct partial_context *pc)
 {
 	struct slab *slab, *slab2, *partial = NULL;
-	void *object = NULL;
 	unsigned long flags;
 	unsigned int partial_slabs = 0;
 
@@ -2314,7 +2268,7 @@ static struct slab *get_partial_node(struct kmem_cache *s,
 			continue;
 
 		if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
-			object = alloc_single_from_partial(s, n, slab,
+			void *object = alloc_single_from_partial(s, n, slab,
 							pc->orig_size);
 			if (object) {
 				partial = slab;
@@ -2324,13 +2278,10 @@ static struct slab *get_partial_node(struct kmem_cache *s,
 			continue;
 		}
 
-		object = acquire_slab(s, n, slab, object == NULL);
-		if (!object)
-			break;
+		remove_partial(n, slab);
 
 		if (!partial) {
 			partial = slab;
-			pc->object = object;
 			stat(s, ALLOC_FROM_PARTIAL);
 		} else {
 			put_cpu_partial(s, slab, 0);
@@ -2629,9 +2580,6 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
 	unsigned long flags = 0;
 
 	while (partial_slab) {
-		struct slab new;
-		struct slab old;
-
 		slab = partial_slab;
 		partial_slab = slab->next;
 
@@ -2644,23 +2592,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
 			spin_lock_irqsave(&n->list_lock, flags);
 		}
 
-		do {
-
-			old.freelist = slab->freelist;
-			old.counters = slab->counters;
-			VM_BUG_ON(!old.frozen);
-
-			new.counters = old.counters;
-			new.freelist = old.freelist;
-
-			new.frozen = 0;
-
-		} while (!__slab_update_freelist(s, slab,
-				old.freelist, old.counters,
-				new.freelist, new.counters,
-				"unfreezing slab"));
-
-		if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
+		if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
 			slab->next = slab_to_discard;
 			slab_to_discard = slab;
 		} else {
@@ -3167,7 +3099,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			node = NUMA_NO_NODE;
 		goto new_slab;
 	}
-redo:
 
 	if (unlikely(!node_match(slab, node))) {
 		/*
@@ -3243,7 +3174,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 new_slab:
 
-	if (slub_percpu_partial(c)) {
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+	while (slub_percpu_partial(c)) {
 		local_lock_irqsave(&s->cpu_slab->lock, flags);
 		if (unlikely(c->slab)) {
 			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
@@ -3255,12 +3187,22 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			goto new_objects;
 		}
 
-		slab = c->slab = slub_percpu_partial(c);
+		slab = slub_percpu_partial(c);
 		slub_set_percpu_partial(c, slab);
 		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		stat(s, CPU_PARTIAL_ALLOC);
-		goto redo;
+
+		if (unlikely(!node_match(slab, node) ||
+			     !pfmemalloc_match(slab, gfpflags))) {
+			slab->next = NULL;
+			__unfreeze_partials(s, slab);
+			continue;
+		}
+
+		freelist = freeze_slab(s, slab);
+		goto retry_load_slab;
 	}
+#endif
 
 new_objects:
 
@@ -3268,8 +3210,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	pc.orig_size = orig_size;
 	slab = get_partial(s, node, &pc);
 	if (slab) {
-		freelist = pc.object;
 		if (kmem_cache_debug(s)) {
+			freelist = pc.object;
 			/*
 			 * For debug caches here we had to go through
 			 * alloc_single_from_partial() so just store the
@@ -3281,6 +3223,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			return freelist;
 		}
 
+		freelist = freeze_slab(s, slab);
 		goto retry_load_slab;
 	}
 
@@ -3682,18 +3625,8 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 		was_frozen = new.frozen;
 		new.inuse -= cnt;
 		if ((!new.inuse || !prior) && !was_frozen) {
-
-			if (kmem_cache_has_cpu_partial(s) && !prior) {
-
-				/*
-				 * Slab was on no list before and will be
-				 * partially empty
-				 * We can defer the list move and instead
-				 * freeze it.
-				 */
-				new.frozen = 1;
-
-			} else { /* Needs to be taken off a list */
+			/* Needs to be taken off a list */
+			if (!kmem_cache_has_cpu_partial(s) || prior) {
 
 				n = get_node(s, slab_nid(slab));
 				/*
@@ -3723,9 +3656,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 			 * activity can be necessary.
 			 */
 			stat(s, FREE_FROZEN);
-		} else if (new.frozen) {
+		} else if (kmem_cache_has_cpu_partial(s) && !prior) {
 			/*
-			 * If we just froze the slab then put it onto the
+			 * If we started with a full slab then put it onto the
 			 * per cpu partial list.
 			 */
 			put_cpu_partial(s, slab, 1);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-02  3:23 ` [PATCH v5 6/9] slub: Delay freezing of partial slabs chengming.zhou
@ 2023-11-14  5:44   ` kernel test robot
  2023-11-20 18:49   ` Mark Brown
  2023-12-03  6:53   ` Hyeonggon Yoo
  2 siblings, 0 replies; 44+ messages in thread
From: kernel test robot @ 2023-11-14  5:44 UTC (permalink / raw)
  To: chengming.zhou
  Cc: oe-lkp, lkp, Vlastimil Babka, Hyeonggon Yoo, linux-mm, ying.huang,
	feng.tang, fengwei.yin, cl, penberg, rientjes, iamjoonsoo.kim,
	akpm, roman.gushchin, linux-kernel, chengming.zhou,
	Chengming Zhou, oliver.sang



Hello,

kernel test robot noticed a 34.2% improvement of stress-ng.rawudp.ops_per_sec on:


commit: b73583016198aecef1dea07033a808da7875ede1 ("[PATCH v5 6/9] slub: Delay freezing of partial slabs")
url: https://github.com/intel-lab-lkp/linux/commits/chengming-zhou-linux-dev/slub-Reflow-___slab_alloc/20231102-112748
base: git://git.kernel.org/cgit/linux/kernel/git/vbabka/slab.git for-next
patch link: https://lore.kernel.org/all/20231102032330.1036151-7-chengming.zhou@linux.dev/
patch subject: [PATCH v5 6/9] slub: Delay freezing of partial slabs

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	class: network
	test: rawudp
	cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231114/202311141204.e918dbda-oliver.sang@intel.com

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  network/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp8/rawudp/stress-ng/60s

commit: 
  fae950af3a ("slub: Introduce freeze_slab()")
  b735830161 ("slub: Delay freezing of partial slabs")

fae950af3a484ec4 b73583016198aecef1dea07033a 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      0.51 ± 15%      +0.1        0.66 ±  3%  mpstat.cpu.all.usr%
     30758 ± 11%     +41.0%      43361 ±  4%  vmstat.system.cs
  14033150           +13.1%   15877158        numa-numastat.node0.local_node
  14069328           +13.1%   15912745        numa-numastat.node0.numa_hit
  14246260           +11.6%   15894273        numa-numastat.node1.local_node
  14274705           +11.6%   15927700        numa-numastat.node1.numa_hit
  14069250           +13.1%   15912955        numa-vmstat.node0.numa_hit
  14033072           +13.1%   15877368        numa-vmstat.node0.numa_local
  14274619           +11.6%   15927488        numa-vmstat.node1.numa_hit
  14246174           +11.6%   15894061        numa-vmstat.node1.numa_local
      0.30 ± 19%     +38.5%       0.42 ±  7%  sched_debug.cfs_rq:/.nr_running.stddev
      2434 ±  4%     -10.6%       2176 ±  6%  sched_debug.cpu.curr->pid.avg
      1079 ± 20%     +36.7%       1474 ± 10%  sched_debug.cpu.curr->pid.stddev
     18781 ±  2%     +28.4%      24113 ±  2%  sched_debug.cpu.nr_switches.avg
     11540 ±  4%     +41.8%      16360 ±  5%  sched_debug.cpu.nr_switches.min
      0.03           +33.3%       0.04        stress-ng.rawudp.MB_recv'd_per_sec
   4064768           +34.2%    5456370        stress-ng.rawudp.ops
     67734           +34.2%      90929        stress-ng.rawudp.ops_per_sec
    757287 ±  2%     +31.6%     996382        stress-ng.time.involuntary_context_switches
   1112941 ±  2%     +40.5%    1564161 ±  2%  stress-ng.time.voluntary_context_switches
  28346406           +12.3%   31843356        proc-vmstat.numa_hit
  28281784           +12.3%   31774341        proc-vmstat.numa_local
    103553 ±  2%      +6.3%     110100 ±  3%  proc-vmstat.numa_pte_updates
    128452           +10.1%     141426 ±  2%  proc-vmstat.pgactivate
  70199954           +10.8%   77804409        proc-vmstat.pgalloc_normal
  70032258           +10.8%   77627740        proc-vmstat.pgfree
  81626657 ± 11%     +20.5%   98326255 ±  3%  perf-stat.i.branch-misses
 8.365e+08 ± 11%     +16.8%   9.77e+08 ±  3%  perf-stat.i.cache-references
     31476 ± 12%     +42.8%      44947 ±  4%  perf-stat.i.context-switches
      2.46 ±  2%      -5.3%       2.33        perf-stat.i.cpi
      7610 ± 11%     +51.7%      11547 ±  3%  perf-stat.i.cpu-migrations
      1807 ±  6%     -35.8%       1160 ±  9%  perf-stat.i.metric.K/sec
  67795798 ± 11%     +17.2%   79489592 ±  3%  perf-stat.i.node-load-misses
     42.11 ±  5%      +2.2       44.32        perf-stat.i.node-store-miss-rate%
  53640005 ± 11%     +16.1%   62254015 ±  3%  perf-stat.i.node-store-misses
      4.42            +3.9%       4.59        perf-stat.overall.MPKI
      0.53            +0.0        0.57        perf-stat.overall.branch-miss-rate%
      2.46            -4.0%       2.36        perf-stat.overall.cpi
    556.40            -7.6%     513.88        perf-stat.overall.cycles-between-cache-misses
      0.41            +4.2%       0.42        perf-stat.overall.ipc
     59.01            +0.9       59.88        perf-stat.overall.node-load-miss-rate%
  80834242 ± 10%     +19.7%   96751983 ±  2%  perf-stat.ps.branch-misses
 3.595e+08 ± 10%     +14.6%  4.121e+08 ±  3%  perf-stat.ps.cache-misses
 8.308e+08 ± 10%     +16.1%  9.643e+08 ±  3%  perf-stat.ps.cache-references
     31157 ± 10%     +42.0%      44245 ±  3%  perf-stat.ps.context-switches
      7566 ± 10%     +50.5%      11389 ±  3%  perf-stat.ps.cpu-migrations
  67342566 ± 10%     +16.5%   78472343 ±  2%  perf-stat.ps.node-load-misses
  53274045 ± 10%     +15.4%   61453717 ±  2%  perf-stat.ps.node-store-misses
  66741521 ± 10%     +13.4%   75684953 ±  2%  perf-stat.ps.node-stores
 5.522e+12            +4.0%  5.742e+12        perf-stat.total.instructions
      0.40 ±  5%     -28.4%       0.29 ±  8%  perf-sched.sch_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      0.58 ±  5%      -9.2%       0.53 ±  4%  perf-sched.sch_delay.avg.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_datagram.skb_recv_datagram
      0.37 ± 11%     -38.4%       0.22 ± 16%  perf-sched.sch_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    128.50 ± 19%     -30.6%      89.17 ± 18%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      0.52 ±  3%     -13.8%       0.45 ±  4%  perf-sched.total_sch_delay.average.ms
      9.51 ±  5%     -24.7%       7.16 ±  5%  perf-sched.total_wait_and_delay.average.ms
    150858 ±  3%     +33.8%     201844 ±  6%  perf-sched.total_wait_and_delay.count.ms
      8.99 ±  5%     -25.3%       6.71 ±  5%  perf-sched.total_wait_time.average.ms
      4.32 ±  4%     -34.8%       2.81 ±  7%  perf-sched.wait_and_delay.avg.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
      3.22 ±  8%     -34.3%       2.12 ±  8%  perf-sched.wait_and_delay.avg.ms.__cond_resched.slab_pre_alloc_hook.constprop.0.kmem_cache_alloc_lru
     47.40 ±  6%     -13.9%      40.79 ± 11%  perf-sched.wait_and_delay.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      4.93 ±  4%     -29.4%       3.48 ±  7%  perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      2.14 ±  5%     -16.3%       1.79 ±  4%  perf-sched.wait_and_delay.avg.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_datagram.skb_recv_datagram
    508.50 ±  7%     +39.9%     711.50 ±  9%  perf-sched.wait_and_delay.count.__cond_resched.aa_sk_perm.security_socket_recvmsg.sock_recvmsg.__sys_recvfrom
      1533 ±  6%     +22.0%       1871 ±  6%  perf-sched.wait_and_delay.count.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
      1003 ±  4%     +28.4%       1288 ±  6%  perf-sched.wait_and_delay.count.__cond_resched.slab_pre_alloc_hook.constprop.0.kmem_cache_alloc_lru
     48937 ±  3%     +28.2%      62737 ±  5%  perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
     86063 ±  4%     +39.4%     120002 ±  7%  perf-sched.wait_and_delay.count.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_datagram.skb_recv_datagram
      6201 ±  3%     +14.1%       7077 ± 11%  perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     85.45 ±120%    +384.6%     414.09 ± 82%  perf-sched.wait_and_delay.max.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
     22.16 ±105%    +162.1%      58.07 ± 36%  perf-sched.wait_and_delay.max.ms.__cond_resched.stop_one_cpu.sched_exec.bprm_execve.part
      1.37 ± 17%     -38.1%       0.85 ± 26%  perf-sched.wait_time.avg.ms.__cond_resched.__kmem_cache_alloc_node.kmalloc_trace.apparmor_sk_alloc_security.security_sk_alloc
      2.32 ± 14%     -48.3%       1.20 ± 33%  perf-sched.wait_time.avg.ms.__cond_resched.aa_sk_perm.security_socket_sendmsg.sock_sendmsg.__sys_sendto
      4.31 ±  4%     -34.9%       2.81 ±  7%  perf-sched.wait_time.avg.ms.__cond_resched.dput.__fput.__x64_sys_close.do_syscall_64
      2.24 ± 11%     -36.7%       1.42 ± 24%  perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_node.__alloc_skb.alloc_skb_with_frags.sock_alloc_send_pskb
      3.89 ± 41%     -57.2%       1.67 ± 36%  perf-sched.wait_time.avg.ms.__cond_resched.lock_sock_nested.raw_destroy.sk_common_release.inet_release
      3.22 ±  8%     -34.2%       2.12 ±  8%  perf-sched.wait_time.avg.ms.__cond_resched.slab_pre_alloc_hook.constprop.0.kmem_cache_alloc_lru
     47.20 ±  6%     -14.0%      40.58 ± 11%  perf-sched.wait_time.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      4.52 ±  4%     -29.5%       3.19 ±  7%  perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
      5.84 ± 24%     -45.9%       3.16 ± 40%  perf-sched.wait_time.avg.ms.schedule_preempt_disabled.__mutex_lock.constprop.0.ip_ra_control
      1.56 ±  5%     -19.0%       1.26 ±  4%  perf-sched.wait_time.avg.ms.schedule_timeout.__skb_wait_for_more_packets.__skb_recv_datagram.skb_recv_datagram
    107.91 ± 77%    +281.4%     411.55 ± 82%  perf-sched.wait_time.max.ms.__cond_resched.mutex_lock.perf_poll.do_poll.constprop
      7.19            -2.9        4.31 ±  4%  perf-profile.calltrace.cycles-pp.get_partial_node.___slab_alloc.kmem_cache_alloc.skb_clone.raw_v4_input
     10.26            -2.8        7.48 ±  3%  perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc.skb_clone.raw_v4_input.ip_protocol_deliver_rcu
     12.04            -2.7        9.31 ±  2%  perf-profile.calltrace.cycles-pp.kmem_cache_alloc.skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
      6.51            -2.7        3.79 ±  5%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.get_partial_node.___slab_alloc.kmem_cache_alloc.skb_clone
     12.55            -2.7        9.83 ±  2%  perf-profile.calltrace.cycles-pp.skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
      6.37            -2.7        3.67 ±  5%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.get_partial_node.___slab_alloc.kmem_cache_alloc
      6.44 ±  2%      -2.1        4.34 ±  5%  perf-profile.calltrace.cycles-pp.__unfreeze_partials.raw_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
      3.03 ±  4%      -2.0        0.99 ± 28%  perf-profile.calltrace.cycles-pp.__unfreeze_partials.inet_sock_destruct.__sk_destruct.rcu_do_batch.rcu_core
      5.59 ±  2%      -1.9        3.67 ±  6%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__unfreeze_partials.raw_recvmsg.inet_recvmsg
      2.81 ±  4%      -1.9        0.92 ± 29%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__unfreeze_partials.inet_sock_destruct.__sk_destruct.rcu_do_batch
      2.69 ±  4%      -1.9        0.82 ± 30%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__unfreeze_partials.inet_sock_destruct.__sk_destruct
      5.88 ±  2%      -1.8        4.07 ±  5%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__unfreeze_partials.raw_recvmsg.inet_recvmsg.sock_recvmsg
     15.12            -1.3       13.86 ±  2%  perf-profile.calltrace.cycles-pp.inet_sock_destruct.__sk_destruct.rcu_do_batch.rcu_core.__do_softirq
     12.74            -1.1       11.66 ±  2%  perf-profile.calltrace.cycles-pp.__sk_destruct.rcu_do_batch.rcu_core.__do_softirq.run_ksoftirqd
     13.45            -1.1       12.39 ±  2%  perf-profile.calltrace.cycles-pp.rcu_do_batch.rcu_core.__do_softirq.run_ksoftirqd.smpboot_thread_fn
     13.46            -1.1       12.40 ±  2%  perf-profile.calltrace.cycles-pp.rcu_core.__do_softirq.run_ksoftirqd.smpboot_thread_fn.kthread
     13.48            -1.1       12.42 ±  2%  perf-profile.calltrace.cycles-pp.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     13.49            -1.1       12.43 ±  2%  perf-profile.calltrace.cycles-pp.ret_from_fork_asm
     13.49            -1.1       12.43 ±  2%  perf-profile.calltrace.cycles-pp.ret_from_fork.ret_from_fork_asm
     13.49            -1.1       12.43 ±  2%  perf-profile.calltrace.cycles-pp.kthread.ret_from_fork.ret_from_fork_asm
     13.46            -1.1       12.40 ±  2%  perf-profile.calltrace.cycles-pp.__do_softirq.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork
     13.46            -1.1       12.40 ±  2%  perf-profile.calltrace.cycles-pp.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     11.50            -0.6       10.88 ±  2%  perf-profile.calltrace.cycles-pp.raw_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
     11.54            -0.6       10.93 ±  2%  perf-profile.calltrace.cycles-pp.inet_recvmsg.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64
     11.72            -0.5       11.18 ±  2%  perf-profile.calltrace.cycles-pp.sock_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.18 ±  3%      -0.5        0.71 ±  4%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.get_partial_node.get_any_partial.___slab_alloc.kmem_cache_alloc
      1.15 ±  3%      -0.5        0.68 ±  4%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.get_partial_node.get_any_partial.___slab_alloc
      1.23 ±  3%      -0.5        0.77 ±  3%  perf-profile.calltrace.cycles-pp.get_partial_node.get_any_partial.___slab_alloc.kmem_cache_alloc.skb_clone
     12.05            -0.4       11.61 ±  2%  perf-profile.calltrace.cycles-pp.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe.recv
     12.07            -0.4       11.64 ±  2%  perf-profile.calltrace.cycles-pp.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe.recv
      1.38 ±  3%      -0.4        0.94 ±  3%  perf-profile.calltrace.cycles-pp.get_any_partial.___slab_alloc.kmem_cache_alloc.skb_clone.raw_v4_input
      1.84 ±  2%      -0.3        1.56 ±  6%  perf-profile.calltrace.cycles-pp.__slab_free.inet_sock_destruct.__sk_destruct.rcu_do_batch.rcu_core
      0.78 ±  7%      -0.1        0.66 ±  5%  perf-profile.calltrace.cycles-pp.__do_softirq.do_softirq.__local_bh_enable_ip.sk_common_release.inet_release
      0.78 ±  7%      -0.1        0.66 ±  5%  perf-profile.calltrace.cycles-pp.do_softirq.__local_bh_enable_ip.sk_common_release.inet_release.__sock_release
      0.78 ±  7%      -0.1        0.66 ±  5%  perf-profile.calltrace.cycles-pp.rcu_core.__do_softirq.do_softirq.__local_bh_enable_ip.sk_common_release
      0.78 ±  7%      -0.1        0.67 ±  5%  perf-profile.calltrace.cycles-pp.__local_bh_enable_ip.sk_common_release.inet_release.__sock_release.sock_close
      0.52            +0.0        0.55        perf-profile.calltrace.cycles-pp.icmp_route_lookup.__icmp_send.__udp4_lib_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish
      0.56            +0.0        0.60 ±  2%  perf-profile.calltrace.cycles-pp.new_inode_pseudo.sock_alloc.__sock_create.__sys_socket.__x64_sys_socket
      0.57            +0.0        0.61 ±  2%  perf-profile.calltrace.cycles-pp.sock_alloc.__sock_create.__sys_socket.__x64_sys_socket.do_syscall_64
      0.52            +0.0        0.57 ±  2%  perf-profile.calltrace.cycles-pp.alloc_file.alloc_file_pseudo.sock_alloc_file.__sys_socket.__x64_sys_socket
      0.73            +0.1        0.78        perf-profile.calltrace.cycles-pp.sock_alloc_file.__sys_socket.__x64_sys_socket.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.72            +0.1        0.78 ±  2%  perf-profile.calltrace.cycles-pp.alloc_file_pseudo.sock_alloc_file.__sys_socket.__x64_sys_socket.do_syscall_64
      1.03 ±  3%      +0.1        1.10 ±  3%  perf-profile.calltrace.cycles-pp.sk_filter_trim_cap.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu
      0.59 ±  3%      +0.1        0.68 ±  2%  perf-profile.calltrace.cycles-pp.allocate_slab.___slab_alloc.kmem_cache_alloc.skb_clone.raw_v4_input
      1.14            +0.1        1.22        perf-profile.calltrace.cycles-pp.dst_release.ipv4_pktinfo_prepare.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu
      1.96            +0.1        2.05        perf-profile.calltrace.cycles-pp.skb_release_data.kfree_skb_reason.inet_sock_destruct.__sk_destruct.rcu_do_batch
      1.64            +0.1        1.75 ±  2%  perf-profile.calltrace.cycles-pp.ipv4_pktinfo_prepare.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
      0.42 ± 44%      +0.1        0.55 ±  2%  perf-profile.calltrace.cycles-pp.alloc_empty_file.alloc_file.alloc_file_pseudo.sock_alloc_file.__sys_socket
      2.42 ±  3%      +0.1        2.55 ±  3%  perf-profile.calltrace.cycles-pp.icmp_socket_deliver.icmp_unreach.icmp_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish
      2.41 ±  3%      +0.1        2.54 ±  3%  perf-profile.calltrace.cycles-pp.raw_icmp_error.icmp_socket_deliver.icmp_unreach.icmp_rcv.ip_protocol_deliver_rcu
      2.47 ±  3%      +0.1        2.60 ±  3%  perf-profile.calltrace.cycles-pp.icmp_unreach.icmp_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
      0.66            +0.1        0.80        perf-profile.calltrace.cycles-pp.copyout._copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.raw_recvmsg
      2.70 ±  2%      +0.2        2.84 ±  3%  perf-profile.calltrace.cycles-pp.icmp_rcv.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog
      0.70            +0.2        0.88        perf-profile.calltrace.cycles-pp._copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.raw_recvmsg.inet_recvmsg
      0.62 ±  7%      +0.2        0.79 ± 11%  perf-profile.calltrace.cycles-pp.free_unref_page.inet_sock_destruct.__sk_destruct.rcu_do_batch.rcu_core
      0.66 ±  2%      +0.2        0.85 ±  2%  perf-profile.calltrace.cycles-pp.__check_object_size.simple_copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.raw_recvmsg
      0.66            +0.2        0.84        perf-profile.calltrace.cycles-pp.skb_release_data.consume_skb.raw_recvmsg.inet_recvmsg.sock_recvmsg
      0.70 ±  2%      +0.2        0.90 ±  2%  perf-profile.calltrace.cycles-pp.simple_copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter.raw_recvmsg.inet_recvmsg
      0.94            +0.3        1.24        perf-profile.calltrace.cycles-pp.consume_skb.raw_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
      0.81            +0.3        1.14 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__skb_try_recv_datagram.__skb_recv_datagram.skb_recv_datagram.raw_recvmsg
      0.99            +0.4        1.37 ±  3%  perf-profile.calltrace.cycles-pp.__skb_try_recv_datagram.__skb_recv_datagram.skb_recv_datagram.raw_recvmsg.inet_recvmsg
      1.53            +0.4        1.94 ±  2%  perf-profile.calltrace.cycles-pp.__skb_datagram_iter.skb_copy_datagram_iter.raw_recvmsg.inet_recvmsg.sock_recvmsg
      1.55 ±  2%      +0.4        1.97 ±  2%  perf-profile.calltrace.cycles-pp.skb_copy_datagram_iter.raw_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
      1.10 ±  2%      +0.4        1.52 ±  3%  perf-profile.calltrace.cycles-pp.__skb_recv_datagram.skb_recv_datagram.raw_recvmsg.inet_recvmsg.sock_recvmsg
      1.12 ±  2%      +0.4        1.54 ±  3%  perf-profile.calltrace.cycles-pp.skb_recv_datagram.raw_recvmsg.inet_recvmsg.sock_recvmsg.__sys_recvfrom
      0.00            +0.5        0.53 ±  2%  perf-profile.calltrace.cycles-pp.alloc_inode.new_inode_pseudo.sock_alloc.__sock_create.__sys_socket
      4.85            +0.6        5.40        perf-profile.calltrace.cycles-pp.__copy_skb_header.__skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
      7.74            +0.6        8.32        perf-profile.calltrace.cycles-pp.kfree_skb_reason.inet_sock_destruct.__sk_destruct.rcu_do_batch.rcu_core
      0.00            +0.6        0.61 ±  2%  perf-profile.calltrace.cycles-pp.check_heap_object.__check_object_size.simple_copy_to_iter.__skb_datagram_iter.skb_copy_datagram_iter
      0.00            +0.7        0.65 ±  3%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__skb_try_recv_datagram.__skb_recv_datagram.skb_recv_datagram
      4.56            +0.7        5.25        perf-profile.calltrace.cycles-pp.sock_def_readable.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input
      7.56            +0.7        8.31        perf-profile.calltrace.cycles-pp.__skb_clone.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
     50.71            +0.8       51.47        perf-profile.calltrace.cycles-pp.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog
      3.67            +0.9        4.59 ±  2%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv
     55.46            +1.0       56.43        perf-profile.calltrace.cycles-pp.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action
     55.45            +1.0       56.42        perf-profile.calltrace.cycles-pp.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core.process_backlog.__napi_poll
     58.02            +1.0       59.01        perf-profile.calltrace.cycles-pp.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.raw_send_hdrinc.raw_sendmsg
     57.99            +1.0       58.98        perf-profile.calltrace.cycles-pp.__do_softirq.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2
     58.01            +1.0       59.00        perf-profile.calltrace.cycles-pp.do_softirq.__local_bh_enable_ip.__dev_queue_xmit.ip_finish_output2.raw_send_hdrinc
     55.61            +1.0       56.60        perf-profile.calltrace.cycles-pp.__netif_receive_skb_one_core.process_backlog.__napi_poll.net_rx_action.__do_softirq
     58.10            +1.0       59.10        perf-profile.calltrace.cycles-pp.ip_finish_output2.raw_send_hdrinc.raw_sendmsg.sock_sendmsg.__sys_sendto
     58.09            +1.0       59.09        perf-profile.calltrace.cycles-pp.__dev_queue_xmit.ip_finish_output2.raw_send_hdrinc.raw_sendmsg.sock_sendmsg
     55.68            +1.0       56.68        perf-profile.calltrace.cycles-pp.process_backlog.__napi_poll.net_rx_action.__do_softirq.do_softirq
     55.70            +1.0       56.70        perf-profile.calltrace.cycles-pp.net_rx_action.__do_softirq.do_softirq.__local_bh_enable_ip.__dev_queue_xmit
     55.68            +1.0       56.68        perf-profile.calltrace.cycles-pp.__napi_poll.net_rx_action.__do_softirq.do_softirq.__local_bh_enable_ip
     59.51            +1.0       60.54        perf-profile.calltrace.cycles-pp.raw_sendmsg.sock_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
     59.58            +1.0       60.61        perf-profile.calltrace.cycles-pp.sock_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
     58.64            +1.0       59.68        perf-profile.calltrace.cycles-pp.raw_send_hdrinc.raw_sendmsg.sock_sendmsg.__sys_sendto.__x64_sys_sendto
     59.70            +1.0       60.74        perf-profile.calltrace.cycles-pp.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto
     59.69            +1.0       60.73        perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto
      4.97            +1.0        6.02 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input
     59.76            +1.0       60.81        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto
     59.78            +1.1       60.84        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.sendto
     59.85            +1.1       60.91        perf-profile.calltrace.cycles-pp.sendto
     22.21            +2.3       24.51        perf-profile.calltrace.cycles-pp.__sock_queue_rcv_skb.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu
     23.65            +2.4       26.03        perf-profile.calltrace.cycles-pp.sock_queue_rcv_skb_reason.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish
     29.20            +2.7       31.86        perf-profile.calltrace.cycles-pp.raw_rcv.raw_v4_input.ip_protocol_deliver_rcu.ip_local_deliver_finish.__netif_receive_skb_one_core
     28.65            -5.8       22.86 ±  2%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     24.32            -5.6       18.77 ±  2%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     10.24 ±  2%      -4.1        6.14 ±  5%  perf-profile.children.cycles-pp.__unfreeze_partials
      8.45            -3.3        5.11 ±  4%  perf-profile.children.cycles-pp.get_partial_node
     10.66            -2.8        7.90 ±  3%  perf-profile.children.cycles-pp.___slab_alloc
     12.63            -2.7        9.92 ±  2%  perf-profile.children.cycles-pp.skb_clone
     12.72            -2.7       10.02 ±  2%  perf-profile.children.cycles-pp.kmem_cache_alloc
     17.07            -1.4       15.72 ±  2%  perf-profile.children.cycles-pp.inet_sock_destruct
     17.58            -1.3       16.26 ±  2%  perf-profile.children.cycles-pp.__sk_destruct
     18.54            -1.3       17.26 ±  2%  perf-profile.children.cycles-pp.rcu_do_batch
     18.54            -1.3       17.27 ±  2%  perf-profile.children.cycles-pp.rcu_core
     13.48            -1.1       12.42 ±  2%  perf-profile.children.cycles-pp.smpboot_thread_fn
     13.49            -1.1       12.43 ±  2%  perf-profile.children.cycles-pp.ret_from_fork_asm
     13.49            -1.1       12.43 ±  2%  perf-profile.children.cycles-pp.ret_from_fork
     13.49            -1.1       12.43 ±  2%  perf-profile.children.cycles-pp.kthread
     13.46            -1.1       12.40 ±  2%  perf-profile.children.cycles-pp.run_ksoftirqd
     11.51            -0.6       10.89 ±  2%  perf-profile.children.cycles-pp.raw_recvmsg
     11.55            -0.6       10.94 ±  2%  perf-profile.children.cycles-pp.inet_recvmsg
     11.73            -0.5       11.19 ±  2%  perf-profile.children.cycles-pp.sock_recvmsg
     12.06            -0.4       11.63 ±  2%  perf-profile.children.cycles-pp.__sys_recvfrom
     12.08            -0.4       11.65 ±  2%  perf-profile.children.cycles-pp.__x64_sys_recvfrom
      1.39 ±  3%      -0.4        0.96 ±  3%  perf-profile.children.cycles-pp.get_any_partial
      3.16            -0.2        3.00 ±  3%  perf-profile.children.cycles-pp.__slab_free
      0.05            +0.0        0.06 ±  6%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.13 ±  2%      +0.0        0.14 ±  3%  perf-profile.children.cycles-pp.apparmor_capable
      0.12 ±  3%      +0.0        0.14 ±  5%  perf-profile.children.cycles-pp.apparmor_file_alloc_security
      0.05 ±  7%      +0.0        0.07 ±  7%  perf-profile.children.cycles-pp.__wake_up_common
      0.15            +0.0        0.16 ±  3%  perf-profile.children.cycles-pp._raw_spin_trylock
      0.34 ±  2%      +0.0        0.36        perf-profile.children.cycles-pp.ip_route_output_key_hash
      0.11 ±  6%      +0.0        0.13 ±  2%  perf-profile.children.cycles-pp.rmqueue
      0.12 ±  3%      +0.0        0.14 ±  4%  perf-profile.children.cycles-pp.release_sock
      0.05            +0.0        0.07 ± 10%  perf-profile.children.cycles-pp.schedule_timeout
      0.05            +0.0        0.07 ± 10%  perf-profile.children.cycles-pp.try_to_wake_up
      0.08 ±  6%      +0.0        0.09 ±  5%  perf-profile.children.cycles-pp.__virt_addr_valid
      0.16 ±  2%      +0.0        0.18 ±  2%  perf-profile.children.cycles-pp.__free_one_page
      0.07 ±  6%      +0.0        0.09 ±  5%  perf-profile.children.cycles-pp.exit_to_user_mode_prepare
      0.16 ±  5%      +0.0        0.18 ±  3%  perf-profile.children.cycles-pp.get_page_from_freelist
      0.32 ±  2%      +0.0        0.34        perf-profile.children.cycles-pp.security_socket_post_create
      0.06 ±  7%      +0.0        0.09 ±  5%  perf-profile.children.cycles-pp.stress_rawudp_server
      0.24            +0.0        0.26        perf-profile.children.cycles-pp._raw_spin_lock_bh
      0.06 ±  6%      +0.0        0.08 ±  8%  perf-profile.children.cycles-pp.__skb_wait_for_more_packets
      0.04 ± 44%      +0.0        0.06 ±  7%  perf-profile.children.cycles-pp.__netif_receive_skb_core
      0.24            +0.0        0.26 ±  2%  perf-profile.children.cycles-pp.init_file
      0.18 ±  4%      +0.0        0.21 ±  2%  perf-profile.children.cycles-pp.__alloc_pages
      0.33            +0.0        0.36 ±  3%  perf-profile.children.cycles-pp.sock_alloc_inode
      0.11 ±  8%      +0.0        0.14 ±  5%  perf-profile.children.cycles-pp.__fget_light
      0.15 ±  2%      +0.0        0.18 ±  2%  perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
      0.44            +0.0        0.47        perf-profile.children.cycles-pp.kmem_cache_alloc_lru
      0.29 ±  4%      +0.0        0.32 ±  2%  perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      0.14 ±  2%      +0.0        0.17 ±  2%  perf-profile.children.cycles-pp.put_cpu_partial
      0.23 ±  6%      +0.0        0.26 ±  5%  perf-profile.children.cycles-pp.tick_sched_timer
      0.32 ±  3%      +0.0        0.35 ±  3%  perf-profile.children.cycles-pp.setsockopt
      0.08 ±  5%      +0.0        0.12 ±  6%  perf-profile.children.cycles-pp.__schedule
      0.07 ±  6%      +0.0        0.10 ±  7%  perf-profile.children.cycles-pp.schedule
      0.16 ±  2%      +0.0        0.19 ±  2%  perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.14 ±  6%      +0.0        0.17 ±  4%  perf-profile.children.cycles-pp.sockfd_lookup_light
      0.29            +0.0        0.32 ±  4%  perf-profile.children.cycles-pp.setup_object
      0.28 ±  4%      +0.0        0.32 ±  3%  perf-profile.children.cycles-pp.hrtimer_interrupt
      0.52            +0.0        0.56        perf-profile.children.cycles-pp.icmp_route_lookup
      0.49            +0.0        0.53 ±  2%  perf-profile.children.cycles-pp.alloc_inode
      0.13 ±  4%      +0.0        0.17 ±  2%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.51            +0.0        0.55 ±  2%  perf-profile.children.cycles-pp.alloc_empty_file
      0.22 ±  2%      +0.0        0.26 ±  3%  perf-profile.children.cycles-pp.__entry_text_start
      0.57            +0.0        0.61        perf-profile.children.cycles-pp.sock_alloc
      0.44            +0.0        0.48        perf-profile.children.cycles-pp.__list_del_entry_valid_or_report
      0.56            +0.0        0.61        perf-profile.children.cycles-pp.new_inode_pseudo
      0.52            +0.0        0.57 ±  2%  perf-profile.children.cycles-pp.alloc_file
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.__ip_finish_output
      0.72 ±  2%      +0.1        0.78 ±  2%  perf-profile.children.cycles-pp.alloc_file_pseudo
      0.15 ±  3%      +0.1        0.20 ±  2%  perf-profile.children.cycles-pp.is_vmalloc_addr
      0.01 ±223%      +0.1        0.06 ±  6%  perf-profile.children.cycles-pp.autoremove_wake_function
      0.73            +0.1        0.78        perf-profile.children.cycles-pp.sock_alloc_file
      0.16 ±  3%      +0.1        0.22 ±  3%  perf-profile.children.cycles-pp.security_socket_recvmsg
      0.20 ±  2%      +0.1        0.27 ±  3%  perf-profile.children.cycles-pp.aa_sk_perm
      1.12 ±  3%      +0.1        1.19 ±  3%  perf-profile.children.cycles-pp.sk_filter_trim_cap
      0.14 ±  3%      +0.1        0.21 ±  5%  perf-profile.children.cycles-pp.__wake_up_common_lock
      0.67            +0.1        0.75        perf-profile.children.cycles-pp.shuffle_freelist
      0.12 ±  4%      +0.1        0.20 ±  4%  perf-profile.children.cycles-pp.__list_add_valid_or_report
      1.24            +0.1        1.33        perf-profile.children.cycles-pp.dst_release
      2.12            +0.1        2.22        perf-profile.children.cycles-pp.kmem_cache_free
      0.89 ±  2%      +0.1        1.00        perf-profile.children.cycles-pp.allocate_slab
      1.72            +0.1        1.84        perf-profile.children.cycles-pp.ipv4_pktinfo_prepare
      2.42 ±  2%      +0.1        2.55 ±  3%  perf-profile.children.cycles-pp.icmp_socket_deliver
      2.41 ±  3%      +0.1        2.54 ±  3%  perf-profile.children.cycles-pp.raw_icmp_error
      2.47 ±  3%      +0.1        2.60 ±  3%  perf-profile.children.cycles-pp.icmp_unreach
      2.70 ±  2%      +0.1        2.85 ±  3%  perf-profile.children.cycles-pp.icmp_rcv
      0.67 ±  2%      +0.2        0.83        perf-profile.children.cycles-pp.copyout
      0.60 ±  2%      +0.2        0.77 ±  2%  perf-profile.children.cycles-pp.check_heap_object
      0.71 ±  2%      +0.2        0.88        perf-profile.children.cycles-pp._copy_to_iter
      0.73 ±  2%      +0.2        0.93 ±  2%  perf-profile.children.cycles-pp.__check_object_size
      0.71 ±  2%      +0.2        0.91 ±  2%  perf-profile.children.cycles-pp.simple_copy_to_iter
      0.82 ±  7%      +0.2        1.03 ± 10%  perf-profile.children.cycles-pp.free_unref_page
      2.39            +0.2        2.63        perf-profile.children.cycles-pp.sock_rfree
      2.99            +0.3        3.28        perf-profile.children.cycles-pp.skb_release_head_state
      1.01            +0.3        1.31        perf-profile.children.cycles-pp.consume_skb
      4.47            +0.4        4.85        perf-profile.children.cycles-pp.skb_release_data
      1.01            +0.4        1.39 ±  3%  perf-profile.children.cycles-pp.__skb_try_recv_datagram
      1.54            +0.4        1.95 ±  2%  perf-profile.children.cycles-pp.__skb_datagram_iter
      1.56 ±  2%      +0.4        1.97 ±  2%  perf-profile.children.cycles-pp.skb_copy_datagram_iter
      1.11 ±  2%      +0.4        1.53 ±  3%  perf-profile.children.cycles-pp.__skb_recv_datagram
      1.12 ±  2%      +0.4        1.55 ±  3%  perf-profile.children.cycles-pp.skb_recv_datagram
      4.89            +0.6        5.45        perf-profile.children.cycles-pp.__copy_skb_header
      4.58            +0.7        5.27        perf-profile.children.cycles-pp.sock_def_readable
     10.70            +0.7       11.42        perf-profile.children.cycles-pp.kfree_skb_reason
      7.65            +0.8        8.40        perf-profile.children.cycles-pp.__skb_clone
     50.92            +0.8       51.69        perf-profile.children.cycles-pp.raw_v4_input
     59.46            +0.8       60.25        perf-profile.children.cycles-pp.do_softirq
     59.48            +0.8       60.28        perf-profile.children.cycles-pp.__local_bh_enable_ip
     85.69            +0.9       86.58        perf-profile.children.cycles-pp.do_syscall_64
     85.79            +0.9       86.70        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     55.46            +1.0       56.44        perf-profile.children.cycles-pp.ip_local_deliver_finish
     55.45            +1.0       56.43        perf-profile.children.cycles-pp.ip_protocol_deliver_rcu
     55.61            +1.0       56.61        perf-profile.children.cycles-pp.__netif_receive_skb_one_core
     55.68            +1.0       56.68        perf-profile.children.cycles-pp.__napi_poll
     55.71            +1.0       56.71        perf-profile.children.cycles-pp.net_rx_action
     55.68            +1.0       56.68        perf-profile.children.cycles-pp.process_backlog
     58.22            +1.0       59.22        perf-profile.children.cycles-pp.__dev_queue_xmit
     58.26            +1.0       59.26        perf-profile.children.cycles-pp.ip_finish_output2
     59.52            +1.0       60.54        perf-profile.children.cycles-pp.raw_sendmsg
     58.65            +1.0       59.68        perf-profile.children.cycles-pp.raw_send_hdrinc
     59.58            +1.0       60.61        perf-profile.children.cycles-pp.sock_sendmsg
     59.71            +1.0       60.75        perf-profile.children.cycles-pp.__x64_sys_sendto
     59.69            +1.0       60.74        perf-profile.children.cycles-pp.__sys_sendto
     59.87            +1.1       60.93        perf-profile.children.cycles-pp.sendto
     22.30            +2.3       24.61        perf-profile.children.cycles-pp.__sock_queue_rcv_skb
     23.78            +2.4       26.16        perf-profile.children.cycles-pp.sock_queue_rcv_skb_reason
     29.56            +2.7       32.22        perf-profile.children.cycles-pp.raw_rcv
     28.12            -5.9       22.26 ±  2%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.64            -0.5        0.12 ±  3%  perf-profile.self.cycles-pp.__unfreeze_partials
      0.45            -0.2        0.26 ±  3%  perf-profile.self.cycles-pp.get_partial_node
      0.05            +0.0        0.06        perf-profile.self.cycles-pp.__dev_queue_xmit
      0.05            +0.0        0.06        perf-profile.self.cycles-pp.__cond_resched
      0.05            +0.0        0.06        perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.06 ±  6%      +0.0        0.07        perf-profile.self.cycles-pp.__check_object_size
      0.13            +0.0        0.14 ±  2%  perf-profile.self.cycles-pp.apparmor_capable
      0.10 ±  3%      +0.0        0.12 ±  4%  perf-profile.self.cycles-pp.__sk_destruct
      0.37            +0.0        0.39        perf-profile.self.cycles-pp.sock_queue_rcv_skb_reason
      0.15 ±  2%      +0.0        0.16 ±  2%  perf-profile.self.cycles-pp._raw_spin_trylock
      0.06 ±  7%      +0.0        0.08        perf-profile.self.cycles-pp.__virt_addr_valid
      0.43            +0.0        0.44        perf-profile.self.cycles-pp.memcg_slab_post_alloc_hook
      0.19 ±  3%      +0.0        0.21        perf-profile.self.cycles-pp.ip_route_output_key_hash_rcu
      0.10 ±  4%      +0.0        0.12 ±  3%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.06 ±  9%      +0.0        0.07 ±  5%  perf-profile.self.cycles-pp.is_vmalloc_addr
      0.12 ±  3%      +0.0        0.14        perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
      0.30            +0.0        0.32 ±  2%  perf-profile.self.cycles-pp.apparmor_socket_post_create
      0.06 ±  9%      +0.0        0.08 ±  6%  perf-profile.self.cycles-pp.stress_rawudp_server
      0.41            +0.0        0.43        perf-profile.self.cycles-pp.skb_clone
      0.24            +0.0        0.26        perf-profile.self.cycles-pp._raw_spin_lock_bh
      0.07 ±  5%      +0.0        0.10 ±  5%  perf-profile.self.cycles-pp.__skb_try_recv_datagram
      0.45            +0.0        0.48 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock
      0.36 ±  2%      +0.0        0.38        perf-profile.self.cycles-pp.inet_sock_destruct
      0.10 ±  8%      +0.0        0.12 ±  6%  perf-profile.self.cycles-pp.__fget_light
      0.14 ±  3%      +0.0        0.16 ±  2%  perf-profile.self.cycles-pp.put_cpu_partial
      0.69            +0.0        0.72        perf-profile.self.cycles-pp.raw_rcv
      0.03 ± 70%      +0.0        0.06 ±  6%  perf-profile.self.cycles-pp.__netif_receive_skb_core
      0.15 ±  2%      +0.0        0.18 ±  4%  perf-profile.self.cycles-pp.get_any_partial
      0.11 ±  3%      +0.0        0.15 ±  3%  perf-profile.self.cycles-pp.recv
      0.16 ±  3%      +0.0        0.19 ±  4%  perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
      0.12 ±  5%      +0.0        0.16 ±  3%  perf-profile.self.cycles-pp.__skb_datagram_iter
      0.47            +0.0        0.51        perf-profile.self.cycles-pp.skb_release_head_state
      0.43            +0.0        0.47 ±  2%  perf-profile.self.cycles-pp.__list_del_entry_valid_or_report
      0.00            +0.1        0.05        perf-profile.self.cycles-pp._copy_to_iter
      0.00            +0.1        0.05 ±  7%  perf-profile.self.cycles-pp.__skb_recv_datagram
      0.45 ±  2%      +0.1        0.51 ±  2%  perf-profile.self.cycles-pp.shuffle_freelist
      1.12 ±  3%      +0.1        1.18 ±  5%  perf-profile.self.cycles-pp.raw_v4_input
      0.16 ±  2%      +0.1        0.22 ±  2%  perf-profile.self.cycles-pp.aa_sk_perm
      0.20 ±  2%      +0.1        0.26 ±  3%  perf-profile.self.cycles-pp.__sys_recvfrom
      0.11 ±  5%      +0.1        0.19 ±  3%  perf-profile.self.cycles-pp.__list_add_valid_or_report
      1.18            +0.1        1.26        perf-profile.self.cycles-pp.dst_release
      1.91            +0.1        2.00        perf-profile.self.cycles-pp.kmem_cache_free
      0.44            +0.1        0.57 ±  2%  perf-profile.self.cycles-pp.check_heap_object
      2.40 ±  3%      +0.1        2.53 ±  3%  perf-profile.self.cycles-pp.raw_icmp_error
      0.64 ±  2%      +0.2        0.79        perf-profile.self.cycles-pp.copyout
      2.25            +0.2        2.42        perf-profile.self.cycles-pp.__slab_free
      0.60 ±  2%      +0.2        0.78 ±  3%  perf-profile.self.cycles-pp.raw_recvmsg
      2.75            +0.2        2.94        perf-profile.self.cycles-pp.__skb_clone
      2.36            +0.2        2.60        perf-profile.self.cycles-pp.sock_rfree
      4.24            +0.4        4.59        perf-profile.self.cycles-pp.kfree_skb_reason
      4.24            +0.4        4.60        perf-profile.self.cycles-pp.skb_release_data
      2.52            +0.4        2.91        perf-profile.self.cycles-pp._raw_spin_lock_irqsave
     12.54            +0.6       13.09        perf-profile.self.cycles-pp.__sock_queue_rcv_skb
      4.83            +0.6        5.38        perf-profile.self.cycles-pp.__copy_skb_header
      4.40            +0.6        5.03        perf-profile.self.cycles-pp.sock_def_readable
      0.91            +0.7        1.60        perf-profile.self.cycles-pp.___slab_alloc




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-02  3:23 ` [PATCH v5 6/9] slub: Delay freezing of partial slabs chengming.zhou
  2023-11-14  5:44   ` kernel test robot
@ 2023-11-20 18:49   ` Mark Brown
  2023-11-21  0:58     ` Chengming Zhou
  2023-11-22  9:37     ` Vlastimil Babka
  2023-12-03  6:53   ` Hyeonggon Yoo
  2 siblings, 2 replies; 44+ messages in thread
From: Mark Brown @ 2023-11-20 18:49 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, 42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou

[-- Attachment #1: Type: text/plain, Size: 6629 bytes --]

On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Now we will freeze slabs when moving them out of node partial list to
> cpu partial list, this method needs two cmpxchg_double operations:
> 
> 1. freeze slab (acquire_slab()) under the node list_lock
> 2. get_freelist() when pick used in ___slab_alloc()

Recently -next has been failing to boot on a Raspberry Pi 3 with an arm
multi_v7_defconfig and a NFS rootfs, a bisect appears to point to this
patch (in -next as c8d312e039030edab25836a326bcaeb2a3d4db14) as having
introduced the issue.  I've included the full bisect log below.

When we see problems we see RCU stalls while logging in, for example:

debian-testing-armhf login: root (automatic login)
Linux debian-testing-armhf 6.7.0-rc1-00006-gc8d312e03903 #1 SMP @1699864348 armv7l
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
[   46.453323] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   46.459361] rcu: 	3-...0: (1 GPs behind) idle=def4/1/0x40000000 softirq=1304/1304 fqs=951
[   46.467669] rcu: 	(detected by 0, t=2103 jiffies, g=1161, q=499 ncpus=4)
[   46.474472] Sending NMI from CPU 0 to CPUs 3:
[   56.478894] rcu: rcu_sched kthread timer wakeup didn't happen for 1002 jiffies! g1161 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   56.490195] rcu: 	Possible timer handling issue on cpu=0 timer-softirq=1650
[   56.497259] rcu: rcu_sched kthread starved for 1005 jiffies! g1161 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[   56.507589] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   56.516681] rcu: RCU grace-period kthread stack dump:
[   56.521803] task:rcu_sched       state:I stack:0     pid:13    tgid:13    ppid:2      flags:0x00000000
[   56.531267]  __schedule from schedule+0x20/0xe8
[   56.535883]  schedule from schedule_timeout+0xa0/0x158
[   56.541111]  schedule_timeout from rcu_gp_fqs_loop+0x104/0x594
[   56.547048]  rcu_gp_fqs_loop from rcu_gp_kthread+0x14c/0x1c0
[   56.552801]  rcu_gp_kthread from kthread+0xe0/0xfc
[   56.557674]  kthread from ret_from_fork+0x14/0x28
[   56.562457] Exception stack(0xf084dfb0 to 0xf084dff8)
[   56.567584] dfa0:                                     00000000 00000000 00000000 00000000
[   56.575886] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[   56.584191] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[   56.590907] rcu: Stack dump where RCU GP kthread last ran:
[   56.596474] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc1-00006-gc8d312e03903 #1
[   56.604515] Hardware name: BCM2835
[   56.607965] PC is at default_idle_call+0x1c/0xb0
[   56.612654] LR is at ct_kernel_enter.constprop.0+0x48/0x11c
[   56.618311] pc : [<c1197054>]    lr : [<c1195c98>]    psr: 60010013
[   56.624672] sp : c1b01f70  ip : c1d5af7c  fp : 00000000
[   56.629974] r10: c19cda60  r9 : 00000000  r8 : 00000000
[   56.635277] r7 : c1b04d50  r6 : c1b04d18  r5 : c1d5b684  r4 : c1b09740
[   56.641902] r3 : 00000000  r2 : 00000000  r1 : 00000001  r0 : 002a3114
[   56.648528] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   56.655774] Control: 10c5383d  Table: 0237406a  DAC: 00000051
[   56.661605]  default_idle_call from do_idle+0x208/0x290
[   56.666920]  do_idle from cpu_startup_entry+0x28/0x2c
[   56.672059]  cpu_startup_entry from rest_init+0xac/0xb0
[   56.677371]  rest_init from arch_post_acpi_subsys_init+0x0/0x8
Login ti

A full log for that run can be seen at:

   https://validation.linaro.org/scheduler/job/4017095

Boots to initramfs with the same kernel image seem fine.  Other systems,
including other 32 bit arm ones, don't seem to be having similar issues
with this userspace.  I've not investigated beyond running the bisect,
the log for which is below:

git bisect start
# good: [64e6d94bfb47ed0732ad06aedf8ec6af5dd2ab84] Merge branch 'for-linux-next-fixes' of git://anongit.freedesktop.org/drm/drm-misc
git bisect good 64e6d94bfb47ed0732ad06aedf8ec6af5dd2ab84
# bad: [5a82d69d48c82e89aef44483d2a129f869f3506a] Add linux-next specific files for 20231120
git bisect bad 5a82d69d48c82e89aef44483d2a129f869f3506a
# good: [ce252a92a867da8a6622463bff637e5f7b904a46] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git
git bisect good ce252a92a867da8a6622463bff637e5f7b904a46
# good: [c22e026efad504e3b056d4436920d173a09c580e] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git
git bisect good c22e026efad504e3b056d4436920d173a09c580e
# good: [b7fc58ffb105470cb339163cc2b04e3f59387a45] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/fpga/linux-fpga.git
git bisect good b7fc58ffb105470cb339163cc2b04e3f59387a45
# good: [26f89f0614f03e4016578a992fc2e86b048a5cb4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl.git
git bisect good 26f89f0614f03e4016578a992fc2e86b048a5cb4
# good: [602bf18307981f3bfd9ebf19921791a4256d3fd1] Merge branch 'for-6.7' into for-next
git bisect good 602bf18307981f3bfd9ebf19921791a4256d3fd1
# good: [9f16a68069822b1df6bfb8a9ef7258a1e32b25e7] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
git bisect good 9f16a68069822b1df6bfb8a9ef7258a1e32b25e7
# good: [3ff57db6f6569ebc2cc333437e7e949749e59424] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git
git bisect good 3ff57db6f6569ebc2cc333437e7e949749e59424
# bad: [dd374e220ba492f95344a638b1efe5b2744fdd73] slub: Update frozen slabs documentations in the source
git bisect bad dd374e220ba492f95344a638b1efe5b2744fdd73
# good: [a3058965bb35490454953aa2c87ea51004839f2f] slub: Prepare __slab_free() for unfrozen partial slab out of node partial list
git bisect good a3058965bb35490454953aa2c87ea51004839f2f
# bad: [c8d312e039030edab25836a326bcaeb2a3d4db14] slub: Delay freezing of partial slabs
git bisect bad c8d312e039030edab25836a326bcaeb2a3d4db14
# good: [00b15a19ee543f0117cb217fcbab8b7b3fd50677] slub: Introduce freeze_slab()
git bisect good 00b15a19ee543f0117cb217fcbab8b7b3fd50677
# first bad commit: [c8d312e039030edab25836a326bcaeb2a3d4db14] slub: Delay freezing of partial slabs

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-20 18:49   ` Mark Brown
@ 2023-11-21  0:58     ` Chengming Zhou
  2023-11-21  1:29       ` Mark Brown
  2023-11-22  9:37     ` Vlastimil Babka
  1 sibling, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2023-11-21  0:58 UTC (permalink / raw)
  To: Mark Brown
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, 42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou

On 2023/11/21 02:49, Mark Brown wrote:
> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> Now we will freeze slabs when moving them out of node partial list to
>> cpu partial list, this method needs two cmpxchg_double operations:
>>
>> 1. freeze slab (acquire_slab()) under the node list_lock
>> 2. get_freelist() when pick used in ___slab_alloc()
> 
> Recently -next has been failing to boot on a Raspberry Pi 3 with an arm
> multi_v7_defconfig and a NFS rootfs, a bisect appears to point to this
> patch (in -next as c8d312e039030edab25836a326bcaeb2a3d4db14) as having
> introduced the issue.  I've included the full bisect log below.
> 
> When we see problems we see RCU stalls while logging in, for example:
> 
> debian-testing-armhf login: root (automatic login)
> Linux debian-testing-armhf 6.7.0-rc1-00006-gc8d312e03903 #1 SMP @1699864348 armv7l
> The programs included with the Debian GNU/Linux system are free software;
> the exact distribution terms for each program are described in the
> individual files in /usr/share/doc/*/copyright.
> Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
> permitted by applicable law.
> [   46.453323] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> [   46.459361] rcu: 	3-...0: (1 GPs behind) idle=def4/1/0x40000000 softirq=1304/1304 fqs=951
> [   46.467669] rcu: 	(detected by 0, t=2103 jiffies, g=1161, q=499 ncpus=4)
> [   46.474472] Sending NMI from CPU 0 to CPUs 3:

IIUC, here should print the backtrace of CPU 3, right? It looks like CPU 3 is the cause,
but we couldn't see what it's doing from the log.

Thanks!

> [   56.478894] rcu: rcu_sched kthread timer wakeup didn't happen for 1002 jiffies! g1161 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [   56.490195] rcu: 	Possible timer handling issue on cpu=0 timer-softirq=1650
> [   56.497259] rcu: rcu_sched kthread starved for 1005 jiffies! g1161 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
> [   56.507589] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [   56.516681] rcu: RCU grace-period kthread stack dump:
> [   56.521803] task:rcu_sched       state:I stack:0     pid:13    tgid:13    ppid:2      flags:0x00000000
> [   56.531267]  __schedule from schedule+0x20/0xe8
> [   56.535883]  schedule from schedule_timeout+0xa0/0x158
> [   56.541111]  schedule_timeout from rcu_gp_fqs_loop+0x104/0x594
> [   56.547048]  rcu_gp_fqs_loop from rcu_gp_kthread+0x14c/0x1c0
> [   56.552801]  rcu_gp_kthread from kthread+0xe0/0xfc
> [   56.557674]  kthread from ret_from_fork+0x14/0x28
> [   56.562457] Exception stack(0xf084dfb0 to 0xf084dff8)
> [   56.567584] dfa0:                                     00000000 00000000 00000000 00000000
> [   56.575886] dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [   56.584191] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [   56.590907] rcu: Stack dump where RCU GP kthread last ran:
> [   56.596474] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.7.0-rc1-00006-gc8d312e03903 #1
> [   56.604515] Hardware name: BCM2835
> [   56.607965] PC is at default_idle_call+0x1c/0xb0
> [   56.612654] LR is at ct_kernel_enter.constprop.0+0x48/0x11c
> [   56.618311] pc : [<c1197054>]    lr : [<c1195c98>]    psr: 60010013
> [   56.624672] sp : c1b01f70  ip : c1d5af7c  fp : 00000000
> [   56.629974] r10: c19cda60  r9 : 00000000  r8 : 00000000
> [   56.635277] r7 : c1b04d50  r6 : c1b04d18  r5 : c1d5b684  r4 : c1b09740
> [   56.641902] r3 : 00000000  r2 : 00000000  r1 : 00000001  r0 : 002a3114
> [   56.648528] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [   56.655774] Control: 10c5383d  Table: 0237406a  DAC: 00000051
> [   56.661605]  default_idle_call from do_idle+0x208/0x290
> [   56.666920]  do_idle from cpu_startup_entry+0x28/0x2c
> [   56.672059]  cpu_startup_entry from rest_init+0xac/0xb0
> [   56.677371]  rest_init from arch_post_acpi_subsys_init+0x0/0x8
> Login ti
> 
> A full log for that run can be seen at:
> 
>    https://validation.linaro.org/scheduler/job/4017095
> 
> Boots to initramfs with the same kernel image seem fine.  Other systems,
> including other 32 bit arm ones, don't seem to be having similar issues
> with this userspace.  I've not investigated beyond running the bisect,
> the log for which is below:
> 
> git bisect start
> # good: [64e6d94bfb47ed0732ad06aedf8ec6af5dd2ab84] Merge branch 'for-linux-next-fixes' of git://anongit.freedesktop.org/drm/drm-misc
> git bisect good 64e6d94bfb47ed0732ad06aedf8ec6af5dd2ab84
> # bad: [5a82d69d48c82e89aef44483d2a129f869f3506a] Add linux-next specific files for 20231120
> git bisect bad 5a82d69d48c82e89aef44483d2a129f869f3506a
> # good: [ce252a92a867da8a6622463bff637e5f7b904a46] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git
> git bisect good ce252a92a867da8a6622463bff637e5f7b904a46
> # good: [c22e026efad504e3b056d4436920d173a09c580e] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator.git
> git bisect good c22e026efad504e3b056d4436920d173a09c580e
> # good: [b7fc58ffb105470cb339163cc2b04e3f59387a45] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/fpga/linux-fpga.git
> git bisect good b7fc58ffb105470cb339163cc2b04e3f59387a45
> # good: [26f89f0614f03e4016578a992fc2e86b048a5cb4] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl.git
> git bisect good 26f89f0614f03e4016578a992fc2e86b048a5cb4
> # good: [602bf18307981f3bfd9ebf19921791a4256d3fd1] Merge branch 'for-6.7' into for-next
> git bisect good 602bf18307981f3bfd9ebf19921791a4256d3fd1
> # good: [9f16a68069822b1df6bfb8a9ef7258a1e32b25e7] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching
> git bisect good 9f16a68069822b1df6bfb8a9ef7258a1e32b25e7
> # good: [3ff57db6f6569ebc2cc333437e7e949749e59424] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git
> git bisect good 3ff57db6f6569ebc2cc333437e7e949749e59424
> # bad: [dd374e220ba492f95344a638b1efe5b2744fdd73] slub: Update frozen slabs documentations in the source
> git bisect bad dd374e220ba492f95344a638b1efe5b2744fdd73
> # good: [a3058965bb35490454953aa2c87ea51004839f2f] slub: Prepare __slab_free() for unfrozen partial slab out of node partial list
> git bisect good a3058965bb35490454953aa2c87ea51004839f2f
> # bad: [c8d312e039030edab25836a326bcaeb2a3d4db14] slub: Delay freezing of partial slabs
> git bisect bad c8d312e039030edab25836a326bcaeb2a3d4db14
> # good: [00b15a19ee543f0117cb217fcbab8b7b3fd50677] slub: Introduce freeze_slab()
> git bisect good 00b15a19ee543f0117cb217fcbab8b7b3fd50677
> # first bad commit: [c8d312e039030edab25836a326bcaeb2a3d4db14] slub: Delay freezing of partial slabs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-21  0:58     ` Chengming Zhou
@ 2023-11-21  1:29       ` Mark Brown
  2023-11-21 15:47         ` Chengming Zhou
  0 siblings, 1 reply; 44+ messages in thread
From: Mark Brown @ 2023-11-21  1:29 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, 42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou

[-- Attachment #1: Type: text/plain, Size: 1193 bytes --]

On Tue, Nov 21, 2023 at 08:58:40AM +0800, Chengming Zhou wrote:
> On 2023/11/21 02:49, Mark Brown wrote:
> > On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:

> > When we see problems we see RCU stalls while logging in, for example:

> > [   46.453323] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
> > [   46.459361] rcu: 	3-...0: (1 GPs behind) idle=def4/1/0x40000000 softirq=1304/1304 fqs=951
> > [   46.467669] rcu: 	(detected by 0, t=2103 jiffies, g=1161, q=499 ncpus=4)
> > [   46.474472] Sending NMI from CPU 0 to CPUs 3:

> IIUC, here should print the backtrace of CPU 3, right? It looks like CPU 3 is the cause,
> but we couldn't see what it's doing from the log.

AIUI yes, but it looks like we've just completely lost the CPU - there's
more attempts to talk to it visible in the log:

> > A full log for that run can be seen at:
> > 
> >    https://validation.linaro.org/scheduler/job/4017095

but none of them appear to cause CPU 3 to respond.  Note that 32 bit ARM
is just using a regular IPI rather than something that's actually a NMI
so this isn't hugely out of the ordinary, I'd guess it's stuck with
interrupts masked.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-21  1:29       ` Mark Brown
@ 2023-11-21 15:47         ` Chengming Zhou
  2023-11-21 18:21           ` Mark Brown
  0 siblings, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2023-11-21 15:47 UTC (permalink / raw)
  To: Mark Brown
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, 42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou

On 2023/11/21 09:29, Mark Brown wrote:
> On Tue, Nov 21, 2023 at 08:58:40AM +0800, Chengming Zhou wrote:
>> On 2023/11/21 02:49, Mark Brown wrote:
>>> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:
> 
>>> When we see problems we see RCU stalls while logging in, for example:
> 
>>> [   46.453323] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>>> [   46.459361] rcu: 	3-...0: (1 GPs behind) idle=def4/1/0x40000000 softirq=1304/1304 fqs=951
>>> [   46.467669] rcu: 	(detected by 0, t=2103 jiffies, g=1161, q=499 ncpus=4)
>>> [   46.474472] Sending NMI from CPU 0 to CPUs 3:
> 
>> IIUC, here should print the backtrace of CPU 3, right? It looks like CPU 3 is the cause,
>> but we couldn't see what it's doing from the log.
> 
> AIUI yes, but it looks like we've just completely lost the CPU - there's
> more attempts to talk to it visible in the log:
> 
>>> A full log for that run can be seen at:
>>>
>>>    https://validation.linaro.org/scheduler/job/4017095
> 
> but none of them appear to cause CPU 3 to respond.  Note that 32 bit ARM
> is just using a regular IPI rather than something that's actually a NMI
> so this isn't hugely out of the ordinary, I'd guess it's stuck with
> interrupts masked.

Ah yes, there is no NMI on ARM, so CPU 3 maybe running somewhere with
interrupts disabled. I searched the full log, but still haven't a clue.
And there is no any WARNING or BUG related to SLUB in the log.

I wonder how to reproduce it locally with a Qemu VM since I don't have
the ARM machine.

Thanks!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-21 15:47         ` Chengming Zhou
@ 2023-11-21 18:21           ` Mark Brown
  2023-11-22  8:52             ` Vlastimil Babka
  0 siblings, 1 reply; 44+ messages in thread
From: Mark Brown @ 2023-11-21 18:21 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, 42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou

[-- Attachment #1: Type: text/plain, Size: 1434 bytes --]

On Tue, Nov 21, 2023 at 11:47:26PM +0800, Chengming Zhou wrote:

> Ah yes, there is no NMI on ARM, so CPU 3 maybe running somewhere with
> interrupts disabled. I searched the full log, but still haven't a clue.
> And there is no any WARNING or BUG related to SLUB in the log.

Yeah, nor anything else particularly.  I tried turning on some debug
options:

CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_DETECT_HUNG_TASK=y
CONFIG_WQ_WATCHDOG=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_DEBUG_LOCKING=y
CONFIG_DEBUG_ATOMIC_SLEEP=y

https://validation.linaro.org/scheduler/job/4017828

which has some additional warnings related to clock changes but AFAICT
those come from today's -next rather than the debug stuff:

https://validation.linaro.org/scheduler/job/4017823

so that's not super helpful.

> I wonder how to reproduce it locally with a Qemu VM since I don't have
> the ARM machine.

There's sample qemu jobs available from for example KernelCI:

   https://storage.kernelci.org/next/master/next-20231120/arm/multi_v7_defconfig/gcc-10/lab-baylibre/baseline-qemu_arm-virt-gicv3.html

(includes the command line, though it's not using Debian testing like my
test was).  Note that I'm testing a bunch of platforms with the same
kernel/rootfs combination and it was only the Raspberry Pi 3 which blew
up.  It is a bit tight for memory which might have some influence?

I'm really suspecting this may have made some underlying platform bug
more obvious :/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-21 18:21           ` Mark Brown
@ 2023-11-22  8:52             ` Vlastimil Babka
  0 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2023-11-22  8:52 UTC (permalink / raw)
  To: Mark Brown, Chengming Zhou
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou

On 11/21/23 19:21, Mark Brown wrote:
> On Tue, Nov 21, 2023 at 11:47:26PM +0800, Chengming Zhou wrote:
> 
>> Ah yes, there is no NMI on ARM, so CPU 3 maybe running somewhere with
>> interrupts disabled. I searched the full log, but still haven't a clue.
>> And there is no any WARNING or BUG related to SLUB in the log.
> 
> Yeah, nor anything else particularly.  I tried turning on some debug
> options:
> 
> CONFIG_SOFTLOCKUP_DETECTOR=y
> CONFIG_DETECT_HUNG_TASK=y
> CONFIG_WQ_WATCHDOG=y
> CONFIG_DEBUG_PREEMPT=y
> CONFIG_DEBUG_LOCKING=y
> CONFIG_DEBUG_ATOMIC_SLEEP=y
> 
> https://validation.linaro.org/scheduler/job/4017828
> 
> which has some additional warnings related to clock changes but AFAICT
> those come from today's -next rather than the debug stuff:
> 
> https://validation.linaro.org/scheduler/job/4017823
> 
> so that's not super helpful.

For the record (and to help debugging focus) on IRC we discussed that with
CONFIG_SLUB_CPU_PARTIAL=n the problem persists:
https://validation.linaro.org/scheduler/job/4017863
Which limits the scope of where to look so that's good :)

>> I wonder how to reproduce it locally with a Qemu VM since I don't have
>> the ARM machine.
> 
> There's sample qemu jobs available from for example KernelCI:
> 
>    https://storage.kernelci.org/next/master/next-20231120/arm/multi_v7_defconfig/gcc-10/lab-baylibre/baseline-qemu_arm-virt-gicv3.html
> 
> (includes the command line, though it's not using Debian testing like my
> test was).  Note that I'm testing a bunch of platforms with the same
> kernel/rootfs combination and it was only the Raspberry Pi 3 which blew
> up.  It is a bit tight for memory which might have some influence?
> 
> I'm really suspecting this may have made some underlying platform bug
> more obvious :/



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-20 18:49   ` Mark Brown
  2023-11-21  0:58     ` Chengming Zhou
@ 2023-11-22  9:37     ` Vlastimil Babka
  2023-11-22 11:27       ` Mark Brown
  2023-11-22 11:35       ` Chengming Zhou
  1 sibling, 2 replies; 44+ messages in thread
From: Vlastimil Babka @ 2023-11-22  9:37 UTC (permalink / raw)
  To: Mark Brown, chengming.zhou
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou, Matthew Wilcox

On 11/20/23 19:49, Mark Brown wrote:
> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>> 
>> Now we will freeze slabs when moving them out of node partial list to
>> cpu partial list, this method needs two cmpxchg_double operations:
>> 
>> 1. freeze slab (acquire_slab()) under the node list_lock
>> 2. get_freelist() when pick used in ___slab_alloc()
> 
> Recently -next has been failing to boot on a Raspberry Pi 3 with an arm
> multi_v7_defconfig and a NFS rootfs, a bisect appears to point to this
> patch (in -next as c8d312e039030edab25836a326bcaeb2a3d4db14) as having
> introduced the issue.  I've included the full bisect log below.
> 
> When we see problems we see RCU stalls while logging in, for example:

Can you try this, please?

----8<----
From 000030c1ff055ef6a2ca624d0142f08f3ef19d51 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 22 Nov 2023 10:32:41 +0100
Subject: [PATCH] mm/slub: try to fix hangs without cmpxchg64/128

If we don't have cmpxchg64/128 and resort to slab_lock()/slab_unlock()
which uses PG_locked, we can get RMW with the newly introduced
slab_set/clear_node_partial() operation that modify PG_workingset so all
the operations have to be atomic now.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index bcb5b2c4e213..f2cdb81ab02e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -522,7 +522,7 @@ static __always_inline void slab_unlock(struct slab *slab)
 	struct page *page = slab_page(slab);
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
-	__bit_spin_unlock(PG_locked, &page->flags);
+	bit_spin_unlock(PG_locked, &page->flags);
 }
 
 static inline bool
@@ -2127,12 +2127,12 @@ static inline bool slab_test_node_partial(const struct slab *slab)
 
 static inline void slab_set_node_partial(struct slab *slab)
 {
-	__set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+	set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
 }
 
 static inline void slab_clear_node_partial(struct slab *slab)
 {
-	__clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+	clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
 }
 
 /*
-- 
2.42.1





^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-22  9:37     ` Vlastimil Babka
@ 2023-11-22 11:27       ` Mark Brown
  2023-11-22 11:35       ` Chengming Zhou
  1 sibling, 0 replies; 44+ messages in thread
From: Mark Brown @ 2023-11-22 11:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: chengming.zhou, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, 42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou,
	Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 578 bytes --]

On Wed, Nov 22, 2023 at 10:37:39AM +0100, Vlastimil Babka wrote:

> Can you try this, please?

> Subject: [PATCH] mm/slub: try to fix hangs without cmpxchg64/128
> 
> If we don't have cmpxchg64/128 and resort to slab_lock()/slab_unlock()
> which uses PG_locked, we can get RMW with the newly introduced
> slab_set/clear_node_partial() operation that modify PG_workingset so all
> the operations have to be atomic now.

That seems to resolve the issue:

  https://validation.linaro.org/scheduler/job/4018096

Tested-by: Mark Brown <broonie@kernel.org>

Thanks!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-22  9:37     ` Vlastimil Babka
  2023-11-22 11:27       ` Mark Brown
@ 2023-11-22 11:35       ` Chengming Zhou
  2023-11-22 11:40         ` Vlastimil Babka
  1 sibling, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2023-11-22 11:35 UTC (permalink / raw)
  To: Vlastimil Babka, Mark Brown
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou, Matthew Wilcox

On 2023/11/22 17:37, Vlastimil Babka wrote:
> On 11/20/23 19:49, Mark Brown wrote:
>> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:
>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>
>>> Now we will freeze slabs when moving them out of node partial list to
>>> cpu partial list, this method needs two cmpxchg_double operations:
>>>
>>> 1. freeze slab (acquire_slab()) under the node list_lock
>>> 2. get_freelist() when pick used in ___slab_alloc()
>>
>> Recently -next has been failing to boot on a Raspberry Pi 3 with an arm
>> multi_v7_defconfig and a NFS rootfs, a bisect appears to point to this
>> patch (in -next as c8d312e039030edab25836a326bcaeb2a3d4db14) as having
>> introduced the issue.  I've included the full bisect log below.
>>
>> When we see problems we see RCU stalls while logging in, for example:
> 
> Can you try this, please?
> 

Great! I manually disabled __CMPXCHG_DOUBLE to reproduce the problem,
and this patch can solve the machine hang problem.

BTW, I also did the performance testcase on the machine with 128 CPUs.

stress-ng --rawpkt 128 --rawpkt-ops 100000000

base    patched
2.22s   2.35s
2.21s   3.14s
2.19s   4.75s

Found this atomic version performance numbers are not stable.

Should I change back to reuse the slab->__unused (mapcount) field?

Or should we check "s->flags & __CMPXCHG_DOUBLE" in slab_set/clear_node_partial()
to avoid using the atomic version?

Thanks!

> ----8<----
> From 000030c1ff055ef6a2ca624d0142f08f3ef19d51 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 22 Nov 2023 10:32:41 +0100
> Subject: [PATCH] mm/slub: try to fix hangs without cmpxchg64/128
> 
> If we don't have cmpxchg64/128 and resort to slab_lock()/slab_unlock()
> which uses PG_locked, we can get RMW with the newly introduced
> slab_set/clear_node_partial() operation that modify PG_workingset so all
> the operations have to be atomic now.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/slub.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index bcb5b2c4e213..f2cdb81ab02e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -522,7 +522,7 @@ static __always_inline void slab_unlock(struct slab *slab)
>  	struct page *page = slab_page(slab);
>  
>  	VM_BUG_ON_PAGE(PageTail(page), page);
> -	__bit_spin_unlock(PG_locked, &page->flags);
> +	bit_spin_unlock(PG_locked, &page->flags);
>  }
>  
>  static inline bool
> @@ -2127,12 +2127,12 @@ static inline bool slab_test_node_partial(const struct slab *slab)
>  
>  static inline void slab_set_node_partial(struct slab *slab)
>  {
> -	__set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
> +	set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
>  }
>  
>  static inline void slab_clear_node_partial(struct slab *slab)
>  {
> -	__clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
> +	clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
>  }
>  
>  /*


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-22 11:35       ` Chengming Zhou
@ 2023-11-22 11:40         ` Vlastimil Babka
  2023-11-22 11:54           ` Chengming Zhou
  0 siblings, 1 reply; 44+ messages in thread
From: Vlastimil Babka @ 2023-11-22 11:40 UTC (permalink / raw)
  To: Chengming Zhou, Mark Brown
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou, Matthew Wilcox

On 11/22/23 12:35, Chengming Zhou wrote:
> On 2023/11/22 17:37, Vlastimil Babka wrote:
>> On 11/20/23 19:49, Mark Brown wrote:
>>> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:
>>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>>
>>>> Now we will freeze slabs when moving them out of node partial list to
>>>> cpu partial list, this method needs two cmpxchg_double operations:
>>>>
>>>> 1. freeze slab (acquire_slab()) under the node list_lock
>>>> 2. get_freelist() when pick used in ___slab_alloc()
>>>
>>> Recently -next has been failing to boot on a Raspberry Pi 3 with an arm
>>> multi_v7_defconfig and a NFS rootfs, a bisect appears to point to this
>>> patch (in -next as c8d312e039030edab25836a326bcaeb2a3d4db14) as having
>>> introduced the issue.  I've included the full bisect log below.
>>>
>>> When we see problems we see RCU stalls while logging in, for example:
>> 
>> Can you try this, please?
>> 
> 
> Great! I manually disabled __CMPXCHG_DOUBLE to reproduce the problem,
> and this patch can solve the machine hang problem.
> 
> BTW, I also did the performance testcase on the machine with 128 CPUs.
> 
> stress-ng --rawpkt 128 --rawpkt-ops 100000000
> 
> base    patched
> 2.22s   2.35s
> 2.21s   3.14s
> 2.19s   4.75s
> 
> Found this atomic version performance numbers are not stable.

That's weirdly too bad. Is that measured also with __CMPXCHG_DOUBLE
disabled, or just the patch? The PG_workingset flag change should be
uncontended as we are doing it under list_lock, and with __CMPXCHG_DOUBLE
there should be no interfering PG_locked interference.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-22 11:40         ` Vlastimil Babka
@ 2023-11-22 11:54           ` Chengming Zhou
  2023-11-22 13:19             ` Vlastimil Babka
  0 siblings, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2023-11-22 11:54 UTC (permalink / raw)
  To: Vlastimil Babka, Mark Brown
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou, Matthew Wilcox

On 2023/11/22 19:40, Vlastimil Babka wrote:
> On 11/22/23 12:35, Chengming Zhou wrote:
>> On 2023/11/22 17:37, Vlastimil Babka wrote:
>>> On 11/20/23 19:49, Mark Brown wrote:
>>>> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:
>>>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>>>
>>>>> Now we will freeze slabs when moving them out of node partial list to
>>>>> cpu partial list, this method needs two cmpxchg_double operations:
>>>>>
>>>>> 1. freeze slab (acquire_slab()) under the node list_lock
>>>>> 2. get_freelist() when pick used in ___slab_alloc()
>>>>
>>>> Recently -next has been failing to boot on a Raspberry Pi 3 with an arm
>>>> multi_v7_defconfig and a NFS rootfs, a bisect appears to point to this
>>>> patch (in -next as c8d312e039030edab25836a326bcaeb2a3d4db14) as having
>>>> introduced the issue.  I've included the full bisect log below.
>>>>
>>>> When we see problems we see RCU stalls while logging in, for example:
>>>
>>> Can you try this, please?
>>>
>>
>> Great! I manually disabled __CMPXCHG_DOUBLE to reproduce the problem,
>> and this patch can solve the machine hang problem.
>>
>> BTW, I also did the performance testcase on the machine with 128 CPUs.
>>
>> stress-ng --rawpkt 128 --rawpkt-ops 100000000
>>
>> base    patched
>> 2.22s   2.35s
>> 2.21s   3.14s
>> 2.19s   4.75s
>>
>> Found this atomic version performance numbers are not stable.
> 
> That's weirdly too bad. Is that measured also with __CMPXCHG_DOUBLE
> disabled, or just the patch? The PG_workingset flag change should be

The performance test is just the patch.

> uncontended as we are doing it under list_lock, and with __CMPXCHG_DOUBLE
> there should be no interfering PG_locked interference.
> 

Yes, I don't know. Maybe it's related with my kernel config, making the
atomic operation much expensive? Will look again..

And I also tested the atomic-optional version like below, found the
performance numbers are much stable.

diff --git a/mm/slub.c b/mm/slub.c
index a307d319e82c..e11d34d51a14 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -531,7 +531,7 @@ static __always_inline void slab_unlock(struct slab *slab)
        struct page *page = slab_page(slab);

        VM_BUG_ON_PAGE(PageTail(page), page);
-       __bit_spin_unlock(PG_locked, &page->flags);
+       bit_spin_unlock(PG_locked, &page->flags);
 }

 static inline bool
@@ -2136,12 +2136,18 @@ static inline bool slab_test_node_partial(const struct slab *slab)

 static inline void slab_set_node_partial(struct slab *slab)
 {
-       __set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+       if (slab->slab_cache->flags & __CMPXCHG_DOUBLE)
+               __set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+       else
+               set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
 }

 static inline void slab_clear_node_partial(struct slab *slab)
 {
-       __clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+       if (slab->slab_cache->flags & __CMPXCHG_DOUBLE)
+               __clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
+       else
+               clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
 }


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-22 11:54           ` Chengming Zhou
@ 2023-11-22 13:19             ` Vlastimil Babka
  2023-11-22 14:28               ` Chengming Zhou
  0 siblings, 1 reply; 44+ messages in thread
From: Vlastimil Babka @ 2023-11-22 13:19 UTC (permalink / raw)
  To: Chengming Zhou, Mark Brown
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou, Matthew Wilcox

On 11/22/23 12:54, Chengming Zhou wrote:
> On 2023/11/22 19:40, Vlastimil Babka wrote:
>> On 11/22/23 12:35, Chengming Zhou wrote:
>>> On 2023/11/22 17:37, Vlastimil Babka wrote:
>>>> On 11/20/23 19:49, Mark Brown wrote:
>>>>> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:
>>>>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>>>>
>>>>>> Now we will freeze slabs when moving them out of node partial list to
>>>>>> cpu partial list, this method needs two cmpxchg_double operations:
>>>>>>
>>>>>> 1. freeze slab (acquire_slab()) under the node list_lock
>>>>>> 2. get_freelist() when pick used in ___slab_alloc()
>>>>>
>>>>> Recently -next has been failing to boot on a Raspberry Pi 3 with an arm
>>>>> multi_v7_defconfig and a NFS rootfs, a bisect appears to point to this
>>>>> patch (in -next as c8d312e039030edab25836a326bcaeb2a3d4db14) as having
>>>>> introduced the issue.  I've included the full bisect log below.
>>>>>
>>>>> When we see problems we see RCU stalls while logging in, for example:
>>>>
>>>> Can you try this, please?
>>>>
>>>
>>> Great! I manually disabled __CMPXCHG_DOUBLE to reproduce the problem,
>>> and this patch can solve the machine hang problem.
>>>
>>> BTW, I also did the performance testcase on the machine with 128 CPUs.
>>>
>>> stress-ng --rawpkt 128 --rawpkt-ops 100000000
>>>
>>> base    patched
>>> 2.22s   2.35s
>>> 2.21s   3.14s
>>> 2.19s   4.75s
>>>
>>> Found this atomic version performance numbers are not stable.
>> 
>> That's weirdly too bad. Is that measured also with __CMPXCHG_DOUBLE
>> disabled, or just the patch? The PG_workingset flag change should be
> 
> The performance test is just the patch.
> 
>> uncontended as we are doing it under list_lock, and with __CMPXCHG_DOUBLE
>> there should be no interfering PG_locked interference.
>> 
> 
> Yes, I don't know. Maybe it's related with my kernel config, making the
> atomic operation much expensive? Will look again..

I doubt it can explain going from 2.19s to 4.75s, must have been some
interference on the machine?

> And I also tested the atomic-optional version like below, found the
> performance numbers are much stable.

This gets rather ugly and fragile so I'd maybe rather go back to the
__unused field approach :/

> diff --git a/mm/slub.c b/mm/slub.c
> index a307d319e82c..e11d34d51a14 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -531,7 +531,7 @@ static __always_inline void slab_unlock(struct slab *slab)
>         struct page *page = slab_page(slab);
> 
>         VM_BUG_ON_PAGE(PageTail(page), page);
> -       __bit_spin_unlock(PG_locked, &page->flags);
> +       bit_spin_unlock(PG_locked, &page->flags);
>  }
> 
>  static inline bool
> @@ -2136,12 +2136,18 @@ static inline bool slab_test_node_partial(const struct slab *slab)
> 
>  static inline void slab_set_node_partial(struct slab *slab)
>  {
> -       __set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
> +       if (slab->slab_cache->flags & __CMPXCHG_DOUBLE)
> +               __set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
> +       else
> +               set_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
>  }
> 
>  static inline void slab_clear_node_partial(struct slab *slab)
>  {
> -       __clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
> +       if (slab->slab_cache->flags & __CMPXCHG_DOUBLE)
> +               __clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
> +       else
> +               clear_bit(PG_workingset, folio_flags(slab_folio(slab), 0));
>  }



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-22 13:19             ` Vlastimil Babka
@ 2023-11-22 14:28               ` Chengming Zhou
  2023-11-22 14:32                 ` Vlastimil Babka
  0 siblings, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2023-11-22 14:28 UTC (permalink / raw)
  To: Vlastimil Babka, Mark Brown
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou, Matthew Wilcox

On 2023/11/22 21:19, Vlastimil Babka wrote:
> On 11/22/23 12:54, Chengming Zhou wrote:
>> On 2023/11/22 19:40, Vlastimil Babka wrote:
>>> On 11/22/23 12:35, Chengming Zhou wrote:
>>>> On 2023/11/22 17:37, Vlastimil Babka wrote:
>>>>> On 11/20/23 19:49, Mark Brown wrote:
>>>>>> On Thu, Nov 02, 2023 at 03:23:27AM +0000, chengming.zhou@linux.dev wrote:
>>>>>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>>>>>
>>>>>>> Now we will freeze slabs when moving them out of node partial list to
>>>>>>> cpu partial list, this method needs two cmpxchg_double operations:
>>>>>>>
>>>>>>> 1. freeze slab (acquire_slab()) under the node list_lock
>>>>>>> 2. get_freelist() when pick used in ___slab_alloc()
>>>>>>
>>>>>> Recently -next has been failing to boot on a Raspberry Pi 3 with an arm
>>>>>> multi_v7_defconfig and a NFS rootfs, a bisect appears to point to this
>>>>>> patch (in -next as c8d312e039030edab25836a326bcaeb2a3d4db14) as having
>>>>>> introduced the issue.  I've included the full bisect log below.
>>>>>>
>>>>>> When we see problems we see RCU stalls while logging in, for example:
>>>>>
>>>>> Can you try this, please?
>>>>>
>>>>
>>>> Great! I manually disabled __CMPXCHG_DOUBLE to reproduce the problem,
>>>> and this patch can solve the machine hang problem.
>>>>
>>>> BTW, I also did the performance testcase on the machine with 128 CPUs.
>>>>
>>>> stress-ng --rawpkt 128 --rawpkt-ops 100000000
>>>>
>>>> base    patched
>>>> 2.22s   2.35s
>>>> 2.21s   3.14s
>>>> 2.19s   4.75s
>>>>
>>>> Found this atomic version performance numbers are not stable.
>>>
>>> That's weirdly too bad. Is that measured also with __CMPXCHG_DOUBLE
>>> disabled, or just the patch? The PG_workingset flag change should be
>>
>> The performance test is just the patch.
>>
>>> uncontended as we are doing it under list_lock, and with __CMPXCHG_DOUBLE
>>> there should be no interfering PG_locked interference.
>>>
>>
>> Yes, I don't know. Maybe it's related with my kernel config, making the
>> atomic operation much expensive? Will look again..
> 
> I doubt it can explain going from 2.19s to 4.75s, must have been some
> interference on the machine?
> 

Yes, it looks so. There are some background services on the 128 CPUs machine.
Although "stress-ng --rawpkt 128 --rawpkt-ops 100000000" has so much regression,
I tried other less contented testcases:

1. stress-ng --rawpkt 64 --rawpkt-ops 100000000
2. perf bench sched messaging -g 5 -t -l 100000

The performance numbers of this atomic version are pretty much the same.

So this atomic version should be good in most cases IMHO.

>> And I also tested the atomic-optional version like below, found the
>> performance numbers are much stable.
> 
> This gets rather ugly and fragile so I'd maybe rather go back to the
> __unused field approach :/
> 

Agree. If we don't want this atomic version, the __unused field approach
seems better.

Thanks!



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-22 14:28               ` Chengming Zhou
@ 2023-11-22 14:32                 ` Vlastimil Babka
  0 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2023-11-22 14:32 UTC (permalink / raw)
  To: Chengming Zhou, Mark Brown
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou, Matthew Wilcox

On 11/22/23 15:28, Chengming Zhou wrote:
> 
> Yes, it looks so. There are some background services on the 128 CPUs machine.
> Although "stress-ng --rawpkt 128 --rawpkt-ops 100000000" has so much regression,
> I tried other less contented testcases:
> 
> 1. stress-ng --rawpkt 64 --rawpkt-ops 100000000
> 2. perf bench sched messaging -g 5 -t -l 100000
> 
> The performance numbers of this atomic version are pretty much the same.
> 
> So this atomic version should be good in most cases IMHO.

OK will fold the fix using full atomic version.

>>> And I also tested the atomic-optional version like below, found the
>>> performance numbers are much stable.
>> 
>> This gets rather ugly and fragile so I'd maybe rather go back to the
>> __unused field approach :/
>> 
> 
> Agree. If we don't want this atomic version, the __unused field approach
> seems better.
> 
> Thanks!
> 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-11-02  3:23 ` [PATCH v5 6/9] slub: Delay freezing of partial slabs chengming.zhou
  2023-11-14  5:44   ` kernel test robot
  2023-11-20 18:49   ` Mark Brown
@ 2023-12-03  6:53   ` Hyeonggon Yoo
  2023-12-03 10:15     ` Chengming Zhou
  2 siblings, 1 reply; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-12-03  6:53 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> Now we will freeze slabs when moving them out of node partial list to
> cpu partial list, this method needs two cmpxchg_double operations:
>
> 1. freeze slab (acquire_slab()) under the node list_lock
> 2. get_freelist() when pick used in ___slab_alloc()
>
> Actually we don't need to freeze when moving slabs out of node partial
> list, we can delay freezing to when use slab freelist in ___slab_alloc(),
> so we can save one cmpxchg_double().
>
> And there are other good points:
>  - The moving of slabs between node partial list and cpu partial list
>    becomes simpler, since we don't need to freeze or unfreeze at all.
>
>  - The node list_lock contention would be less, since we don't need to
>    freeze any slab under the node list_lock.
>
> We can achieve this because there is no concurrent path would manipulate
> the partial slab list except the __slab_free() path, which is now
> serialized by slab_test_node_partial() under the list_lock.
>
> Since the slab returned by get_partial() interfaces is not frozen anymore
> and no freelist is returned in the partial_context, so we need to use the
> introduced freeze_slab() to freeze it and get its freelist.
>
> Similarly, the slabs on the CPU partial list are not frozen anymore,
> we need to freeze_slab() on it before use.
>
> We can now delete acquire_slab() as it became unused.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 113 +++++++++++-------------------------------------------
>  1 file changed, 23 insertions(+), 90 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index edf567971679..bcb5b2c4e213 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2234,51 +2234,6 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s,
>         return object;
>  }
>
> -/*
> - * Remove slab from the partial list, freeze it and
> - * return the pointer to the freelist.
> - *
> - * Returns a list of objects or NULL if it fails.
> - */
> -static inline void *acquire_slab(struct kmem_cache *s,
> -               struct kmem_cache_node *n, struct slab *slab,
> -               int mode)

Nit: alloc_single_from_partial()'s comment still refers to acquire_slab().

> -{
> -       void *freelist;
> -       unsigned long counters;
> -       struct slab new;
> -
> -       lockdep_assert_held(&n->list_lock);
> -
> -       /*
> -        * Zap the freelist and set the frozen bit.
> -        * The old freelist is the list of objects for the
> -        * per cpu allocation list.
> -        */
> -       freelist = slab->freelist;
> -       counters = slab->counters;
> -       new.counters = counters;
> -       if (mode) {
> -               new.inuse = slab->objects;
> -               new.freelist = NULL;
> -       } else {
> -               new.freelist = freelist;
> -       }
> -
> -       VM_BUG_ON(new.frozen);
> -       new.frozen = 1;
> -
> -       if (!__slab_update_freelist(s, slab,
> -                       freelist, counters,
> -                       new.freelist, new.counters,
> -                       "acquire_slab"))
> -               return NULL;
> -
> -       remove_partial(n, slab);
> -       WARN_ON(!freelist);
> -       return freelist;
> -}
> -
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>  static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
>  #else
> @@ -2295,7 +2250,6 @@ static struct slab *get_partial_node(struct kmem_cache *s,
>                                      struct partial_context *pc)
>  {
>         struct slab *slab, *slab2, *partial = NULL;
> -       void *object = NULL;
>         unsigned long flags;
>         unsigned int partial_slabs = 0;
>
> @@ -2314,7 +2268,7 @@ static struct slab *get_partial_node(struct kmem_cache *s,
>                         continue;
>
>                 if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
> -                       object = alloc_single_from_partial(s, n, slab,
> +                       void *object = alloc_single_from_partial(s, n, slab,
>                                                         pc->orig_size);
>                         if (object) {
>                                 partial = slab;
> @@ -2324,13 +2278,10 @@ static struct slab *get_partial_node(struct kmem_cache *s,
>                         continue;
>                 }
>
> -               object = acquire_slab(s, n, slab, object == NULL);
> -               if (!object)
> -                       break;
> +               remove_partial(n, slab);
>
>                 if (!partial) {
>                         partial = slab;
> -                       pc->object = object;
>                         stat(s, ALLOC_FROM_PARTIAL);
>                 } else {
>                         put_cpu_partial(s, slab, 0);
> @@ -2629,9 +2580,6 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
>         unsigned long flags = 0;
>
>         while (partial_slab) {
> -               struct slab new;
> -               struct slab old;
> -
>                 slab = partial_slab;
>                 partial_slab = slab->next;
>
> @@ -2644,23 +2592,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
>                         spin_lock_irqsave(&n->list_lock, flags);
>                 }
>
> -               do {
> -
> -                       old.freelist = slab->freelist;
> -                       old.counters = slab->counters;
> -                       VM_BUG_ON(!old.frozen);
> -
> -                       new.counters = old.counters;
> -                       new.freelist = old.freelist;
> -
> -                       new.frozen = 0;
> -
> -               } while (!__slab_update_freelist(s, slab,
> -                               old.freelist, old.counters,
> -                               new.freelist, new.counters,
> -                               "unfreezing slab"));
> -
> -               if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
> +               if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
>                         slab->next = slab_to_discard;
>                         slab_to_discard = slab;
>                 } else {
> @@ -3167,7 +3099,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>                         node = NUMA_NO_NODE;
>                 goto new_slab;
>         }
> -redo:
>
>         if (unlikely(!node_match(slab, node))) {
>                 /*
> @@ -3243,7 +3174,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
>  new_slab:
>
> -       if (slub_percpu_partial(c)) {
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
> +       while (slub_percpu_partial(c)) {
>                 local_lock_irqsave(&s->cpu_slab->lock, flags);
>                 if (unlikely(c->slab)) {
>                         local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> @@ -3255,12 +3187,22 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>                         goto new_objects;
>                 }
>
> -               slab = c->slab = slub_percpu_partial(c);
> +               slab = slub_percpu_partial(c);
>                 slub_set_percpu_partial(c, slab);
>                 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>                 stat(s, CPU_PARTIAL_ALLOC);
> -               goto redo;
> +
> +               if (unlikely(!node_match(slab, node) ||
> +                            !pfmemalloc_match(slab, gfpflags))) {
> +                       slab->next = NULL;
> +                       __unfreeze_partials(s, slab);
> +                       continue;
> +               }
> +
> +               freelist = freeze_slab(s, slab);
> +               goto retry_load_slab;
>         }
> +#endif
>
>  new_objects:
>
> @@ -3268,8 +3210,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>         pc.orig_size = orig_size;
>         slab = get_partial(s, node, &pc);
>         if (slab) {
> -               freelist = pc.object;
>                 if (kmem_cache_debug(s)) {
> +                       freelist = pc.object;
>                         /*
>                          * For debug caches here we had to go through
>                          * alloc_single_from_partial() so just store the
> @@ -3281,6 +3223,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>                         return freelist;
>                 }
>
> +               freelist = freeze_slab(s, slab);
>                 goto retry_load_slab;
>         }
>
> @@ -3682,18 +3625,8 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>                 was_frozen = new.frozen;
>                 new.inuse -= cnt;
>                 if ((!new.inuse || !prior) && !was_frozen) {
> -
> -                       if (kmem_cache_has_cpu_partial(s) && !prior) {
> -
> -                               /*
> -                                * Slab was on no list before and will be
> -                                * partially empty
> -                                * We can defer the list move and instead
> -                                * freeze it.
> -                                */
> -                               new.frozen = 1;
> -
> -                       } else { /* Needs to be taken off a list */
> +                       /* Needs to be taken off a list */
> +                       if (!kmem_cache_has_cpu_partial(s) || prior) {
>
>                                 n = get_node(s, slab_nid(slab));
>                                 /*
> @@ -3723,9 +3656,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>                          * activity can be necessary.
>                          */
>                         stat(s, FREE_FROZEN);
> -               } else if (new.frozen) {
> +               } else if (kmem_cache_has_cpu_partial(s) && !prior) {
>                         /*
> -                        * If we just froze the slab then put it onto the
> +                        * If we started with a full slab then put it onto the
>                          * per cpu partial list.
>                          */
>                         put_cpu_partial(s, slab, 1);
> --

Looks good to me,
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Thanks!

> 2.20.1
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-12-03  6:53   ` Hyeonggon Yoo
@ 2023-12-03 10:15     ` Chengming Zhou
  2023-12-04 16:58       ` Vlastimil Babka
  0 siblings, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2023-12-03 10:15 UTC (permalink / raw)
  To: Hyeonggon Yoo, vbabka
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	linux-mm, linux-kernel, Chengming Zhou

On 2023/12/3 14:53, Hyeonggon Yoo wrote:
> On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
>>
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> Now we will freeze slabs when moving them out of node partial list to
>> cpu partial list, this method needs two cmpxchg_double operations:
>>
>> 1. freeze slab (acquire_slab()) under the node list_lock
>> 2. get_freelist() when pick used in ___slab_alloc()
>>
>> Actually we don't need to freeze when moving slabs out of node partial
>> list, we can delay freezing to when use slab freelist in ___slab_alloc(),
>> so we can save one cmpxchg_double().
>>
>> And there are other good points:
>>  - The moving of slabs between node partial list and cpu partial list
>>    becomes simpler, since we don't need to freeze or unfreeze at all.
>>
>>  - The node list_lock contention would be less, since we don't need to
>>    freeze any slab under the node list_lock.
>>
>> We can achieve this because there is no concurrent path would manipulate
>> the partial slab list except the __slab_free() path, which is now
>> serialized by slab_test_node_partial() under the list_lock.
>>
>> Since the slab returned by get_partial() interfaces is not frozen anymore
>> and no freelist is returned in the partial_context, so we need to use the
>> introduced freeze_slab() to freeze it and get its freelist.
>>
>> Similarly, the slabs on the CPU partial list are not frozen anymore,
>> we need to freeze_slab() on it before use.
>>
>> We can now delete acquire_slab() as it became unused.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> ---
>>  mm/slub.c | 113 +++++++++++-------------------------------------------
>>  1 file changed, 23 insertions(+), 90 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index edf567971679..bcb5b2c4e213 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2234,51 +2234,6 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s,
>>         return object;
>>  }
>>
>> -/*
>> - * Remove slab from the partial list, freeze it and
>> - * return the pointer to the freelist.
>> - *
>> - * Returns a list of objects or NULL if it fails.
>> - */
>> -static inline void *acquire_slab(struct kmem_cache *s,
>> -               struct kmem_cache_node *n, struct slab *slab,
>> -               int mode)
> 
> Nit: alloc_single_from_partial()'s comment still refers to acquire_slab().
> 

Ah, right! It should be changed to remove_partial().

diff --git a/mm/slub.c b/mm/slub.c
index 437485a2408d..623c17a4cdd6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2463,7 +2463,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
 }

 /*
- * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
+ * Called only for kmem_cache_debug() caches instead of remove_partial(), with a
  * slab from the n->partial list. Remove only a single object from the slab, do
  * the alloc_debug_processing() checks and leave the slab on the list, or move
  * it to full list if it was the last free object.

Hi Vlastimil, could you please help to fold it?

Thanks!

>> -{
>> -       void *freelist;
>> -       unsigned long counters;
>> -       struct slab new;
>> -
>> -       lockdep_assert_held(&n->list_lock);
>> -
>> -       /*
>> -        * Zap the freelist and set the frozen bit.
>> -        * The old freelist is the list of objects for the
>> -        * per cpu allocation list.
>> -        */
>> -       freelist = slab->freelist;
>> -       counters = slab->counters;
>> -       new.counters = counters;
>> -       if (mode) {
>> -               new.inuse = slab->objects;
>> -               new.freelist = NULL;
>> -       } else {
>> -               new.freelist = freelist;
>> -       }
>> -
>> -       VM_BUG_ON(new.frozen);
>> -       new.frozen = 1;
>> -
>> -       if (!__slab_update_freelist(s, slab,
>> -                       freelist, counters,
>> -                       new.freelist, new.counters,
>> -                       "acquire_slab"))
>> -               return NULL;
>> -
>> -       remove_partial(n, slab);
>> -       WARN_ON(!freelist);
>> -       return freelist;
>> -}
>> -
>>  #ifdef CONFIG_SLUB_CPU_PARTIAL
>>  static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain);
>>  #else
>> @@ -2295,7 +2250,6 @@ static struct slab *get_partial_node(struct kmem_cache *s,
>>                                      struct partial_context *pc)
>>  {
>>         struct slab *slab, *slab2, *partial = NULL;
>> -       void *object = NULL;
>>         unsigned long flags;
>>         unsigned int partial_slabs = 0;
>>
>> @@ -2314,7 +2268,7 @@ static struct slab *get_partial_node(struct kmem_cache *s,
>>                         continue;
>>
>>                 if (IS_ENABLED(CONFIG_SLUB_TINY) || kmem_cache_debug(s)) {
>> -                       object = alloc_single_from_partial(s, n, slab,
>> +                       void *object = alloc_single_from_partial(s, n, slab,
>>                                                         pc->orig_size);
>>                         if (object) {
>>                                 partial = slab;
>> @@ -2324,13 +2278,10 @@ static struct slab *get_partial_node(struct kmem_cache *s,
>>                         continue;
>>                 }
>>
>> -               object = acquire_slab(s, n, slab, object == NULL);
>> -               if (!object)
>> -                       break;
>> +               remove_partial(n, slab);
>>
>>                 if (!partial) {
>>                         partial = slab;
>> -                       pc->object = object;
>>                         stat(s, ALLOC_FROM_PARTIAL);
>>                 } else {
>>                         put_cpu_partial(s, slab, 0);
>> @@ -2629,9 +2580,6 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
>>         unsigned long flags = 0;
>>
>>         while (partial_slab) {
>> -               struct slab new;
>> -               struct slab old;
>> -
>>                 slab = partial_slab;
>>                 partial_slab = slab->next;
>>
>> @@ -2644,23 +2592,7 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
>>                         spin_lock_irqsave(&n->list_lock, flags);
>>                 }
>>
>> -               do {
>> -
>> -                       old.freelist = slab->freelist;
>> -                       old.counters = slab->counters;
>> -                       VM_BUG_ON(!old.frozen);
>> -
>> -                       new.counters = old.counters;
>> -                       new.freelist = old.freelist;
>> -
>> -                       new.frozen = 0;
>> -
>> -               } while (!__slab_update_freelist(s, slab,
>> -                               old.freelist, old.counters,
>> -                               new.freelist, new.counters,
>> -                               "unfreezing slab"));
>> -
>> -               if (unlikely(!new.inuse && n->nr_partial >= s->min_partial)) {
>> +               if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial)) {
>>                         slab->next = slab_to_discard;
>>                         slab_to_discard = slab;
>>                 } else {
>> @@ -3167,7 +3099,6 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>                         node = NUMA_NO_NODE;
>>                 goto new_slab;
>>         }
>> -redo:
>>
>>         if (unlikely(!node_match(slab, node))) {
>>                 /*
>> @@ -3243,7 +3174,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>
>>  new_slab:
>>
>> -       if (slub_percpu_partial(c)) {
>> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>> +       while (slub_percpu_partial(c)) {
>>                 local_lock_irqsave(&s->cpu_slab->lock, flags);
>>                 if (unlikely(c->slab)) {
>>                         local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>> @@ -3255,12 +3187,22 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>                         goto new_objects;
>>                 }
>>
>> -               slab = c->slab = slub_percpu_partial(c);
>> +               slab = slub_percpu_partial(c);
>>                 slub_set_percpu_partial(c, slab);
>>                 local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>>                 stat(s, CPU_PARTIAL_ALLOC);
>> -               goto redo;
>> +
>> +               if (unlikely(!node_match(slab, node) ||
>> +                            !pfmemalloc_match(slab, gfpflags))) {
>> +                       slab->next = NULL;
>> +                       __unfreeze_partials(s, slab);
>> +                       continue;
>> +               }
>> +
>> +               freelist = freeze_slab(s, slab);
>> +               goto retry_load_slab;
>>         }
>> +#endif
>>
>>  new_objects:
>>
>> @@ -3268,8 +3210,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>         pc.orig_size = orig_size;
>>         slab = get_partial(s, node, &pc);
>>         if (slab) {
>> -               freelist = pc.object;
>>                 if (kmem_cache_debug(s)) {
>> +                       freelist = pc.object;
>>                         /*
>>                          * For debug caches here we had to go through
>>                          * alloc_single_from_partial() so just store the
>> @@ -3281,6 +3223,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>                         return freelist;
>>                 }
>>
>> +               freelist = freeze_slab(s, slab);
>>                 goto retry_load_slab;
>>         }
>>
>> @@ -3682,18 +3625,8 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>>                 was_frozen = new.frozen;
>>                 new.inuse -= cnt;
>>                 if ((!new.inuse || !prior) && !was_frozen) {
>> -
>> -                       if (kmem_cache_has_cpu_partial(s) && !prior) {
>> -
>> -                               /*
>> -                                * Slab was on no list before and will be
>> -                                * partially empty
>> -                                * We can defer the list move and instead
>> -                                * freeze it.
>> -                                */
>> -                               new.frozen = 1;
>> -
>> -                       } else { /* Needs to be taken off a list */
>> +                       /* Needs to be taken off a list */
>> +                       if (!kmem_cache_has_cpu_partial(s) || prior) {
>>
>>                                 n = get_node(s, slab_nid(slab));
>>                                 /*
>> @@ -3723,9 +3656,9 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>>                          * activity can be necessary.
>>                          */
>>                         stat(s, FREE_FROZEN);
>> -               } else if (new.frozen) {
>> +               } else if (kmem_cache_has_cpu_partial(s) && !prior) {
>>                         /*
>> -                        * If we just froze the slab then put it onto the
>> +                        * If we started with a full slab then put it onto the
>>                          * per cpu partial list.
>>                          */
>>                         put_cpu_partial(s, slab, 1);
>> --
> 
> Looks good to me,
> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> 
> Thanks!
> 
>> 2.20.1
>>


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 6/9] slub: Delay freezing of partial slabs
  2023-12-03 10:15     ` Chengming Zhou
@ 2023-12-04 16:58       ` Vlastimil Babka
  0 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2023-12-04 16:58 UTC (permalink / raw)
  To: Chengming Zhou, Hyeonggon Yoo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	linux-mm, linux-kernel, Chengming Zhou

On 12/3/23 11:15, Chengming Zhou wrote:
> On 2023/12/3 14:53, Hyeonggon Yoo wrote:
>> On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
>>>
>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>
>>> Now we will freeze slabs when moving them out of node partial list to
>>> cpu partial list, this method needs two cmpxchg_double operations:
>>>
>>> 1. freeze slab (acquire_slab()) under the node list_lock
>>> 2. get_freelist() when pick used in ___slab_alloc()
>>>
>>> Actually we don't need to freeze when moving slabs out of node partial
>>> list, we can delay freezing to when use slab freelist in ___slab_alloc(),
>>> so we can save one cmpxchg_double().
>>>
>>> And there are other good points:
>>>  - The moving of slabs between node partial list and cpu partial list
>>>    becomes simpler, since we don't need to freeze or unfreeze at all.
>>>
>>>  - The node list_lock contention would be less, since we don't need to
>>>    freeze any slab under the node list_lock.
>>>
>>> We can achieve this because there is no concurrent path would manipulate
>>> the partial slab list except the __slab_free() path, which is now
>>> serialized by slab_test_node_partial() under the list_lock.
>>>
>>> Since the slab returned by get_partial() interfaces is not frozen anymore
>>> and no freelist is returned in the partial_context, so we need to use the
>>> introduced freeze_slab() to freeze it and get its freelist.
>>>
>>> Similarly, the slabs on the CPU partial list are not frozen anymore,
>>> we need to freeze_slab() on it before use.
>>>
>>> We can now delete acquire_slab() as it became unused.
>>>
>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>>> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>>> ---
>>>  mm/slub.c | 113 +++++++++++-------------------------------------------
>>>  1 file changed, 23 insertions(+), 90 deletions(-)
>>>
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index edf567971679..bcb5b2c4e213 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -2234,51 +2234,6 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s,
>>>         return object;
>>>  }
>>>
>>> -/*
>>> - * Remove slab from the partial list, freeze it and
>>> - * return the pointer to the freelist.
>>> - *
>>> - * Returns a list of objects or NULL if it fails.
>>> - */
>>> -static inline void *acquire_slab(struct kmem_cache *s,
>>> -               struct kmem_cache_node *n, struct slab *slab,
>>> -               int mode)
>> 
>> Nit: alloc_single_from_partial()'s comment still refers to acquire_slab().
>> 
> 
> Ah, right! It should be changed to remove_partial().
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 437485a2408d..623c17a4cdd6 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2463,7 +2463,7 @@ static inline void remove_partial(struct kmem_cache_node *n,
>  }
> 
>  /*
> - * Called only for kmem_cache_debug() caches instead of acquire_slab(), with a
> + * Called only for kmem_cache_debug() caches instead of remove_partial(), with a
>   * slab from the n->partial list. Remove only a single object from the slab, do
>   * the alloc_debug_processing() checks and leave the slab on the list, or move
>   * it to full list if it was the last free object.
> 
> Hi Vlastimil, could you please help to fold it?

Done, thanks.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 7/9] slub: Optimize deactivate_slab()
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (5 preceding siblings ...)
  2023-11-02  3:23 ` [PATCH v5 6/9] slub: Delay freezing of partial slabs chengming.zhou
@ 2023-11-02  3:23 ` chengming.zhou
  2023-12-03  9:23   ` Hyeonggon Yoo
  2023-11-02  3:23 ` [PATCH v5 8/9] slub: Rename all *unfreeze_partials* functions to *put_partials* chengming.zhou
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

Since the introduce of unfrozen slabs on cpu partial list, we don't
need to synchronize the slab frozen state under the node list_lock.

The caller of deactivate_slab() and the caller of __slab_free() won't
manipulate the slab list concurrently.

So we can get node list_lock in the last stage if we really need to
manipulate the slab list in this path.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 79 ++++++++++++++++++-------------------------------------
 1 file changed, 26 insertions(+), 53 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index bcb5b2c4e213..d137468fe4b9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2468,10 +2468,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
 static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 			    void *freelist)
 {
-	enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
 	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
 	int free_delta = 0;
-	enum slab_modes mode = M_NONE;
 	void *nextfree, *freelist_iter, *freelist_tail;
 	int tail = DEACTIVATE_TO_HEAD;
 	unsigned long flags = 0;
@@ -2509,65 +2507,40 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 	/*
 	 * Stage two: Unfreeze the slab while splicing the per-cpu
 	 * freelist to the head of slab's freelist.
-	 *
-	 * Ensure that the slab is unfrozen while the list presence
-	 * reflects the actual number of objects during unfreeze.
-	 *
-	 * We first perform cmpxchg holding lock and insert to list
-	 * when it succeed. If there is mismatch then the slab is not
-	 * unfrozen and number of objects in the slab may have changed.
-	 * Then release lock and retry cmpxchg again.
 	 */
-redo:
-
-	old.freelist = READ_ONCE(slab->freelist);
-	old.counters = READ_ONCE(slab->counters);
-	VM_BUG_ON(!old.frozen);
-
-	/* Determine target state of the slab */
-	new.counters = old.counters;
-	if (freelist_tail) {
-		new.inuse -= free_delta;
-		set_freepointer(s, freelist_tail, old.freelist);
-		new.freelist = freelist;
-	} else
-		new.freelist = old.freelist;
-
-	new.frozen = 0;
+	do {
+		old.freelist = READ_ONCE(slab->freelist);
+		old.counters = READ_ONCE(slab->counters);
+		VM_BUG_ON(!old.frozen);
+
+		/* Determine target state of the slab */
+		new.counters = old.counters;
+		new.frozen = 0;
+		if (freelist_tail) {
+			new.inuse -= free_delta;
+			set_freepointer(s, freelist_tail, old.freelist);
+			new.freelist = freelist;
+		} else {
+			new.freelist = old.freelist;
+		}
+	} while (!slab_update_freelist(s, slab,
+		old.freelist, old.counters,
+		new.freelist, new.counters,
+		"unfreezing slab"));
 
+	/*
+	 * Stage three: Manipulate the slab list based on the updated state.
+	 */
 	if (!new.inuse && n->nr_partial >= s->min_partial) {
-		mode = M_FREE;
+		stat(s, DEACTIVATE_EMPTY);
+		discard_slab(s, slab);
+		stat(s, FREE_SLAB);
 	} else if (new.freelist) {
-		mode = M_PARTIAL;
-		/*
-		 * Taking the spinlock removes the possibility that
-		 * acquire_slab() will see a slab that is frozen
-		 */
 		spin_lock_irqsave(&n->list_lock, flags);
-	} else {
-		mode = M_FULL_NOLIST;
-	}
-
-
-	if (!slab_update_freelist(s, slab,
-				old.freelist, old.counters,
-				new.freelist, new.counters,
-				"unfreezing slab")) {
-		if (mode == M_PARTIAL)
-			spin_unlock_irqrestore(&n->list_lock, flags);
-		goto redo;
-	}
-
-
-	if (mode == M_PARTIAL) {
 		add_partial(n, slab, tail);
 		spin_unlock_irqrestore(&n->list_lock, flags);
 		stat(s, tail);
-	} else if (mode == M_FREE) {
-		stat(s, DEACTIVATE_EMPTY);
-		discard_slab(s, slab);
-		stat(s, FREE_SLAB);
-	} else if (mode == M_FULL_NOLIST) {
+	} else {
 		stat(s, DEACTIVATE_FULL);
 	}
 }
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 7/9] slub: Optimize deactivate_slab()
  2023-11-02  3:23 ` [PATCH v5 7/9] slub: Optimize deactivate_slab() chengming.zhou
@ 2023-12-03  9:23   ` Hyeonggon Yoo
  2023-12-03 10:26     ` Chengming Zhou
  2023-12-04 17:55     ` Vlastimil Babka
  0 siblings, 2 replies; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-12-03  9:23 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> Since the introduce of unfrozen slabs on cpu partial list, we don't
> need to synchronize the slab frozen state under the node list_lock.
>
> The caller of deactivate_slab() and the caller of __slab_free() won't
> manipulate the slab list concurrently.
>
> So we can get node list_lock in the last stage if we really need to
> manipulate the slab list in this path.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 79 ++++++++++++++++++-------------------------------------
>  1 file changed, 26 insertions(+), 53 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index bcb5b2c4e213..d137468fe4b9 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2468,10 +2468,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
>  static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>                             void *freelist)
>  {
> -       enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
>         struct kmem_cache_node *n = get_node(s, slab_nid(slab));
>         int free_delta = 0;
> -       enum slab_modes mode = M_NONE;
>         void *nextfree, *freelist_iter, *freelist_tail;
>         int tail = DEACTIVATE_TO_HEAD;
>         unsigned long flags = 0;
> @@ -2509,65 +2507,40 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>         /*
>          * Stage two: Unfreeze the slab while splicing the per-cpu
>          * freelist to the head of slab's freelist.
> -        *
> -        * Ensure that the slab is unfrozen while the list presence
> -        * reflects the actual number of objects during unfreeze.
> -        *
> -        * We first perform cmpxchg holding lock and insert to list
> -        * when it succeed. If there is mismatch then the slab is not
> -        * unfrozen and number of objects in the slab may have changed.
> -        * Then release lock and retry cmpxchg again.
>          */
> -redo:
> -
> -       old.freelist = READ_ONCE(slab->freelist);
> -       old.counters = READ_ONCE(slab->counters);
> -       VM_BUG_ON(!old.frozen);
> -
> -       /* Determine target state of the slab */
> -       new.counters = old.counters;
> -       if (freelist_tail) {
> -               new.inuse -= free_delta;
> -               set_freepointer(s, freelist_tail, old.freelist);
> -               new.freelist = freelist;
> -       } else
> -               new.freelist = old.freelist;
> -
> -       new.frozen = 0;
> +       do {
> +               old.freelist = READ_ONCE(slab->freelist);
> +               old.counters = READ_ONCE(slab->counters);
> +               VM_BUG_ON(!old.frozen);
> +
> +               /* Determine target state of the slab */
> +               new.counters = old.counters;
> +               new.frozen = 0;
> +               if (freelist_tail) {
> +                       new.inuse -= free_delta;
> +                       set_freepointer(s, freelist_tail, old.freelist);
> +                       new.freelist = freelist;
> +               } else {
> +                       new.freelist = old.freelist;
> +               }
> +       } while (!slab_update_freelist(s, slab,
> +               old.freelist, old.counters,
> +               new.freelist, new.counters,
> +               "unfreezing slab"));
>
> +       /*
> +        * Stage three: Manipulate the slab list based on the updated state.
> +        */

deactivate_slab() might unconsciously put empty slabs into partial list, like:

deactivate_slab()                    __slab_free()
cmpxchg(), slab's not empty
                                               cmpxchg(), slab's empty
and unfrozen
                                               spin_lock(&n->list_lock)
                                               (slab's empty but not
on partial list,

spin_unlock(&n->list_lock) and return)
spin_lock(&n->list_lock)
put slab into partial list
spin_unlock(&n->list_lock)

IMHO it should be fine in the real world, but just wanted to
mention as it doesn't seem to be intentional.

Otherwise it looks good to me!

>         if (!new.inuse && n->nr_partial >= s->min_partial) {
> -               mode = M_FREE;
> +               stat(s, DEACTIVATE_EMPTY);
> +               discard_slab(s, slab);
> +               stat(s, FREE_SLAB);
>         } else if (new.freelist) {
> -               mode = M_PARTIAL;
> -               /*
> -                * Taking the spinlock removes the possibility that
> -                * acquire_slab() will see a slab that is frozen
> -                */
>                 spin_lock_irqsave(&n->list_lock, flags);
> -       } else {
> -               mode = M_FULL_NOLIST;
> -       }
> -
> -
> -       if (!slab_update_freelist(s, slab,
> -                               old.freelist, old.counters,
> -                               new.freelist, new.counters,
> -                               "unfreezing slab")) {
> -               if (mode == M_PARTIAL)
> -                       spin_unlock_irqrestore(&n->list_lock, flags);
> -               goto redo;
> -       }
> -
> -
> -       if (mode == M_PARTIAL) {
>                 add_partial(n, slab, tail);
>                 spin_unlock_irqrestore(&n->list_lock, flags);
>                 stat(s, tail);
> -       } else if (mode == M_FREE) {
> -               stat(s, DEACTIVATE_EMPTY);
> -               discard_slab(s, slab);
> -               stat(s, FREE_SLAB);
> -       } else if (mode == M_FULL_NOLIST) {
> +       } else {
>                 stat(s, DEACTIVATE_FULL);
>         }
>  }
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 7/9] slub: Optimize deactivate_slab()
  2023-12-03  9:23   ` Hyeonggon Yoo
@ 2023-12-03 10:26     ` Chengming Zhou
  2023-12-03 11:19       ` Hyeonggon Yoo
  2023-12-04 17:55     ` Vlastimil Babka
  1 sibling, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2023-12-03 10:26 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On 2023/12/3 17:23, Hyeonggon Yoo wrote:
> On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
>>
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> Since the introduce of unfrozen slabs on cpu partial list, we don't
>> need to synchronize the slab frozen state under the node list_lock.
>>
>> The caller of deactivate_slab() and the caller of __slab_free() won't
>> manipulate the slab list concurrently.
>>
>> So we can get node list_lock in the last stage if we really need to
>> manipulate the slab list in this path.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> ---
>>  mm/slub.c | 79 ++++++++++++++++++-------------------------------------
>>  1 file changed, 26 insertions(+), 53 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index bcb5b2c4e213..d137468fe4b9 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2468,10 +2468,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
>>  static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>>                             void *freelist)
>>  {
>> -       enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
>>         struct kmem_cache_node *n = get_node(s, slab_nid(slab));
>>         int free_delta = 0;
>> -       enum slab_modes mode = M_NONE;
>>         void *nextfree, *freelist_iter, *freelist_tail;
>>         int tail = DEACTIVATE_TO_HEAD;
>>         unsigned long flags = 0;
>> @@ -2509,65 +2507,40 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>>         /*
>>          * Stage two: Unfreeze the slab while splicing the per-cpu
>>          * freelist to the head of slab's freelist.
>> -        *
>> -        * Ensure that the slab is unfrozen while the list presence
>> -        * reflects the actual number of objects during unfreeze.
>> -        *
>> -        * We first perform cmpxchg holding lock and insert to list
>> -        * when it succeed. If there is mismatch then the slab is not
>> -        * unfrozen and number of objects in the slab may have changed.
>> -        * Then release lock and retry cmpxchg again.
>>          */
>> -redo:
>> -
>> -       old.freelist = READ_ONCE(slab->freelist);
>> -       old.counters = READ_ONCE(slab->counters);
>> -       VM_BUG_ON(!old.frozen);
>> -
>> -       /* Determine target state of the slab */
>> -       new.counters = old.counters;
>> -       if (freelist_tail) {
>> -               new.inuse -= free_delta;
>> -               set_freepointer(s, freelist_tail, old.freelist);
>> -               new.freelist = freelist;
>> -       } else
>> -               new.freelist = old.freelist;
>> -
>> -       new.frozen = 0;
>> +       do {
>> +               old.freelist = READ_ONCE(slab->freelist);
>> +               old.counters = READ_ONCE(slab->counters);
>> +               VM_BUG_ON(!old.frozen);
>> +
>> +               /* Determine target state of the slab */
>> +               new.counters = old.counters;
>> +               new.frozen = 0;
>> +               if (freelist_tail) {
>> +                       new.inuse -= free_delta;
>> +                       set_freepointer(s, freelist_tail, old.freelist);
>> +                       new.freelist = freelist;
>> +               } else {
>> +                       new.freelist = old.freelist;
>> +               }
>> +       } while (!slab_update_freelist(s, slab,
>> +               old.freelist, old.counters,
>> +               new.freelist, new.counters,
>> +               "unfreezing slab"));
>>
>> +       /*
>> +        * Stage three: Manipulate the slab list based on the updated state.
>> +        */
> 
> deactivate_slab() might unconsciously put empty slabs into partial list, like:
> 
> deactivate_slab()                    __slab_free()
> cmpxchg(), slab's not empty
>                                                cmpxchg(), slab's empty
> and unfrozen

Hi,

Sorry, but I don't get it here how __slab_free() can see the slab empty,
since the slab is not empty from deactivate_slab() path, and it can't be
used by any CPU at that time?

Thanks for review!

>                                                spin_lock(&n->list_lock)
>                                                (slab's empty but not
> on partial list,
> 
> spin_unlock(&n->list_lock) and return)
> spin_lock(&n->list_lock)
> put slab into partial list
> spin_unlock(&n->list_lock)
> 
> IMHO it should be fine in the real world, but just wanted to
> mention as it doesn't seem to be intentional.
> 
> Otherwise it looks good to me!
> 
>>         if (!new.inuse && n->nr_partial >= s->min_partial) {
>> -               mode = M_FREE;
>> +               stat(s, DEACTIVATE_EMPTY);
>> +               discard_slab(s, slab);
>> +               stat(s, FREE_SLAB);
>>         } else if (new.freelist) {
>> -               mode = M_PARTIAL;
>> -               /*
>> -                * Taking the spinlock removes the possibility that
>> -                * acquire_slab() will see a slab that is frozen
>> -                */
>>                 spin_lock_irqsave(&n->list_lock, flags);
>> -       } else {
>> -               mode = M_FULL_NOLIST;
>> -       }
>> -
>> -
>> -       if (!slab_update_freelist(s, slab,
>> -                               old.freelist, old.counters,
>> -                               new.freelist, new.counters,
>> -                               "unfreezing slab")) {
>> -               if (mode == M_PARTIAL)
>> -                       spin_unlock_irqrestore(&n->list_lock, flags);
>> -               goto redo;
>> -       }
>> -
>> -
>> -       if (mode == M_PARTIAL) {
>>                 add_partial(n, slab, tail);
>>                 spin_unlock_irqrestore(&n->list_lock, flags);
>>                 stat(s, tail);
>> -       } else if (mode == M_FREE) {
>> -               stat(s, DEACTIVATE_EMPTY);
>> -               discard_slab(s, slab);
>> -               stat(s, FREE_SLAB);
>> -       } else if (mode == M_FULL_NOLIST) {
>> +       } else {
>>                 stat(s, DEACTIVATE_FULL);
>>         }
>>  }
>> --
>> 2.20.1
>>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 7/9] slub: Optimize deactivate_slab()
  2023-12-03 10:26     ` Chengming Zhou
@ 2023-12-03 11:19       ` Hyeonggon Yoo
  2023-12-03 11:47         ` Chengming Zhou
  0 siblings, 1 reply; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-12-03 11:19 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Sun, Dec 3, 2023 at 7:26 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> On 2023/12/3 17:23, Hyeonggon Yoo wrote:
> > On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
> >>
> >> From: Chengming Zhou <zhouchengming@bytedance.com>
> >>
> >> Since the introduce of unfrozen slabs on cpu partial list, we don't
> >> need to synchronize the slab frozen state under the node list_lock.
> >>
> >> The caller of deactivate_slab() and the caller of __slab_free() won't
> >> manipulate the slab list concurrently.
> >>
> >> So we can get node list_lock in the last stage if we really need to
> >> manipulate the slab list in this path.
> >>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> >> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >> ---
> >>  mm/slub.c | 79 ++++++++++++++++++-------------------------------------
> >>  1 file changed, 26 insertions(+), 53 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index bcb5b2c4e213..d137468fe4b9 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -2468,10 +2468,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
> >>  static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> >>                             void *freelist)
> >>  {
> >> -       enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
> >>         struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> >>         int free_delta = 0;
> >> -       enum slab_modes mode = M_NONE;
> >>         void *nextfree, *freelist_iter, *freelist_tail;
> >>         int tail = DEACTIVATE_TO_HEAD;
> >>         unsigned long flags = 0;
> >> @@ -2509,65 +2507,40 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> >>         /*
> >>          * Stage two: Unfreeze the slab while splicing the per-cpu
> >>          * freelist to the head of slab's freelist.
> >> -        *
> >> -        * Ensure that the slab is unfrozen while the list presence
> >> -        * reflects the actual number of objects during unfreeze.
> >> -        *
> >> -        * We first perform cmpxchg holding lock and insert to list
> >> -        * when it succeed. If there is mismatch then the slab is not
> >> -        * unfrozen and number of objects in the slab may have changed.
> >> -        * Then release lock and retry cmpxchg again.
> >>          */
> >> -redo:
> >> -
> >> -       old.freelist = READ_ONCE(slab->freelist);
> >> -       old.counters = READ_ONCE(slab->counters);
> >> -       VM_BUG_ON(!old.frozen);
> >> -
> >> -       /* Determine target state of the slab */
> >> -       new.counters = old.counters;
> >> -       if (freelist_tail) {
> >> -               new.inuse -= free_delta;
> >> -               set_freepointer(s, freelist_tail, old.freelist);
> >> -               new.freelist = freelist;
> >> -       } else
> >> -               new.freelist = old.freelist;
> >> -
> >> -       new.frozen = 0;
> >> +       do {
> >> +               old.freelist = READ_ONCE(slab->freelist);
> >> +               old.counters = READ_ONCE(slab->counters);
> >> +               VM_BUG_ON(!old.frozen);
> >> +
> >> +               /* Determine target state of the slab */
> >> +               new.counters = old.counters;
> >> +               new.frozen = 0;
> >> +               if (freelist_tail) {
> >> +                       new.inuse -= free_delta;
> >> +                       set_freepointer(s, freelist_tail, old.freelist);
> >> +                       new.freelist = freelist;
> >> +               } else {
> >> +                       new.freelist = old.freelist;
> >> +               }
> >> +       } while (!slab_update_freelist(s, slab,
> >> +               old.freelist, old.counters,
> >> +               new.freelist, new.counters,
> >> +               "unfreezing slab"));
> >>
> >> +       /*
> >> +        * Stage three: Manipulate the slab list based on the updated state.
> >> +        */
> >
> > deactivate_slab() might unconsciously put empty slabs into partial list, like:
> >
> > deactivate_slab()                    __slab_free()
> > cmpxchg(), slab's not empty
> >                                                cmpxchg(), slab's empty
> > and unfrozen
>
> Hi,
>
> Sorry, but I don't get it here how __slab_free() can see the slab empty,
> since the slab is not empty from deactivate_slab() path, and it can't be
> used by any CPU at that time?

The scenario is CPU B previously allocated an object from slab X, but
put it into node partial list and then CPU A have taken slab X into cpu slab.

While slab X is CPU A's cpu slab, when CPU B frees an object from slab X,
it puts the object into slab X's freelist using cmpxchg.

Let's say in CPU A the deactivation path performs cmpxchg and X.inuse was 1,
and then CPU B frees (__slab_free()) to slab X's freelist using cmpxchg,
_before_ slab X's put into partial list by CPU A.

Then CPU A thinks it's not empty so put it into partial list, but by CPU B
the slab has become empty.

Maybe I am confused, in that case please tell me I'm wrong :)

Thanks!

--
Hyeonggon


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 7/9] slub: Optimize deactivate_slab()
  2023-12-03 11:19       ` Hyeonggon Yoo
@ 2023-12-03 11:47         ` Chengming Zhou
  0 siblings, 0 replies; 44+ messages in thread
From: Chengming Zhou @ 2023-12-03 11:47 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On 2023/12/3 19:19, Hyeonggon Yoo wrote:
> On Sun, Dec 3, 2023 at 7:26 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>
>> On 2023/12/3 17:23, Hyeonggon Yoo wrote:
>>> On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
>>>>
>>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>>
>>>> Since the introduce of unfrozen slabs on cpu partial list, we don't
>>>> need to synchronize the slab frozen state under the node list_lock.
>>>>
>>>> The caller of deactivate_slab() and the caller of __slab_free() won't
>>>> manipulate the slab list concurrently.
>>>>
>>>> So we can get node list_lock in the last stage if we really need to
>>>> manipulate the slab list in this path.
>>>>
>>>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>>>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>>>> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>>>> ---
>>>>  mm/slub.c | 79 ++++++++++++++++++-------------------------------------
>>>>  1 file changed, 26 insertions(+), 53 deletions(-)
>>>>
>>>> diff --git a/mm/slub.c b/mm/slub.c
>>>> index bcb5b2c4e213..d137468fe4b9 100644
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -2468,10 +2468,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
>>>>  static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>>>>                             void *freelist)
>>>>  {
>>>> -       enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
>>>>         struct kmem_cache_node *n = get_node(s, slab_nid(slab));
>>>>         int free_delta = 0;
>>>> -       enum slab_modes mode = M_NONE;
>>>>         void *nextfree, *freelist_iter, *freelist_tail;
>>>>         int tail = DEACTIVATE_TO_HEAD;
>>>>         unsigned long flags = 0;
>>>> @@ -2509,65 +2507,40 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>>>>         /*
>>>>          * Stage two: Unfreeze the slab while splicing the per-cpu
>>>>          * freelist to the head of slab's freelist.
>>>> -        *
>>>> -        * Ensure that the slab is unfrozen while the list presence
>>>> -        * reflects the actual number of objects during unfreeze.
>>>> -        *
>>>> -        * We first perform cmpxchg holding lock and insert to list
>>>> -        * when it succeed. If there is mismatch then the slab is not
>>>> -        * unfrozen and number of objects in the slab may have changed.
>>>> -        * Then release lock and retry cmpxchg again.
>>>>          */
>>>> -redo:
>>>> -
>>>> -       old.freelist = READ_ONCE(slab->freelist);
>>>> -       old.counters = READ_ONCE(slab->counters);
>>>> -       VM_BUG_ON(!old.frozen);
>>>> -
>>>> -       /* Determine target state of the slab */
>>>> -       new.counters = old.counters;
>>>> -       if (freelist_tail) {
>>>> -               new.inuse -= free_delta;
>>>> -               set_freepointer(s, freelist_tail, old.freelist);
>>>> -               new.freelist = freelist;
>>>> -       } else
>>>> -               new.freelist = old.freelist;
>>>> -
>>>> -       new.frozen = 0;
>>>> +       do {
>>>> +               old.freelist = READ_ONCE(slab->freelist);
>>>> +               old.counters = READ_ONCE(slab->counters);
>>>> +               VM_BUG_ON(!old.frozen);
>>>> +
>>>> +               /* Determine target state of the slab */
>>>> +               new.counters = old.counters;
>>>> +               new.frozen = 0;
>>>> +               if (freelist_tail) {
>>>> +                       new.inuse -= free_delta;
>>>> +                       set_freepointer(s, freelist_tail, old.freelist);
>>>> +                       new.freelist = freelist;
>>>> +               } else {
>>>> +                       new.freelist = old.freelist;
>>>> +               }
>>>> +       } while (!slab_update_freelist(s, slab,
>>>> +               old.freelist, old.counters,
>>>> +               new.freelist, new.counters,
>>>> +               "unfreezing slab"));
>>>>
>>>> +       /*
>>>> +        * Stage three: Manipulate the slab list based on the updated state.
>>>> +        */
>>>
>>> deactivate_slab() might unconsciously put empty slabs into partial list, like:
>>>
>>> deactivate_slab()                    __slab_free()
>>> cmpxchg(), slab's not empty
>>>                                                cmpxchg(), slab's empty
>>> and unfrozen
>>
>> Hi,
>>
>> Sorry, but I don't get it here how __slab_free() can see the slab empty,
>> since the slab is not empty from deactivate_slab() path, and it can't be
>> used by any CPU at that time?
> 
> The scenario is CPU B previously allocated an object from slab X, but
> put it into node partial list and then CPU A have taken slab X into cpu slab.
> 
> While slab X is CPU A's cpu slab, when CPU B frees an object from slab X,
> it puts the object into slab X's freelist using cmpxchg.
> 
> Let's say in CPU A the deactivation path performs cmpxchg and X.inuse was 1,
> and then CPU B frees (__slab_free()) to slab X's freelist using cmpxchg,
> _before_ slab X's put into partial list by CPU A.
> 
> Then CPU A thinks it's not empty so put it into partial list, but by CPU B
> the slab has become empty.
> 
> Maybe I am confused, in that case please tell me I'm wrong :)
> 

Ah, you're right! I misunderstood the slab "empty" with "full". :)

Yes, in this case the "empty" slab would be put into the node partial list,
and it should be fine in the real world as you noted earlier.

Thanks!


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 7/9] slub: Optimize deactivate_slab()
  2023-12-03  9:23   ` Hyeonggon Yoo
  2023-12-03 10:26     ` Chengming Zhou
@ 2023-12-04 17:55     ` Vlastimil Babka
  2023-12-05  0:20       ` Hyeonggon Yoo
  1 sibling, 1 reply; 44+ messages in thread
From: Vlastimil Babka @ 2023-12-04 17:55 UTC (permalink / raw)
  To: Hyeonggon Yoo, chengming.zhou
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	linux-mm, linux-kernel, Chengming Zhou

On 12/3/23 10:23, Hyeonggon Yoo wrote:
> On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
>>
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> Since the introduce of unfrozen slabs on cpu partial list, we don't
>> need to synchronize the slab frozen state under the node list_lock.
>>
>> The caller of deactivate_slab() and the caller of __slab_free() won't
>> manipulate the slab list concurrently.
>>
>> So we can get node list_lock in the last stage if we really need to
>> manipulate the slab list in this path.
>>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>> ---
>>  mm/slub.c | 79 ++++++++++++++++++-------------------------------------
>>  1 file changed, 26 insertions(+), 53 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index bcb5b2c4e213..d137468fe4b9 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2468,10 +2468,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
>>  static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>>                             void *freelist)
>>  {
>> -       enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
>>         struct kmem_cache_node *n = get_node(s, slab_nid(slab));
>>         int free_delta = 0;
>> -       enum slab_modes mode = M_NONE;
>>         void *nextfree, *freelist_iter, *freelist_tail;
>>         int tail = DEACTIVATE_TO_HEAD;
>>         unsigned long flags = 0;
>> @@ -2509,65 +2507,40 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>>         /*
>>          * Stage two: Unfreeze the slab while splicing the per-cpu
>>          * freelist to the head of slab's freelist.
>> -        *
>> -        * Ensure that the slab is unfrozen while the list presence
>> -        * reflects the actual number of objects during unfreeze.
>> -        *
>> -        * We first perform cmpxchg holding lock and insert to list
>> -        * when it succeed. If there is mismatch then the slab is not
>> -        * unfrozen and number of objects in the slab may have changed.
>> -        * Then release lock and retry cmpxchg again.
>>          */
>> -redo:
>> -
>> -       old.freelist = READ_ONCE(slab->freelist);
>> -       old.counters = READ_ONCE(slab->counters);
>> -       VM_BUG_ON(!old.frozen);
>> -
>> -       /* Determine target state of the slab */
>> -       new.counters = old.counters;
>> -       if (freelist_tail) {
>> -               new.inuse -= free_delta;
>> -               set_freepointer(s, freelist_tail, old.freelist);
>> -               new.freelist = freelist;
>> -       } else
>> -               new.freelist = old.freelist;
>> -
>> -       new.frozen = 0;
>> +       do {
>> +               old.freelist = READ_ONCE(slab->freelist);
>> +               old.counters = READ_ONCE(slab->counters);
>> +               VM_BUG_ON(!old.frozen);
>> +
>> +               /* Determine target state of the slab */
>> +               new.counters = old.counters;
>> +               new.frozen = 0;
>> +               if (freelist_tail) {
>> +                       new.inuse -= free_delta;
>> +                       set_freepointer(s, freelist_tail, old.freelist);
>> +                       new.freelist = freelist;
>> +               } else {
>> +                       new.freelist = old.freelist;
>> +               }
>> +       } while (!slab_update_freelist(s, slab,
>> +               old.freelist, old.counters,
>> +               new.freelist, new.counters,
>> +               "unfreezing slab"));
>>
>> +       /*
>> +        * Stage three: Manipulate the slab list based on the updated state.
>> +        */
> 
> deactivate_slab() might unconsciously put empty slabs into partial list, like:
> 
> deactivate_slab()                    __slab_free()
> cmpxchg(), slab's not empty
>                                                cmpxchg(), slab's empty
> and unfrozen
>                                                spin_lock(&n->list_lock)
>                                                (slab's empty but not
> on partial list,
> 
> spin_unlock(&n->list_lock) and return)
> spin_lock(&n->list_lock)
> put slab into partial list
> spin_unlock(&n->list_lock)
> 
> IMHO it should be fine in the real world, but just wanted to
> mention as it doesn't seem to be intentional.

I've noticed it too during review, but then realized it's not a new
behavior, same thing could happen with deactivate_slab() already before the
series. Free slabs on partial list are supported, we even keep some
intentionally as long as "n->nr_partial < s->min_partial" (and that check is
racy too), so no need to try making this more strict.

> Otherwise it looks good to me!

Good enough for a reviewed-by? :)



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 7/9] slub: Optimize deactivate_slab()
  2023-12-04 17:55     ` Vlastimil Babka
@ 2023-12-05  0:20       ` Hyeonggon Yoo
  0 siblings, 0 replies; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-12-05  0:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: chengming.zhou, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Tue, Dec 5, 2023 at 2:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/3/23 10:23, Hyeonggon Yoo wrote:
> > On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
> >>
> >> From: Chengming Zhou <zhouchengming@bytedance.com>
> >>
> >> Since the introduce of unfrozen slabs on cpu partial list, we don't
> >> need to synchronize the slab frozen state under the node list_lock.
> >>
> >> The caller of deactivate_slab() and the caller of __slab_free() won't
> >> manipulate the slab list concurrently.
> >>
> >> So we can get node list_lock in the last stage if we really need to
> >> manipulate the slab list in this path.
> >>
> >> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> >> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> >> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >> ---
> >>  mm/slub.c | 79 ++++++++++++++++++-------------------------------------
> >>  1 file changed, 26 insertions(+), 53 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index bcb5b2c4e213..d137468fe4b9 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -2468,10 +2468,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
> >>  static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> >>                             void *freelist)
> >>  {
> >> -       enum slab_modes { M_NONE, M_PARTIAL, M_FREE, M_FULL_NOLIST };
> >>         struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> >>         int free_delta = 0;
> >> -       enum slab_modes mode = M_NONE;
> >>         void *nextfree, *freelist_iter, *freelist_tail;
> >>         int tail = DEACTIVATE_TO_HEAD;
> >>         unsigned long flags = 0;
> >> @@ -2509,65 +2507,40 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> >>         /*
> >>          * Stage two: Unfreeze the slab while splicing the per-cpu
> >>          * freelist to the head of slab's freelist.
> >> -        *
> >> -        * Ensure that the slab is unfrozen while the list presence
> >> -        * reflects the actual number of objects during unfreeze.
> >> -        *
> >> -        * We first perform cmpxchg holding lock and insert to list
> >> -        * when it succeed. If there is mismatch then the slab is not
> >> -        * unfrozen and number of objects in the slab may have changed.
> >> -        * Then release lock and retry cmpxchg again.
> >>          */
> >> -redo:
> >> -
> >> -       old.freelist = READ_ONCE(slab->freelist);
> >> -       old.counters = READ_ONCE(slab->counters);
> >> -       VM_BUG_ON(!old.frozen);
> >> -
> >> -       /* Determine target state of the slab */
> >> -       new.counters = old.counters;
> >> -       if (freelist_tail) {
> >> -               new.inuse -= free_delta;
> >> -               set_freepointer(s, freelist_tail, old.freelist);
> >> -               new.freelist = freelist;
> >> -       } else
> >> -               new.freelist = old.freelist;
> >> -
> >> -       new.frozen = 0;
> >> +       do {
> >> +               old.freelist = READ_ONCE(slab->freelist);
> >> +               old.counters = READ_ONCE(slab->counters);
> >> +               VM_BUG_ON(!old.frozen);
> >> +
> >> +               /* Determine target state of the slab */
> >> +               new.counters = old.counters;
> >> +               new.frozen = 0;
> >> +               if (freelist_tail) {
> >> +                       new.inuse -= free_delta;
> >> +                       set_freepointer(s, freelist_tail, old.freelist);
> >> +                       new.freelist = freelist;
> >> +               } else {
> >> +                       new.freelist = old.freelist;
> >> +               }
> >> +       } while (!slab_update_freelist(s, slab,
> >> +               old.freelist, old.counters,
> >> +               new.freelist, new.counters,
> >> +               "unfreezing slab"));
> >>
> >> +       /*
> >> +        * Stage three: Manipulate the slab list based on the updated state.
> >> +        */
> >
> > deactivate_slab() might unconsciously put empty slabs into partial list, like:
> >
> > deactivate_slab()                    __slab_free()
> > cmpxchg(), slab's not empty
> >                                                cmpxchg(), slab's empty
> > and unfrozen
> >                                                spin_lock(&n->list_lock)
> >                                                (slab's empty but not
> > on partial list,
> >
> > spin_unlock(&n->list_lock) and return)
> > spin_lock(&n->list_lock)
> > put slab into partial list
> > spin_unlock(&n->list_lock)
> >
> > IMHO it should be fine in the real world, but just wanted to
> > mention as it doesn't seem to be intentional.
>
> I've noticed it too during review, but then realized it's not a new
> behavior, same thing could happen with deactivate_slab() already before the
> series.

Ah, you are right.

>  Free slabs on partial list are supported, we even keep some
> intentionally as long as "n->nr_partial < s->min_partial" (and that check is
> racy too) so no need to try making this more strict.

Agreed.

> > Otherwise it looks good to me!
>
> Good enough for a reviewed-by? :)

Yes,
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Thanks!
--
Hyeonggon


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 8/9] slub: Rename all *unfreeze_partials* functions to *put_partials*
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (6 preceding siblings ...)
  2023-11-02  3:23 ` [PATCH v5 7/9] slub: Optimize deactivate_slab() chengming.zhou
@ 2023-11-02  3:23 ` chengming.zhou
  2023-12-03  9:27   ` Hyeonggon Yoo
  2023-11-02  3:23 ` [PATCH v5 9/9] slub: Update frozen slabs documentations in the source chengming.zhou
  2023-11-13  8:36 ` [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs Vlastimil Babka
  9 siblings, 1 reply; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

Since all partial slabs on the CPU partial list are not frozen anymore,
we don't unfreeze when moving cpu partial slabs to node partial list,
it's better to rename these functions.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d137468fe4b9..c20bdf5dab0f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2546,7 +2546,7 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
 }
 
 #ifdef CONFIG_SLUB_CPU_PARTIAL
-static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
+static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
 {
 	struct kmem_cache_node *n = NULL, *n2 = NULL;
 	struct slab *slab, *slab_to_discard = NULL;
@@ -2588,9 +2588,9 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
 }
 
 /*
- * Unfreeze all the cpu partial slabs.
+ * Put all the cpu partial slabs to the node partial list.
  */
-static void unfreeze_partials(struct kmem_cache *s)
+static void put_partials(struct kmem_cache *s)
 {
 	struct slab *partial_slab;
 	unsigned long flags;
@@ -2601,11 +2601,11 @@ static void unfreeze_partials(struct kmem_cache *s)
 	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
 	if (partial_slab)
-		__unfreeze_partials(s, partial_slab);
+		__put_partials(s, partial_slab);
 }
 
-static void unfreeze_partials_cpu(struct kmem_cache *s,
-				  struct kmem_cache_cpu *c)
+static void put_partials_cpu(struct kmem_cache *s,
+			     struct kmem_cache_cpu *c)
 {
 	struct slab *partial_slab;
 
@@ -2613,7 +2613,7 @@ static void unfreeze_partials_cpu(struct kmem_cache *s,
 	c->partial = NULL;
 
 	if (partial_slab)
-		__unfreeze_partials(s, partial_slab);
+		__put_partials(s, partial_slab);
 }
 
 /*
@@ -2626,7 +2626,7 @@ static void unfreeze_partials_cpu(struct kmem_cache *s,
 static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 {
 	struct slab *oldslab;
-	struct slab *slab_to_unfreeze = NULL;
+	struct slab *slab_to_put = NULL;
 	unsigned long flags;
 	int slabs = 0;
 
@@ -2641,7 +2641,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 			 * per node partial list. Postpone the actual unfreezing
 			 * outside of the critical section.
 			 */
-			slab_to_unfreeze = oldslab;
+			slab_to_put = oldslab;
 			oldslab = NULL;
 		} else {
 			slabs = oldslab->slabs;
@@ -2657,17 +2657,17 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 
 	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
-	if (slab_to_unfreeze) {
-		__unfreeze_partials(s, slab_to_unfreeze);
+	if (slab_to_put) {
+		__put_partials(s, slab_to_put);
 		stat(s, CPU_PARTIAL_DRAIN);
 	}
 }
 
 #else	/* CONFIG_SLUB_CPU_PARTIAL */
 
-static inline void unfreeze_partials(struct kmem_cache *s) { }
-static inline void unfreeze_partials_cpu(struct kmem_cache *s,
-				  struct kmem_cache_cpu *c) { }
+static inline void put_partials(struct kmem_cache *s) { }
+static inline void put_partials_cpu(struct kmem_cache *s,
+				    struct kmem_cache_cpu *c) { }
 
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 
@@ -2709,7 +2709,7 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 		stat(s, CPUSLAB_FLUSH);
 	}
 
-	unfreeze_partials_cpu(s, c);
+	put_partials_cpu(s, c);
 }
 
 struct slub_flush_work {
@@ -2737,7 +2737,7 @@ static void flush_cpu_slab(struct work_struct *w)
 	if (c->slab)
 		flush_slab(s, c);
 
-	unfreeze_partials(s);
+	put_partials(s);
 }
 
 static bool has_cpu_slab(int cpu, struct kmem_cache *s)
@@ -3168,7 +3168,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		if (unlikely(!node_match(slab, node) ||
 			     !pfmemalloc_match(slab, gfpflags))) {
 			slab->next = NULL;
-			__unfreeze_partials(s, slab);
+			__put_partials(s, slab);
 			continue;
 		}
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 8/9] slub: Rename all *unfreeze_partials* functions to *put_partials*
  2023-11-02  3:23 ` [PATCH v5 8/9] slub: Rename all *unfreeze_partials* functions to *put_partials* chengming.zhou
@ 2023-12-03  9:27   ` Hyeonggon Yoo
  0 siblings, 0 replies; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-12-03  9:27 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> Since all partial slabs on the CPU partial list are not frozen anymore,
> we don't unfreeze when moving cpu partial slabs to node partial list,
> it's better to rename these functions.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 34 +++++++++++++++++-----------------
>  1 file changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index d137468fe4b9..c20bdf5dab0f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2546,7 +2546,7 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
>  }
>
>  #ifdef CONFIG_SLUB_CPU_PARTIAL
> -static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
> +static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
>  {
>         struct kmem_cache_node *n = NULL, *n2 = NULL;
>         struct slab *slab, *slab_to_discard = NULL;
> @@ -2588,9 +2588,9 @@ static void __unfreeze_partials(struct kmem_cache *s, struct slab *partial_slab)
>  }
>
>  /*
> - * Unfreeze all the cpu partial slabs.
> + * Put all the cpu partial slabs to the node partial list.
>   */
> -static void unfreeze_partials(struct kmem_cache *s)
> +static void put_partials(struct kmem_cache *s)
>  {
>         struct slab *partial_slab;
>         unsigned long flags;
> @@ -2601,11 +2601,11 @@ static void unfreeze_partials(struct kmem_cache *s)
>         local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>
>         if (partial_slab)
> -               __unfreeze_partials(s, partial_slab);
> +               __put_partials(s, partial_slab);
>  }
>
> -static void unfreeze_partials_cpu(struct kmem_cache *s,
> -                                 struct kmem_cache_cpu *c)
> +static void put_partials_cpu(struct kmem_cache *s,
> +                            struct kmem_cache_cpu *c)
>  {
>         struct slab *partial_slab;
>
> @@ -2613,7 +2613,7 @@ static void unfreeze_partials_cpu(struct kmem_cache *s,
>         c->partial = NULL;
>
>         if (partial_slab)
> -               __unfreeze_partials(s, partial_slab);
> +               __put_partials(s, partial_slab);
>  }
>
>  /*
> @@ -2626,7 +2626,7 @@ static void unfreeze_partials_cpu(struct kmem_cache *s,
>  static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
>  {
>         struct slab *oldslab;
> -       struct slab *slab_to_unfreeze = NULL;
> +       struct slab *slab_to_put = NULL;
>         unsigned long flags;
>         int slabs = 0;
>
> @@ -2641,7 +2641,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
>                          * per node partial list. Postpone the actual unfreezing
>                          * outside of the critical section.
>                          */
> -                       slab_to_unfreeze = oldslab;
> +                       slab_to_put = oldslab;
>                         oldslab = NULL;
>                 } else {
>                         slabs = oldslab->slabs;
> @@ -2657,17 +2657,17 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
>
>         local_unlock_irqrestore(&s->cpu_slab->lock, flags);
>
> -       if (slab_to_unfreeze) {
> -               __unfreeze_partials(s, slab_to_unfreeze);
> +       if (slab_to_put) {
> +               __put_partials(s, slab_to_put);
>                 stat(s, CPU_PARTIAL_DRAIN);
>         }
>  }
>
>  #else  /* CONFIG_SLUB_CPU_PARTIAL */
>
> -static inline void unfreeze_partials(struct kmem_cache *s) { }
> -static inline void unfreeze_partials_cpu(struct kmem_cache *s,
> -                                 struct kmem_cache_cpu *c) { }
> +static inline void put_partials(struct kmem_cache *s) { }
> +static inline void put_partials_cpu(struct kmem_cache *s,
> +                                   struct kmem_cache_cpu *c) { }
>
>  #endif /* CONFIG_SLUB_CPU_PARTIAL */
>
> @@ -2709,7 +2709,7 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
>                 stat(s, CPUSLAB_FLUSH);
>         }
>
> -       unfreeze_partials_cpu(s, c);
> +       put_partials_cpu(s, c);
>  }
>
>  struct slub_flush_work {
> @@ -2737,7 +2737,7 @@ static void flush_cpu_slab(struct work_struct *w)
>         if (c->slab)
>                 flush_slab(s, c);
>
> -       unfreeze_partials(s);
> +       put_partials(s);
>  }
>
>  static bool has_cpu_slab(int cpu, struct kmem_cache *s)
> @@ -3168,7 +3168,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>                 if (unlikely(!node_match(slab, node) ||
>                              !pfmemalloc_match(slab, gfpflags))) {
>                         slab->next = NULL;
> -                       __unfreeze_partials(s, slab);
> +                       __put_partials(s, slab);
>                         continue;
>                 }
>
> --

Looks good to me,
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Thanks!

> 2.20.1
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v5 9/9] slub: Update frozen slabs documentations in the source
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (7 preceding siblings ...)
  2023-11-02  3:23 ` [PATCH v5 8/9] slub: Rename all *unfreeze_partials* functions to *put_partials* chengming.zhou
@ 2023-11-02  3:23 ` chengming.zhou
  2023-12-03  9:47   ` Hyeonggon Yoo
  2023-12-04 21:41   ` Christoph Lameter (Ampere)
  2023-11-13  8:36 ` [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs Vlastimil Babka
  9 siblings, 2 replies; 44+ messages in thread
From: chengming.zhou @ 2023-11-02  3:23 UTC (permalink / raw)
  To: vbabka, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, chengming.zhou, Chengming Zhou

From: Chengming Zhou <zhouchengming@bytedance.com>

The current updated scheme (which this series implemented) is:
 - node partial slabs: PG_Workingset && !frozen
 - cpu partial slabs: !PG_Workingset && !frozen
 - cpu slabs: !PG_Workingset && frozen
 - full slabs: !PG_Workingset && !frozen

The most important change is that "frozen" bit is not set for the
cpu partial slabs anymore, __slab_free() will grab node list_lock
then check by !PG_Workingset that it's not on a node partial list.

And the "frozen" bit is still kept for the cpu slabs for performance,
since we don't need to grab node list_lock to check whether the
PG_Workingset is set or not if the "frozen" bit is set in __slab_free().

Update related documentations and comments in the source.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
 mm/slub.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index c20bdf5dab0f..a307d319e82c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -76,13 +76,22 @@
  *
  *   Frozen slabs
  *
- *   If a slab is frozen then it is exempt from list management. It is not
- *   on any list except per cpu partial list. The processor that froze the
+ *   If a slab is frozen then it is exempt from list management. It is
+ *   the cpu slab which is actively allocated from by the processor that
+ *   froze it and it is not on any list. The processor that froze the
  *   slab is the one who can perform list operations on the slab. Other
  *   processors may put objects onto the freelist but the processor that
  *   froze the slab is the only one that can retrieve the objects from the
  *   slab's freelist.
  *
+ *   CPU partial slabs
+ *
+ *   The partially empty slabs cached on the CPU partial list are used
+ *   for performance reasons, which speeds up the allocation process.
+ *   These slabs are not frozen, but are also exempt from list management,
+ *   by clearing the PG_workingset flag when moving out of the node
+ *   partial list. Please see __slab_free() for more details.
+ *
  *   list_lock
  *
  *   The list_lock protects the partial and full list on each node and
@@ -2617,8 +2626,7 @@ static void put_partials_cpu(struct kmem_cache *s,
 }
 
 /*
- * Put a slab that was just frozen (in __slab_free|get_partial_node) into a
- * partial slab slot if available.
+ * Put a slab into a partial slab slot if available.
  *
  * If we did not find a slot then simply move all the partials to the
  * per node partial list.
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 9/9] slub: Update frozen slabs documentations in the source
  2023-11-02  3:23 ` [PATCH v5 9/9] slub: Update frozen slabs documentations in the source chengming.zhou
@ 2023-12-03  9:47   ` Hyeonggon Yoo
  2023-12-04 21:41   ` Christoph Lameter (Ampere)
  1 sibling, 0 replies; 44+ messages in thread
From: Hyeonggon Yoo @ 2023-12-03  9:47 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, cl, penberg, rientjes, iamjoonsoo.kim, akpm,
	roman.gushchin, linux-mm, linux-kernel, Chengming Zhou

On Thu, Nov 2, 2023 at 12:25 PM <chengming.zhou@linux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> The current updated scheme (which this series implemented) is:
>  - node partial slabs: PG_Workingset && !frozen
>  - cpu partial slabs: !PG_Workingset && !frozen
>  - cpu slabs: !PG_Workingset && frozen
>  - full slabs: !PG_Workingset && !frozen
>
> The most important change is that "frozen" bit is not set for the
> cpu partial slabs anymore, __slab_free() will grab node list_lock
> then check by !PG_Workingset that it's not on a node partial list.
>
> And the "frozen" bit is still kept for the cpu slabs for performance,
> since we don't need to grab node list_lock to check whether the
> PG_Workingset is set or not if the "frozen" bit is set in __slab_free().
>
> Update related documentations and comments in the source.
>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
>  mm/slub.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index c20bdf5dab0f..a307d319e82c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -76,13 +76,22 @@
>   *
>   *   Frozen slabs
>   *
> - *   If a slab is frozen then it is exempt from list management. It is not
> - *   on any list except per cpu partial list. The processor that froze the
> + *   If a slab is frozen then it is exempt from list management. It is
> + *   the cpu slab which is actively allocated from by the processor that
> + *   froze it and it is not on any list. The processor that froze the
>   *   slab is the one who can perform list operations on the slab. Other
>   *   processors may put objects onto the freelist but the processor that
>   *   froze the slab is the only one that can retrieve the objects from the
>   *   slab's freelist.
>   *
> + *   CPU partial slabs
> + *
> + *   The partially empty slabs cached on the CPU partial list are used
> + *   for performance reasons, which speeds up the allocation process.
> + *   These slabs are not frozen, but are also exempt from list management,
> + *   by clearing the PG_workingset flag when moving out of the node
> + *   partial list. Please see __slab_free() for more details.
> + *
>   *   list_lock
>   *
>   *   The list_lock protects the partial and full list on each node and
> @@ -2617,8 +2626,7 @@ static void put_partials_cpu(struct kmem_cache *s,
>  }
>
>  /*
> - * Put a slab that was just frozen (in __slab_free|get_partial_node) into a
> - * partial slab slot if available.
> + * Put a slab into a partial slab slot if available.
>   *
>   * If we did not find a slot then simply move all the partials to the
>   * per node partial list.
> --

Looks good to me,
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

Thanks!

> 2.20.1
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 9/9] slub: Update frozen slabs documentations in the source
  2023-11-02  3:23 ` [PATCH v5 9/9] slub: Update frozen slabs documentations in the source chengming.zhou
  2023-12-03  9:47   ` Hyeonggon Yoo
@ 2023-12-04 21:41   ` Christoph Lameter (Ampere)
  2023-12-05  6:06     ` Chengming Zhou
  1 sibling, 1 reply; 44+ messages in thread
From: Christoph Lameter (Ampere) @ 2023-12-04 21:41 UTC (permalink / raw)
  To: chengming.zhou
  Cc: vbabka, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou

On Thu, 2 Nov 2023, chengming.zhou@linux.dev wrote:

> From: Chengming Zhou <zhouchengming@bytedance.com>
>
> The current updated scheme (which this series implemented) is:
> - node partial slabs: PG_Workingset && !frozen
> - cpu partial slabs: !PG_Workingset && !frozen
> - cpu slabs: !PG_Workingset && frozen
> - full slabs: !PG_Workingset && !frozen

The above would be good to include in the comments.

Acked-by: Christoph Lameter (Ampere) <cl@linux.com>



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 9/9] slub: Update frozen slabs documentations in the source
  2023-12-04 21:41   ` Christoph Lameter (Ampere)
@ 2023-12-05  6:06     ` Chengming Zhou
  2023-12-05  9:39       ` Vlastimil Babka
  0 siblings, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2023-12-05  6:06 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: vbabka, penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou

On 2023/12/5 05:41, Christoph Lameter (Ampere) wrote:
> On Thu, 2 Nov 2023, chengming.zhou@linux.dev wrote:
> 
>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>
>> The current updated scheme (which this series implemented) is:
>> - node partial slabs: PG_Workingset && !frozen
>> - cpu partial slabs: !PG_Workingset && !frozen
>> - cpu slabs: !PG_Workingset && frozen
>> - full slabs: !PG_Workingset && !frozen
> 
> The above would be good to include in the comments.
> 
> Acked-by: Christoph Lameter (Ampere) <cl@linux.com>
> 

Thanks for your review and suggestion!

Maybe something like this:

diff --git a/mm/slub.c b/mm/slub.c
index 623c17a4cdd6..21f88bd9c16b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -93,6 +93,12 @@
  *   by clearing the PG_workingset flag when moving out of the node
  *   partial list. Please see __slab_free() for more details.
  *
+ *   To sum up, the current scheme is:
+ *   - node partial slab: PG_Workingset && !frozen
+ *   - cpu partial slab: !PG_Workingset && !frozen
+ *   - cpu slab: !PG_Workingset && frozen
+ *   - full slab: !PG_Workingset && !frozen
+ *
  *   list_lock
  *
  *   The list_lock protects the partial and full list on each node and


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 9/9] slub: Update frozen slabs documentations in the source
  2023-12-05  6:06     ` Chengming Zhou
@ 2023-12-05  9:39       ` Vlastimil Babka
  0 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2023-12-05  9:39 UTC (permalink / raw)
  To: Chengming Zhou, Christoph Lameter (Ampere)
  Cc: penberg, rientjes, iamjoonsoo.kim, akpm, roman.gushchin,
	42.hyeyoo, linux-mm, linux-kernel, Chengming Zhou

On 12/5/23 07:06, Chengming Zhou wrote:
> On 2023/12/5 05:41, Christoph Lameter (Ampere) wrote:
>> On Thu, 2 Nov 2023, chengming.zhou@linux.dev wrote:
>> 
>>> From: Chengming Zhou <zhouchengming@bytedance.com>
>>>
>>> The current updated scheme (which this series implemented) is:
>>> - node partial slabs: PG_Workingset && !frozen
>>> - cpu partial slabs: !PG_Workingset && !frozen
>>> - cpu slabs: !PG_Workingset && frozen
>>> - full slabs: !PG_Workingset && !frozen
>> 
>> The above would be good to include in the comments.
>> 
>> Acked-by: Christoph Lameter (Ampere) <cl@linux.com>
>> 
> 
> Thanks for your review and suggestion!
> 
> Maybe something like this:

Thanks, added.

> diff --git a/mm/slub.c b/mm/slub.c
> index 623c17a4cdd6..21f88bd9c16b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -93,6 +93,12 @@
>   *   by clearing the PG_workingset flag when moving out of the node
>   *   partial list. Please see __slab_free() for more details.
>   *
> + *   To sum up, the current scheme is:
> + *   - node partial slab: PG_Workingset && !frozen
> + *   - cpu partial slab: !PG_Workingset && !frozen
> + *   - cpu slab: !PG_Workingset && frozen
> + *   - full slab: !PG_Workingset && !frozen
> + *
>   *   list_lock
>   *
>   *   The list_lock protects the partial and full list on each node and



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs
  2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
                   ` (8 preceding siblings ...)
  2023-11-02  3:23 ` [PATCH v5 9/9] slub: Update frozen slabs documentations in the source chengming.zhou
@ 2023-11-13  8:36 ` Vlastimil Babka
  9 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2023-11-13  8:36 UTC (permalink / raw)
  To: chengming.zhou, cl, penberg
  Cc: rientjes, iamjoonsoo.kim, akpm, roman.gushchin, 42.hyeyoo,
	linux-mm, linux-kernel, Chengming Zhou

On 11/2/23 04:23, chengming.zhou@linux.dev wrote:
> From: Chengming Zhou <zhouchengming@bytedance.com>
> 
> Changes in v5:
>  - Drop "RFC".
>  - Retest to update performance numbers (little difference with RFC v1).
>  - Add Reviewed-by and Tested-by tags. Many thanks!
>  - Change to better function name: __put_partials().
>  - Some minor improvements of comments and changelog.
>  - RFC v4: https://lore.kernel.org/all/20231031140741.79387-1-chengming.zhou@linux.dev/

Thanks! Pushed to
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-6.8/partial-freezing

Vlastimil



^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2023-12-05  9:39 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-02  3:23 [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs chengming.zhou
2023-11-02  3:23 ` [PATCH v5 1/9] slub: Reflow ___slab_alloc() chengming.zhou
2023-11-22  0:26   ` Hyeonggon Yoo
2023-11-02  3:23 ` [PATCH v5 2/9] slub: Change get_partial() interfaces to return slab chengming.zhou
2023-11-22  1:09   ` Hyeonggon Yoo
2023-11-02  3:23 ` [PATCH v5 3/9] slub: Keep track of whether slub is on the per-node partial list chengming.zhou
2023-11-22  1:21   ` Hyeonggon Yoo
2023-11-02  3:23 ` [PATCH v5 4/9] slub: Prepare __slab_free() for unfrozen partial slab out of node " chengming.zhou
2023-12-03  6:01   ` Hyeonggon Yoo
2023-11-02  3:23 ` [PATCH v5 5/9] slub: Introduce freeze_slab() chengming.zhou
2023-11-02  3:23 ` [PATCH v5 6/9] slub: Delay freezing of partial slabs chengming.zhou
2023-11-14  5:44   ` kernel test robot
2023-11-20 18:49   ` Mark Brown
2023-11-21  0:58     ` Chengming Zhou
2023-11-21  1:29       ` Mark Brown
2023-11-21 15:47         ` Chengming Zhou
2023-11-21 18:21           ` Mark Brown
2023-11-22  8:52             ` Vlastimil Babka
2023-11-22  9:37     ` Vlastimil Babka
2023-11-22 11:27       ` Mark Brown
2023-11-22 11:35       ` Chengming Zhou
2023-11-22 11:40         ` Vlastimil Babka
2023-11-22 11:54           ` Chengming Zhou
2023-11-22 13:19             ` Vlastimil Babka
2023-11-22 14:28               ` Chengming Zhou
2023-11-22 14:32                 ` Vlastimil Babka
2023-12-03  6:53   ` Hyeonggon Yoo
2023-12-03 10:15     ` Chengming Zhou
2023-12-04 16:58       ` Vlastimil Babka
2023-11-02  3:23 ` [PATCH v5 7/9] slub: Optimize deactivate_slab() chengming.zhou
2023-12-03  9:23   ` Hyeonggon Yoo
2023-12-03 10:26     ` Chengming Zhou
2023-12-03 11:19       ` Hyeonggon Yoo
2023-12-03 11:47         ` Chengming Zhou
2023-12-04 17:55     ` Vlastimil Babka
2023-12-05  0:20       ` Hyeonggon Yoo
2023-11-02  3:23 ` [PATCH v5 8/9] slub: Rename all *unfreeze_partials* functions to *put_partials* chengming.zhou
2023-12-03  9:27   ` Hyeonggon Yoo
2023-11-02  3:23 ` [PATCH v5 9/9] slub: Update frozen slabs documentations in the source chengming.zhou
2023-12-03  9:47   ` Hyeonggon Yoo
2023-12-04 21:41   ` Christoph Lameter (Ampere)
2023-12-05  6:06     ` Chengming Zhou
2023-12-05  9:39       ` Vlastimil Babka
2023-11-13  8:36 ` [PATCH v5 0/9] slub: Delay freezing of CPU partial slabs Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).