Re: [PATCH mm-stable v3] mm/memcontrol: batch memcg charging in __memcg_slab_post_alloc_hook

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Harry Yoo (Oracle)" <harry@kernel.org>
To: Hui Zhu <hui.zhu@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Hui Zhu <zhuhui@kylinos.cn>
Subject: Re: [PATCH mm-stable v3] mm/memcontrol: batch memcg charging in __memcg_slab_post_alloc_hook
Date: Tue, 31 Mar 2026 20:48:40 +0900	[thread overview]
Message-ID: <acu0mGYGI6K_yTL1@hyeyoo> (raw)
In-Reply-To: <20260331091707.226786-1-hui.zhu@linux.dev>

On Tue, Mar 31, 2026 at 05:17:07PM +0800, Hui Zhu wrote:
> From: Hui Zhu <zhuhui@kylinos.cn>
> 
> When kmem_cache_alloc_bulk() allocates multiple objects, the post-alloc
> hook __memcg_slab_post_alloc_hook() previously charged memcg one object
> at a time, even though consecutive objects may reside on slabs backed by
> the same pgdat node.
> 
> Batch the memcg charging by scanning ahead from the current position to
> find a contiguous run of objects whose slabs share the same pgdat, then
> issue a single __obj_cgroup_charge() / __consume_obj_stock() call for
> the entire run. The per-object obj_ext assignment loop is preserved as-is
> since it cannot be further collapsed.
> 
> This implements the TODO comment left in commit bc730030f956 ("memcg:
> combine slab obj stock charging and accounting").
> 
> The existing error-recovery contract is unchanged: if size == 1 then
> memcg_alloc_abort_single() will free the sole object, and for larger
> bulk allocations kmem_cache_free_bulk() will uncharge any objects that
> were already charged before the failure.
> 
> Benchmark using kmem_cache_alloc_bulk() with SLAB_ACCOUNT
> (iters=100000):
> 
>   bulk=32  before: 215 ns/object   after: 174 ns/object  (-19%)
>   bulk=1   before: 344 ns/object   after: 335 ns/object  (  ~)
> 
> No measurable regression for bulk=1, as expected.
> 
> Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
> ---
> 
> Changelog:
> v3:
> Update base from "mm-unstable" to "mm-stable".
> v2:
> According to the comments in [1], add code to handle the integer
> overflow issue.
> 
> [1] https://sashiko.dev/#/patchset/20260316084839.1342163-1-hui.zhu%40linux.dev
> 
>  mm/memcontrol.c | 77 +++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 58 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 051b82ebf371..3159bf39e060 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3277,51 +3277,90 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  			return false;
>  	}
>  
> -	for (i = 0; i < size; i++) {
> +	for (i = 0; i < size; ) {
>  		unsigned long obj_exts;
>  		struct slabobj_ext *obj_ext;
>  		struct obj_stock_pcp *stock;
> +		struct pglist_data *pgdat;
> +		int batch_bytes;
> +		size_t run_len = 0;

Let's initialize it to 1 to simplify the code. And perhaps the variable
could be renamed `batch_count` to align with `batch_bytes`.

> +		size_t j;
> +		size_t max_size;
> +		bool skip_next = false;
>  
>  		slab = virt_to_slab(p[i]);
>  
>  		if (!slab_obj_exts(slab) &&
>  		    alloc_slab_obj_exts(slab, s, flags, false)) {
> +			i++;
>  			continue;
>  		}
>  
> +		pgdat = slab_pgdat(slab);
> +		run_len = 1;

Hmm allocating 2GiB of memory at one kmem_cache_alloc_bulk() call sounds
already crazy, but if we have to check overflow anyway...

Could we simply skip the optimization
if check_mul_overflow(obj_size, size, &batch_bytes) returns nonzero?
It should be extremely unlikely to trigger anyway.

> +		/*
> +		 * The value of batch_bytes must not exceed
> +		 * (INT_MAX - PAGE_SIZE) to prevent integer overflow in

Since the vmstat data is cached at least once before it's flushed,
it isn't accurate to assume that stock->nr_slab_{un,}reclaimable will be
<= PAGE_SIZE.

So I think it's safe for `batch_bytes` to be INT_MAX but
__account_obj_stock() should check if it'll overflow before adding the
counter?

(but again that sounds quite theoretical and unlikely to hit in practice)

> +		 * the final accumulation performed by __account_obj_stock().
> +		 */
> +		max_size = min((size_t)((INT_MAX - PAGE_SIZE) / obj_size),
> +			       size);
> +
> +		for (j = i + 1; j < max_size; j++) {
> +			struct slab *slab_j = virt_to_slab(p[j]);
> +
> +			if (slab_pgdat(slab_j) != pgdat)
> +				break;
> +
> +			if (!slab_obj_exts(slab_j) &&
> +			    alloc_slab_obj_exts(slab_j, s, flags, false)) {
> +				skip_next = true;

Let's not micro-optimize it and drop skip_next.
In the next iteration of the outer loop it will be skipped anyway and
it's rare for alloc_slab_obj_exts() to fail.

> +				break;
> +			}
> +
> +			run_len++;
> +		}
> +
>  		/*
> -		 * if we fail and size is 1, memcg_alloc_abort_single() will
> +		 * If we fail and size is 1, memcg_alloc_abort_single() will
>  		 * just free the object, which is ok as we have not assigned
> -		 * objcg to its obj_ext yet
> -		 *
> -		 * for larger sizes, kmem_cache_free_bulk() will uncharge
> -		 * any objects that were already charged and obj_ext assigned
> +		 * objcg to its obj_ext yet.
>  		 *
> -		 * TODO: we could batch this until slab_pgdat(slab) changes
> -		 * between iterations, with a more complicated undo
> +		 * For larger sizes, kmem_cache_free_bulk() will uncharge
> +		 * any objects that were already charged and obj_ext assigned.
>  		 */
> +		batch_bytes = obj_size * run_len;
>  		stock = trylock_stock();
> -		if (!stock || !__consume_obj_stock(objcg, stock, obj_size)) {
> +		if (!stock || !__consume_obj_stock(objcg, stock, batch_bytes)) {
>  			size_t remainder;
>  
>  			unlock_stock(stock);
> -			if (__obj_cgroup_charge(objcg, flags, obj_size, &remainder))
> +			if (__obj_cgroup_charge(objcg, flags, batch_bytes, &remainder))
>  				return false;
>  			stock = trylock_stock();
>  			if (remainder)
>  				__refill_obj_stock(objcg, stock, remainder, false);
>  		}
> -		__account_obj_stock(objcg, stock, obj_size,
> -				    slab_pgdat(slab), cache_vmstat_idx(s));
> +		__account_obj_stock(objcg, stock, batch_bytes,
> +				    pgdat, cache_vmstat_idx(s));
>  		unlock_stock(stock);
>  
> -		obj_exts = slab_obj_exts(slab);
> -		get_slab_obj_exts(obj_exts);
> -		off = obj_to_index(s, slab, p[i]);
> -		obj_ext = slab_obj_ext(slab, obj_exts, off);
> -		obj_cgroup_get(objcg);
> -		obj_ext->objcg = objcg;
> -		put_slab_obj_exts(obj_exts);
> +		for (j = 0; j < run_len; j++) {
> +			slab = virt_to_slab(p[i + j]);
> +			obj_exts = slab_obj_exts(slab);
> +			get_slab_obj_exts(obj_exts);
> +			off = obj_to_index(s, slab, p[i + j]);
> +			obj_ext = slab_obj_ext(slab, obj_exts, off);

> +			obj_cgroup_get(objcg);

This could be batched by calling:
		obj_cgroup_get_many(objcg, batch_count);

> +			obj_ext->objcg = objcg;
> +			put_slab_obj_exts(obj_exts);
> +		}
> +
> +		if (skip_next)
> +			i = i + run_len + 1;
> +		else
> +			i += run_len;

With the suggestion above this could be `i += batch_count;`

>  	}
>  
>  	return true;

To sum up... something like this?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3d98ab41f1f..3252252ea9c3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3184,8 +3184,12 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 		*bytes = nr;
 		nr = 0;
 	} else {
-		*bytes += nr;
-		if (abs(*bytes) > PAGE_SIZE) {
+		int old = *bytes;
+
+		if (unlikely(check_add_overflow(old, nr, bytes))) {
+			*bytes = nr;
+			nr = old;
+		} else if (abs(*bytes) > PAGE_SIZE) {
 			nr = *bytes;
 			*bytes = 0;
 		} else {
@@ -3422,6 +3426,8 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 	struct slab *slab;
 	unsigned long off;
 	size_t i;
+	int batch_bytes;
+	bool skip_batching = false;

 	/*
 	 * The obtained objcg pointer is safe to use within the current scope,
@@ -3455,51 +3461,77 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 			return false;
 	}

-	for (i = 0; i < size; i++) {
+	if (check_mul_overflow(obj_size, size, &batch_bytes))
+		skip_batching = true;
+
+	for (i = 0; i < size; ) {
 		unsigned long obj_exts;
 		struct slabobj_ext *obj_ext;
 		struct obj_stock_pcp *stock;
+		struct pglist_data *pgdat;
+		size_t batch_count = 1;
+		size_t j;

 		slab = virt_to_slab(p[i]);
-
 		if (!slab_obj_exts(slab) &&
 		    alloc_slab_obj_exts(slab, s, flags, false)) {
+			i++;
 			continue;
 		}

+		pgdat = slab_pgdat(slab);
+
+		if (likely(!skip_batching)) {
+			for (j = i + 1; j < size; j++) {
+				struct slab *slab_j = virt_to_slab(p[j]);
+
+				if (slab_pgdat(slab_j) != pgdat)
+					break;
+
+				if (!slab_obj_exts(slab_j) &&
+				    alloc_slab_obj_exts(slab_j, s, flags, false))
+					break;
+
+				batch_count++;
+			}
+		}
+
 		/*
-		 * if we fail and size is 1, memcg_alloc_abort_single() will
+		 * If we fail and size is 1, memcg_alloc_abort_single() will
 		 * just free the object, which is ok as we have not assigned
-		 * objcg to its obj_ext yet
-		 *
-		 * for larger sizes, kmem_cache_free_bulk() will uncharge
-		 * any objects that were already charged and obj_ext assigned
+		 * objcg to its obj_ext yet.
 		 *
-		 * TODO: we could batch this until slab_pgdat(slab) changes
-		 * between iterations, with a more complicated undo
+		 * For larger sizes, kmem_cache_free_bulk() will uncharge
+		 * any objects that were already charged and obj_ext assigned.
 		 */
+		batch_bytes = obj_size * batch_count;
 		stock = trylock_stock();
-		if (!stock || !__consume_obj_stock(objcg, stock, obj_size)) {
+		if (!stock || !__consume_obj_stock(objcg, stock, batch_bytes)) {
 			size_t remainder;

 			unlock_stock(stock);
-			if (__obj_cgroup_charge(objcg, flags, obj_size, &remainder))
+			if (__obj_cgroup_charge(objcg, flags, batch_bytes, &remainder))
 				return false;
 			stock = trylock_stock();
 			if (remainder)
 				__refill_obj_stock(objcg, stock, remainder, false);
 		}
-		__account_obj_stock(objcg, stock, obj_size,
-				    slab_pgdat(slab), cache_vmstat_idx(s));
+		__account_obj_stock(objcg, stock, batch_bytes,
+				    pgdat, cache_vmstat_idx(s));
 		unlock_stock(stock);

-		obj_exts = slab_obj_exts(slab);
-		get_slab_obj_exts(obj_exts);
-		off = obj_to_index(s, slab, p[i]);
-		obj_ext = slab_obj_ext(slab, obj_exts, off);
-		obj_cgroup_get(objcg);
-		obj_ext->objcg = objcg;
-		put_slab_obj_exts(obj_exts);
+		obj_cgroup_get_many(objcg, batch_count);
+		for (j = 0; j < batch_count; j++) {
+			slab = virt_to_slab(p[i + j]);
+			obj_exts = slab_obj_exts(slab);
+			get_slab_obj_exts(obj_exts);
+			off = obj_to_index(s, slab, p[i + j]);
+			obj_ext = slab_obj_ext(slab, obj_exts, off);
+			obj_ext->objcg = objcg;
+			put_slab_obj_exts(obj_exts);
+		}
+
+		i += batch_count;
 	}

 	return true;
--
2.43.0

-- 
Cheers,
Harry / Hyeonggon

next prev parent reply	other threads:[~2026-03-31 11:48 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-31  9:17 [PATCH mm-stable v3] mm/memcontrol: batch memcg charging in __memcg_slab_post_alloc_hook Hui Zhu
2026-03-31 11:48 ` Harry Yoo (Oracle) [this message]
2026-03-31 15:32 ` Shakeel Butt
2026-03-31 16:41   ` Harry Yoo (Oracle)
2026-04-01 12:26     ` teawater
2026-04-22  9:00     ` teawater

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:c3d98ab41f1 dfblob:3252252ea9c )
 OR (
bs:"Re: [PATCH mm-stable v3] mm/memcontrol: batch memcg charging in __memcg_slab_post_alloc_hook" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=acu0mGYGI6K_yTL1@hyeyoo \
    --to=harry@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=hui.zhu@linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=zhuhui@kylinos.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.