All of lore.kernel.org
 help / color / mirror / Atom feed
From: "teawater" <hui.zhu@linux.dev>
To: "Harry Yoo (Oracle)" <harry@kernel.org>,
	"Shakeel Butt" <shakeel.butt@linux.dev>
Cc: "Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, "Hui Zhu" <zhuhui@kylinos.cn>,
	"Vlastimil Babka" <vbabka@kernel.org>,
	"Hao Li" <hao.li@linux.dev>
Subject: Re: [PATCH mm-stable v3] mm/memcontrol: batch memcg charging in __memcg_slab_post_alloc_hook
Date: Wed, 22 Apr 2026 09:00:01 +0000	[thread overview]
Message-ID: <9871d2cd927f7410e95ddc77ece8b9d00ed5b787@linux.dev> (raw)
In-Reply-To: <acv5QCe0qMUUW2xP@hyeyoo>

> 
> On Tue, Mar 31, 2026 at 08:32:30AM -0700, Shakeel Butt wrote:
> 
> > 
> > On Tue, Mar 31, 2026 at 05:17:07PM +0800, Hui Zhu wrote:
> >  From: Hui Zhu <zhuhui@kylinos.cn>
> >  
> >  When kmem_cache_alloc_bulk() allocates multiple objects, the post-alloc
> >  hook __memcg_slab_post_alloc_hook() previously charged memcg one object
> >  at a time, even though consecutive objects may reside on slabs backed by
> >  the same pgdat node.
> >  
> >  Batch the memcg charging by scanning ahead from the current position to
> >  find a contiguous run of objects whose slabs share the same pgdat, then
> >  issue a single __obj_cgroup_charge() / __consume_obj_stock() call for
> >  the entire run. The per-object obj_ext assignment loop is preserved as-is
> >  since it cannot be further collapsed.
> >  
> >  This implements the TODO comment left in commit bc730030f956 ("memcg:
> >  combine slab obj stock charging and accounting").
> >  
> >  The existing error-recovery contract is unchanged: if size == 1 then
> >  memcg_alloc_abort_single() will free the sole object, and for larger
> >  bulk allocations kmem_cache_free_bulk() will uncharge any objects that
> >  were already charged before the failure.
> >  
> >  Benchmark using kmem_cache_alloc_bulk() with SLAB_ACCOUNT
> >  (iters=100000):
> >  
> >  bulk=32 before: 215 ns/object after: 174 ns/object (-19%)
> >  bulk=1 before: 344 ns/object after: 335 ns/object ( ~)
> >  
> >  No measurable regression for bulk=1, as expected.
> >  
> >  Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
> >  
> >  Do we have an actual user of kmem_cache_alloc_bulk(GFP_ACCOUNT) in kernel?
> > 
> Apparently we have a SLAB_ACCOUNT user in io_uring.c.
> (perhaps it's the only user?)
> 
> > 
> > If yes, can you please benchmark that usage? Otherwise can we please wait for
> >  an actual user before adding more complexity? Or you can look for opportunities
> >  for kmem_cache_alloc_bulk(GFP_ACCOUNT) users and add the optimization along with
> >  the user.
> > 
> Good point. I was also wondering what are use cases benefiting
> from this beyond the microbenchmark.
> 
> > 
> > Have you looked at the bulk free side? I think we already have rcu freeing in
> >  bulk as a user. Did you find any opportunities in optimizing the
> >  __memcg_slab_free_hook() from bulk free?
> > 
> Probably a bit out of scope but one thing to note on slab side:
> kfree_bulk() (called by kfree_rcu batching) doesn't specify slab cache,
> and it builds a detached freelist which contains objects from the same slab.
> 
> On the other hand kmem_cache_free_bulk() with non-NULL slab cache
> simply calls free_to_pcs_bulk() and it passes objects one by one to
> __memcg_slab_free_hook() since objects may not come from the same slab.
> 
> Now that we have sheaves enabled for (almost) all slab caches, it might
> be worth revisiting - e.g. sort objects by slab cache and
> pass them to free_to_pcs_bulk() instead of building a detached freelist.
> 
> And let __memcg_slab_free_hook() handle objects from the same cache but
> from different slabs.
> 
> -- 
> Cheers,
> Harry / Hyeonggon
>


Hi Shakeel and Harry,

I ran a couple of benchmarks against the patch and wanted to share
the results.

The first test exercises the __io_alloc_req_refill bulk-refill path
directly. It submits POLL_ADD requests against a pipe fd that never
becomes readable, so requests accumulate in the poll wait queue and
force repeated refills at high throughput. With the patch applied,
elapsed time dropped by 8.7% — a clear win for that code path.

However, the second test focuses on single-object allocation speed
under the same ring setup. There, the patch actually regressed
performance by 5.7%.

I also tried two targeted mitigations to recover that regression:

  1. Replacing `likely` with `unlikely` in the relevant branch.
  2. Replacing `check_mul_overflow` with a simpler bounds check:
       size <= (size_t)(INT_MAX - PAGE_SIZE) /
              (KMALLOC_MAX_SIZE + sizeof(struct obj_cgroup *))

Neither approach recovered the single-allocation loss in a
meaningful way.

Given that only the __io_alloc_req_refill call path benefits from
this patch while the common single-allocation path takes a step
back, the trade-off doesn't seem worthwhile at this point. I'd
suggest we hold off on merging until we find an approach that
improves — or at least doesn't hurt — the general case.

Happy to discuss further or run additional benchmarks if that
would help. The two test programs I used are included at the
bottom of this email.

Best,
Hui

#define _GNU_SOURCE
#include <liburing.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <time.h>
#include <unistd.h>

#define QD		4096	/* SQ depth per ring */
#define BURST		2048	/* SQEs submitted per round; refills ≈ BURST/8 */
#define RING_RECYCLE	32	/* rounds before recycling the ring */

/*
 * Default total number of submissions.  Can be overridden via argv[1].
 * The loop exits as soon as the cumulative submitted count reaches this value.
 */
#define DEFAULT_TOTAL	(1UL << 24)

int main(int argc, char **argv)
{
	unsigned long target = argc > 1 ? strtoul(argv[1], NULL, 0) : DEFAULT_TOTAL;
	unsigned long submitted = 0;

	/* Raise nofile/memlock limits; poll requests are heavy on fd table and slab */
	struct rlimit rl = { .rlim_cur = 1 << 20, .rlim_max = 1 << 20 };
	setrlimit(RLIMIT_NOFILE, &rl);
	setrlimit(RLIMIT_MEMLOCK, &rl);

	printf("target=%lu QD=%d burst=%d ring_recycle=%d\n",
	       target, QD, BURST, RING_RECYCLE);

	/*
	 * A pipe whose read end will never become readable.
	 * POLL_ADD(POLLIN) requests submitted against pfd[0] will hang
	 * indefinitely in the poll wait queue without producing a CQE,
	 * which is exactly what exercises the refill path at high rate.
	 */
	int pfd[2];
	if (pipe(pfd) < 0) {
		perror("pipe");
		return 1;
	}

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);

	while (submitted < target) {
		struct io_uring ring;
		struct io_uring_params pr = { 0 };

		/*
		 * No SQPOLL: submissions go through io_submit_sqes(), which is
		 * the code path where refill is invoked.
		 */
		if (io_uring_queue_init_params(QD, &ring, &pr) < 0) {
			perror("io_uring_queue_init_params");
			break;
		}

		for (int round = 0; round < RING_RECYCLE && submitted < target; round++) {
			int prepared = 0;

			for (int i = 0; i < BURST; i++) {
				struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

				if (!sqe)
					break;

				/*
				 * POLL_ADD on an fd that never fires: the request
				 * is parked on the poll wait queue and does not
				 * return to the free list until ring exit.
				 */
				io_uring_prep_poll_add(sqe, pfd[0], POLL_IN);
				sqe->user_data = i;
				prepared++;
			}

			if (!prepared)
				break;

			int r = io_uring_submit(&ring);

			if (r < 0)
				break;

			submitted += r;
		}

		/*
		 * Destroy the ring periodically so that the io_kiocb objects
		 * accumulated in nr_req_allocated are returned to req_cachep.
		 * ring_exit() drains all pending poll requests; once the
		 * percpu_ref reaches zero the slab objects are released in
		 * bulk, preventing unbounded memory growth.
		 */
		io_uring_queue_exit(&ring);
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(pfd[0]);
	close(pfd[1]);

	double dt = (t1.tv_sec - t0.tv_sec) +
		    (t1.tv_nsec - t0.tv_nsec) / 1e9;

	printf("submitted=%lu  refills=%lu  elapsed=%.3fs  (%.2f Mrefill/s)\n",
	       submitted, submitted / 8,
	       dt, (submitted / 8.0) / dt / 1e6);

	return 0;
}


#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/ktime.h>
#include <linux/mm.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Hui Zhu");
MODULE_DESCRIPTION("Benchmark for kmem_cache_alloc_bulk with memcg accounting");

/* Default number of iterations */
static int iters = 100000;
module_param(iters, int, 0444);
MODULE_PARM_DESC(iters, "Number of iterations");

/*
 * Default bulk size. Set to 32 or 64 to evaluate
 * the effect of bulk allocation optimizations.
 */
static int bulk_size = 32;
module_param(bulk_size, int, 0444);
MODULE_PARM_DESC(bulk_size, "Number of objects per bulk allocation");

#define OBJ_SIZE 256

static int __init bench_init(void)
{
	struct kmem_cache *cache;
	void **objs;
	int i;
	u64 start, end, delta;
	int ret = 0;

	pr_info("Benchmarking kmem_cache_alloc_bulk with SLAB_ACCOUNT...\n");

	/*
	 * Create the cache with SLAB_ACCOUNT so that every allocation
	 * from it triggers the memcg accounting hooks, specifically
	 * __memcg_slab_post_alloc_hook.
	 */
	cache = kmem_cache_create("bench_memcg_cache", OBJ_SIZE, 0,
				  SLAB_ACCOUNT, NULL);
	if (!cache) {
		pr_err("Failed to create cache\n");
		return -ENOMEM;
	}

	/* Allocate the pointer array to hold bulk-allocated objects */
	objs = kmalloc_array(bulk_size, sizeof(void *), GFP_KERNEL);
	if (!objs) {
		pr_err("Failed to allocate pointer array\n");
		kmem_cache_destroy(cache);
		return -ENOMEM;
	}

	/* Warm up once to avoid cold-start overhead on the first run */
	ret = kmem_cache_alloc_bulk(cache, GFP_KERNEL, bulk_size, objs);
	if (ret)
		kmem_cache_free_bulk(cache, ret, objs);

	/* Start timing */
	start = ktime_get_ns();

	for (i = 0; i < iters; i++) {
		/* Core measurement: bulk allocation */
		ret = kmem_cache_alloc_bulk(cache, GFP_KERNEL, bulk_size, objs);
		if (unlikely(!ret)) {
			pr_err("Allocation failed at iteration %d\n", i);
			break;
		}
		/*
		 * Free immediately; we only care about the performance
		 * of the allocation-path hooks.
		 */
		kmem_cache_free_bulk(cache, ret, objs);
	}

	end = ktime_get_ns();

	delta = end - start;
	pr_info("Benchmark Result (iters=%d, bulk=%d):\n", iters, bulk_size);
	pr_info("  Total Time:              %llu ns\n", delta);
	pr_info("  Avg Time per Iteration:  %llu ns\n", delta / iters);
	pr_info("  Avg Time per Object:     %llu ns\n",
		delta / (iters * bulk_size));

	/* Release resources */
	kfree(objs);
	kmem_cache_destroy(cache);

	/*
	 * Return -EAGAIN to prevent the module from being fully loaded.
	 * insmod will report an error and exit, but the benchmark results
	 * are already recorded in dmesg, so no manual rmmod is needed.
	 */
	return -EAGAIN;
}

static void __exit bench_exit(void)
{
}

module_init(bench_init);
module_exit(bench_exit);

      parent reply	other threads:[~2026-04-22  9:00 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-31  9:17 [PATCH mm-stable v3] mm/memcontrol: batch memcg charging in __memcg_slab_post_alloc_hook Hui Zhu
2026-03-31 11:48 ` Harry Yoo (Oracle)
2026-03-31 15:32 ` Shakeel Butt
2026-03-31 16:41   ` Harry Yoo (Oracle)
2026-04-01 12:26     ` teawater
2026-04-22  9:00     ` teawater [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9871d2cd927f7410e95ddc77ece8b9d00ed5b787@linux.dev \
    --to=hui.zhu@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=hao.li@linux.dev \
    --cc=harry@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=vbabka@kernel.org \
    --cc=zhuhui@kylinos.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.