All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Vlastimil Babka <vbabka@suse.cz>,
	Jakub Kicinski <kuba@kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	Soheil Hassas Yeganeh <soheil@google.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Meta kernel team <kernel-team@meta.com>
Subject: Re: [PATCH] memcg: multi-memcg percpu charge cache
Date: Tue, 29 Apr 2025 14:13:16 +0200	[thread overview]
Message-ID: <aBDCXB_Tb2Iaihua@tiehlicka> (raw)
In-Reply-To: <20250416180229.2902751-1-shakeel.butt@linux.dev>

On Wed 16-04-25 11:02:29, Shakeel Butt wrote:
> Memory cgroup accounting is expensive and to reduce the cost, the kernel
> maintains per-cpu charge cache for a single memcg. So, if a charge
> request comes for a different memcg, the kernel will flush the old
> memcg's charge cache and then charge the newer memcg a fixed amount (64
> pages), subtracts the charge request amount and stores the remaining in
> the per-cpu charge cache for the newer memcg.
> 
> This mechanism is based on the assumption that the kernel, for locality,
> keep a process on a CPU for long period of time and most of the charge
> requests from that process will be served by that CPU's local charge
> cache.
> 
> However this assumption breaks down for incoming network traffic in a
> multi-tenant machine. We are in the process of running multiple
> workloads on a single machine and if such workloads are network heavy,
> we are seeing very high network memory accounting cost. We have observed
> multiple CPUs spending almost 100% of their time in net_rx_action and
> almost all of that time is spent in memcg accounting of the network
> traffic.
> 
> More precisely, net_rx_action is serving packets from multiple workloads
> and is observing/serving mix of packets of these workloads. The memcg
> switch of per-cpu cache is very expensive and we are observing a lot of
> memcg switches on the machine. Almost all the time is being spent on
> charging new memcg and flushing older memcg cache. So, definitely we
> need per-cpu cache that support multiple memcgs for this scenario.
> 
> This patch implements a simple (and dumb) multiple memcg percpu charge
> cache. Actually we started with more sophisticated LRU based approach but
> the dumb one was always better than the sophisticated one by 1% to 3%,
> so going with the simple approach.

Makes sense to start simple and go for a more sophisticated (has table
appraoch maybe) later when a clear gain could be demonstrated.

> Some of the design choices are:
> 
> 1. Fit all caches memcgs in a single cacheline.

Could you be more specific about the reasoning? I suspect it is for the
network receive path you are mentioning above, right?

> 2. The cache array can be mix of empty slots or memcg charged slots, so
>    the kernel has to traverse the full array.
> 3. The cache drain from the reclaim will drain all cached memcgs to keep
>    things simple.
> 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload where each netperf client runs in a different
> cgroup. The next-20250415 kernel is used as base.
> 
>  $ netserver -6
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> number of clients | Without patch | With patch
>   6               | 42584.1 Mbps  | 48603.4 Mbps (14.13% improvement)
>   12              | 30617.1 Mbps  | 47919.7 Mbps (56.51% improvement)
>   18              | 25305.2 Mbps  | 45497.3 Mbps (79.79% improvement)
>   24              | 20104.1 Mbps  | 37907.7 Mbps (88.55% improvement)
>   30              | 14702.4 Mbps  | 30746.5 Mbps (109.12% improvement)
>   36              | 10801.5 Mbps  | 26476.3 Mbps (145.11% improvement)
> 
> The results show drastic improvement for network intensive workloads.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Just a minor suggestion below. Other than that looks good to me (with
follow up fixes) in this thread.
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!

> ---
>  mm/memcontrol.c | 128 ++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 91 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1ad326e871c1..0a02ba07561e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1769,10 +1769,11 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
>  	pr_cont(" are going to be killed due to memory.oom.group set\n");
>  }
>  

/* Make sure nr_pages and cached fit into a single cache line */
> +#define NR_MEMCG_STOCK 7
>  struct memcg_stock_pcp {
>  	local_trylock_t stock_lock;
> -	struct mem_cgroup *cached; /* this never be root cgroup */
> -	unsigned int nr_pages;
> +	uint8_t nr_pages[NR_MEMCG_STOCK];
> +	struct mem_cgroup *cached[NR_MEMCG_STOCK];
>  
>  	struct obj_cgroup *cached_objcg;
>  	struct pglist_data *cached_pgdat;
[...]
-- 
Michal Hocko
SUSE Labs

  parent reply	other threads:[~2025-04-29 12:13 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-16 18:02 [PATCH] memcg: multi-memcg percpu charge cache Shakeel Butt
2025-04-23  1:10 ` Jakub Kicinski
2025-04-23 22:16   ` Shakeel Butt
2025-04-23 22:30     ` Jakub Kicinski
2025-04-23 22:59       ` Shakeel Butt
2025-04-23 23:14 ` Shakeel Butt
2025-04-25 20:18 ` Shakeel Butt
2025-04-29  9:40   ` Hugh Dickins
2025-04-29 14:50     ` Shakeel Butt
2025-04-30 10:05   ` Vlastimil Babka
2025-04-30 15:16     ` Shakeel Butt
2025-04-29 12:13 ` Michal Hocko [this message]
2025-04-29 18:43   ` Shakeel Butt
2025-04-30  6:48     ` Michal Hocko
2025-04-30  9:57 ` Vlastimil Babka
2025-04-30 15:05   ` Shakeel Butt
2025-04-30 15:32 ` Shakeel Butt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aBDCXB_Tb2Iaihua@tiehlicka \
    --to=mhocko@suse.com \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=edumazet@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=muchun.song@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=soheil@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.