From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 30 Apr 2024 01:41:38 -0700
Subject: Re: [PATCH v3 4/8] memcg: reduce memory for the lruvec and memcg stats
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
    Muchun Song, T. J. Mercier, kernel-team@meta.com, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
References: <20240430060612.2171650-1-shakeel.butt@linux.dev>
    <20240430060612.2171650-5-shakeel.butt@linux.dev>
In-Reply-To: <20240430060612.2171650-5-shakeel.butt@linux.dev>
Mercier" , kernel-team@meta.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Stat-Signature: mfocc6ss7n5mzo1onaf8joxybsjzrk6m X-Rspam-User: X-Rspamd-Queue-Id: 6BE604001B X-Rspamd-Server: rspam05 X-HE-Tag: 1714466536-466711 X-HE-Meta: U2FsdGVkX1+YQLGp+L5PXY0UZH6SKwMzPAvrgmKxOQLAVRwzjn1rlKcjUXXJRzfJqcotBfxQ7mtubu9kdBm8dvtmMQhztyXf2TLQTxXUnwBKoLmuipwi3C3geY7hk9/s45o1VHiJxBdtFFjF3D2bj7ZRl5idhxeg2xQwGhJlN+jzekO3Y4NPZOlRpAZjTlnFGLyDrdAZm+jDm/MWH+JVv/xrYZPLpvPIgprSQRVUPIOzoJ0vXPNIcA8IvTnT+u1W+g08qpCUDPmtX6C0+FejA94AuMjMHnKIAJ20jg/QL5pitsLGqqcVuTpPdEELBnpjK2hHhzePlzCC8vrFZiKo27MeV+f83jJr52BHfwX2SZDA1b6cPxg/T0Ysp5ePJUqkP9zkiVYvn7jA4A3WRRV7Xo+dKiEl+1Rs9sAAWM9IFPrjhsdBQQmeBxcXF+Wjfw1ohpVpCVwrtLTip458bQI1SdniykGFnYV/7CIl5f+36KrC9NyvfNlArAAl+wVprj57dMnEEJq3eRIv0GOhF9xjSXnz2xSzM5XlMznOY+oRA9IUjc1AwwvdWCh+BlQUQZiC+CiMoEA4/bhUcnXCzzCvj7N6sm5hvL6pK3aMAbwvKrpIkqZ7uopnBrSgrNBLOWrAi3qsg+1XpXLb4VxZJ8+o7O7mXJLMKI2PV94HyyuvsNbr3xMRme1EpnbIypyVDnNpHEuhpoffQLpEzTKEu3WxsHB5bD6gzJ670LpzLLwQPg4CesaRDzvYV/ZbAfR8q9HZL/viMRJvSBiArkcr/8hvGf+AdtyyGuQUhYPU+4Y19Zn2JqwL8l/hcwYqsluDpvuP5wR9Ay9UbOomaBtZi88vPVUbfjH/us224652DRrQ8IMRRBuu2beFLXteeJ5m/LeCiGuhwsCNcR1Upk2m9Mvi2ozrCbP1twE9Z4tNWASCnV9pXstSdNLG5PeSNZM41DoaWgRLd7JYUGVX0tdgzi1 SXtOp86h gaCFaMpocmBxWjtomETOWnn5Kh0zwTiGgUtDRw53XZKwgiKSdWdK18PEai+ljnF1jqT89sqjNtVLokqbqoWNKmOK1uSpNVg0Qfgselysqd4ThPs9EAo2E1QtCMFAGaIsoruERD/TZjfC/6vlB048C1yG42u8JdLRPHo7deV6i+AnMEK+xY6XUamPP6lnQ2Wdf6GJhCfYWoV4z4EtWOkEd3mra5TiTeSttzrJO X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Apr 29, 2024 at 11:06=E2=80=AFPM Shakeel Butt wrote: > > At the moment, the amount of memory allocated for stats related structs > in the mem_cgroup corresponds to the size of enum node_stat_item. > However not all fields in enum node_stat_item has corresponding memcg > stats. So, let's use indirection mechanism similar to the one used for > memcg vmstats management. > > For a given x86_64 config, the size of stats with and without patch is: > > structs size in bytes w/o with > > struct lruvec_stats 1128 648 > struct lruvec_stats_percpu 752 432 > struct memcg_vmstats 1832 1352 > struct memcg_vmstats_percpu 1280 960 > > The memory savings is further compounded by the fact that these structs > are allocated for each cpu and for each node. To be precise, for each > memcg the memory saved would be: > > Memory saved =3D ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODS * NR_CPUS) + > (21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long) > > Where 21 is the number of fields eliminated. 
>
> Signed-off-by: Shakeel Butt
> ---
>
> Changes since v2:
> - N/A
>
>  mm/memcontrol.c | 138 ++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 115 insertions(+), 23 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 434cff91b65e..f424c5b2ba9b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -576,35 +576,105 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>         return mz;
>  }
>
> +/* Subset of node_stat_item for memcg stats */
> +static const unsigned int memcg_node_stat_items[] = {
> +       NR_INACTIVE_ANON,
> +       NR_ACTIVE_ANON,
> +       NR_INACTIVE_FILE,
> +       NR_ACTIVE_FILE,
> +       NR_UNEVICTABLE,
> +       NR_SLAB_RECLAIMABLE_B,
> +       NR_SLAB_UNRECLAIMABLE_B,
> +       WORKINGSET_REFAULT_ANON,
> +       WORKINGSET_REFAULT_FILE,
> +       WORKINGSET_ACTIVATE_ANON,
> +       WORKINGSET_ACTIVATE_FILE,
> +       WORKINGSET_RESTORE_ANON,
> +       WORKINGSET_RESTORE_FILE,
> +       WORKINGSET_NODERECLAIM,
> +       NR_ANON_MAPPED,
> +       NR_FILE_MAPPED,
> +       NR_FILE_PAGES,
> +       NR_FILE_DIRTY,
> +       NR_WRITEBACK,
> +       NR_SHMEM,
> +       NR_SHMEM_THPS,
> +       NR_FILE_THPS,
> +       NR_ANON_THPS,
> +       NR_KERNEL_STACK_KB,
> +       NR_PAGETABLE,
> +       NR_SECONDARY_PAGETABLE,
> +#ifdef CONFIG_SWAP
> +       NR_SWAPCACHE,
> +#endif
> +};
> +
> +static const unsigned int memcg_stat_items[] = {
> +       MEMCG_SWAP,
> +       MEMCG_SOCK,
> +       MEMCG_PERCPU_B,
> +       MEMCG_VMALLOC,
> +       MEMCG_KMEM,
> +       MEMCG_ZSWAP_B,
> +       MEMCG_ZSWAPPED,
> +};
> +
> +#define NR_MEMCG_NODE_STAT_ITEMS ARRAY_SIZE(memcg_node_stat_items)
> +#define NR_MEMCG_STATS (NR_MEMCG_NODE_STAT_ITEMS + ARRAY_SIZE(memcg_stat_items))
> +static int8_t mem_cgroup_stats_index[MEMCG_NR_STAT] __read_mostly;

NR_MEMCG_STATS and MEMCG_NR_STAT are awfully close and have different
meanings. I think we should come up with better names (sorry, nothing
comes to mind) or add a comment to make the difference more obvious.

> +
> +static void init_memcg_stats(void)
> +{
> +       int8_t i, j = 0;
> +
> +       /* Switch to short once this failure occurs. */
> +       BUILD_BUG_ON(NR_MEMCG_STATS >= 127 /* INT8_MAX */);

Should we use S8_MAX here too?
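
(Side note for other readers of the thread: the scheme here is just a
sparse-to-dense index map. A minimal userspace sketch of the same idea,
with made-up names and sizes, untested and obviously not the kernel
code:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical "full" enum; only some entries are tracked. */
    enum full_stat { STAT_A, STAT_B, STAT_C, STAT_D, NR_FULL_STATS };

    /* The tracked subset; dense arrays get 2 slots instead of 4. */
    static const unsigned int tracked[] = { STAT_A, STAT_C };
    #define NR_TRACKED (sizeof(tracked) / sizeof(tracked[0]))

    /* Zero means "untracked"; tracked entries hold dense index + 1. */
    static int8_t stats_index[NR_FULL_STATS];

    static void init_stats_index(void)
    {
            int8_t j = 0;
            unsigned int i;

            for (i = 0; i < NR_TRACKED; ++i)
                    stats_index[tracked[i]] = ++j;
    }

    /* Dense slot for idx, or -1 if the stat is not tracked. */
    static int dense_index(int idx)
    {
            return stats_index[idx] - 1;
    }

    int main(void)
    {
            init_stats_index();
            printf("STAT_A -> %d\n", dense_index(STAT_A)); /* 0 */
            printf("STAT_B -> %d\n", dense_index(STAT_B)); /* -1 */
            printf("STAT_C -> %d\n", dense_index(STAT_C)); /* 1 */
            return 0;
    }

The +1/-1 dance is what lets the zero-initialized static table double
as the "untracked" marker, which is also why the indices start at 1 in
init_memcg_stats() above.)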
> +
> +       for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; ++i)
> +               mem_cgroup_stats_index[memcg_node_stat_items[i]] = ++j;
> +
> +       for (i = 0; i < ARRAY_SIZE(memcg_stat_items); ++i)
> +               mem_cgroup_stats_index[memcg_stat_items[i]] = ++j;
> +}
> +
> +static inline int memcg_stats_index(int idx)
> +{
> +       return mem_cgroup_stats_index[idx] - 1;
> +}
> +
>  struct lruvec_stats_percpu {
>         /* Local (CPU and cgroup) state */
> -       long state[NR_VM_NODE_STAT_ITEMS];
> +       long state[NR_MEMCG_NODE_STAT_ITEMS];
>
>         /* Delta calculation for lockless upward propagation */
> -       long state_prev[NR_VM_NODE_STAT_ITEMS];
> +       long state_prev[NR_MEMCG_NODE_STAT_ITEMS];
>  };
>
>  struct lruvec_stats {
>         /* Aggregated (CPU and subtree) state */
> -       long state[NR_VM_NODE_STAT_ITEMS];
> +       long state[NR_MEMCG_NODE_STAT_ITEMS];
>
>         /* Non-hierarchical (CPU aggregated) state */
> -       long state_local[NR_VM_NODE_STAT_ITEMS];
> +       long state_local[NR_MEMCG_NODE_STAT_ITEMS];
>
>         /* Pending child counts during tree propagation */
> -       long state_pending[NR_VM_NODE_STAT_ITEMS];
> +       long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
>  };
>
>  unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx)
>  {
>         struct mem_cgroup_per_node *pn;
> -       long x;
> +       long x = 0;
> +       int i;
>
>         if (mem_cgroup_disabled())
>                 return node_page_state(lruvec_pgdat(lruvec), idx);
>
> -       pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> -       x = READ_ONCE(pn->lruvec_stats->state[idx]);
> +       i = memcg_stats_index(idx);
> +       if (i >= 0) {

nit: we could return early here if (i < 0), like you did in
memcg_page_state() and others below; less indentation. Same for
lruvec_page_state_local(). (A quick sketch of what I mean is at the
end of this mail.)

> +               pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> +               x = READ_ONCE(pn->lruvec_stats->state[i]);
> +       }
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> @@ -617,12 +687,16 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  {
>         struct mem_cgroup_per_node *pn;
>         long x = 0;
> +       int i;
>
>         if (mem_cgroup_disabled())
>                 return node_page_state(lruvec_pgdat(lruvec), idx);
>
> -       pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> -       x = READ_ONCE(pn->lruvec_stats->state_local[idx]);
> +       i = memcg_stats_index(idx);
> +       if (i >= 0) {
> +               pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> +               x = READ_ONCE(pn->lruvec_stats->state_local[i]);
> +       }
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> @@ -689,11 +763,11 @@ struct memcg_vmstats_percpu {
>         /* The above should fit a single cacheline for memcg_rstat_updated() */
>
>         /* Local (CPU and cgroup) page state & events */
> -       long state[MEMCG_NR_STAT];
> +       long state[NR_MEMCG_STATS];
>         unsigned long events[NR_MEMCG_EVENTS];
>
>         /* Delta calculation for lockless upward propagation */
> -       long state_prev[MEMCG_NR_STAT];
> +       long state_prev[NR_MEMCG_STATS];
>         unsigned long events_prev[NR_MEMCG_EVENTS];
>
>         /* Cgroup1: threshold notifications & softlimit tree updates */
> @@ -703,15 +777,15 @@ struct memcg_vmstats_percpu {
>
>  struct memcg_vmstats {
>         /* Aggregated (CPU and subtree) page state & events */
> -       long state[MEMCG_NR_STAT];
> +       long state[NR_MEMCG_STATS];
>         unsigned long events[NR_MEMCG_EVENTS];
>
>         /* Non-hierarchical (CPU aggregated) page state & events */
> -       long state_local[MEMCG_NR_STAT];
> +       long state_local[NR_MEMCG_STATS];
>         unsigned long events_local[NR_MEMCG_EVENTS];
>
>         /* Pending child counts during tree propagation */
> -       long state_pending[MEMCG_NR_STAT];
> +       long state_pending[NR_MEMCG_STATS];
>         unsigned long events_pending[NR_MEMCG_EVENTS];
>
>         /* Stats updates since the last flush */
> @@ -844,7 +918,13 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
>
>  unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
>  {
> -       long x = READ_ONCE(memcg->vmstats->state[idx]);
> +       long x;
> +       int i = memcg_stats_index(idx);
> +
> +       if (i < 0)
> +               return 0;
> +
> +       x = READ_ONCE(memcg->vmstats->state[i]);
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> @@ -876,18 +956,25 @@ static int memcg_state_val_in_pages(int idx, int val)
>   */
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
>  {
> -       if (mem_cgroup_disabled())
> +       int i = memcg_stats_index(idx);
> +
> +       if (mem_cgroup_disabled() || i < 0)
>                 return;
>
> -       __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> +       __this_cpu_add(memcg->vmstats_percpu->state[i], val);
>         memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
>  }
>
>  /* idx can be of type enum memcg_stat_item or node_stat_item. */
>  static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
>  {
> -       long x = READ_ONCE(memcg->vmstats->state_local[idx]);
> +       long x;
> +       int i = memcg_stats_index(idx);
> +
> +       if (i < 0)
> +               return 0;
>
> +       x = READ_ONCE(memcg->vmstats->state_local[i]);
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> @@ -901,6 +988,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
>  {
>         struct mem_cgroup_per_node *pn;
>         struct mem_cgroup *memcg;
> +       int i = memcg_stats_index(idx);
> +
> +       if (i < 0)
> +               return;
>
>         pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
>         memcg = pn->memcg;
> @@ -930,10 +1021,10 @@ static void __mod_memcg_lruvec_state(struct lruvec *lruvec,
>         }
>
>         /* Update memcg */
> -       __this_cpu_add(memcg->vmstats_percpu->state[idx], val);
> +       __this_cpu_add(memcg->vmstats_percpu->state[i], val);
>
>         /* Update lruvec */
> -       __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
> +       __this_cpu_add(pn->lruvec_stats_percpu->state[i], val);
>
>         memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
>         memcg_stats_unlock();
> @@ -5702,6 +5793,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>                 page_counter_init(&memcg->kmem, &parent->kmem);
>                 page_counter_init(&memcg->tcpmem, &parent->tcpmem);
>         } else {
> +               init_memcg_stats();
>                 init_memcg_events();
>                 page_counter_init(&memcg->memory, NULL);
>                 page_counter_init(&memcg->swap, NULL);
> @@ -5873,7 +5965,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>
>         statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
>
> -       for (i = 0; i < MEMCG_NR_STAT; i++) {
> +       for (i = 0; i < NR_MEMCG_STATS; i++) {
>                 /*
>                  * Collect the aggregated propagation counts of groups
>                  * below us. We're in a per-cpu loop here and this is
> @@ -5937,7 +6029,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>
>         lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);
>
> -       for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
> +       for (i = 0; i < NR_MEMCG_NODE_STAT_ITEMS; i++) {
>                 delta = lstats->state_pending[i];
>                 if (delta)
>                         lstats->state_pending[i] = 0;
> --
> 2.43.0
>
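
For clarity, here is roughly what I mean for lruvec_page_state()
(untested sketch, just to show the shape; semantics unchanged since
the early return 0 matches the old x = 0 path for untracked stats):

    unsigned long lruvec_page_state(struct lruvec *lruvec,
                                    enum node_stat_item idx)
    {
            struct mem_cgroup_per_node *pn;
            long x;
            int i;

            if (mem_cgroup_disabled())
                    return node_page_state(lruvec_pgdat(lruvec), idx);

            /* Bail out early for stats that memcg does not track. */
            i = memcg_stats_index(idx);
            if (i < 0)
                    return 0;

            pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
            x = READ_ONCE(pn->lruvec_stats->state[i]);
    #ifdef CONFIG_SMP
            if (x < 0)
                    x = 0;
    #endif
            return x;
    }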