Message-ID: <28e08669302ad1e7a41bdf8b9988de6a352b5fe1.camel@redhat.com>
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining
From: Leonardo Brás <leobras@redhat.com>
To: Michal Hocko, Roman Gushchin, Marcelo Tosatti
Cc: Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
 cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Sat, 04 Feb 2023 01:55:10 -0300
References: <20230125073502.743446-1-leobras@redhat.com>
 <9e61ab53e1419a144f774b95230b789244895424.camel@redhat.com>
 <0122005439ffb7895efda7a1a67992cbe41392fe.camel@redhat.com>
Michal, Roman,

I understand you have far more experience in this codebase than myself, so
please help me understand what I am missing in my argument for the spinlock
approach. I honestly want to improve, and your help is really appreciated.

On Wed, 2023-02-01 at 13:41 +0100, Michal Hocko wrote:
> On Tue 31-01-23 08:35:34, Marcelo Tosatti wrote:
> [...]
> > So it would be good to point out a specific problematic
> > testcase/scenario with using the spinlock in this particular case?
>
> Please think about it some more. The sole purpose of the pcp charge
> caching is to avoid atomics because the normal fast path (i.e. no limit
> hit) is a page counter which is an atomic counter. If you go with a spin
> lock for the pcp cache you are just losing some of the advantage of the
> caching. Sure that would be a smaller atomics use than a deeper memcg
> hierarchy but still.

I read the code that calls consume_stock(), and noticed that you are
probably referring to the atomic operations on memcg->memory->usage (and
others) used in page_counter_try_charge(). It is a single atomic variable
per memcg that is potentially accessed by every cpu. Is that correct?

I understand the cost of an atomic operation such as this is high because
of the inherent cost of bouncing the variable's cacheline between those
cpus.

The difference between the 'usage' atomic variable and the spinlock we are
proposing is the scope of the variable: the spinlock is defined on a
per-cpu structure, and most of the accesses will come from the local cpu.

According to the Intel® 64 and IA-32 Architectures Software Developer's
Manual, Vol. 3A, page 9-5:

### 9.1.4 Effects of a LOCK Operation on Internal Processor Caches
[...]
For the P6 and more recent processor families, if the area of memory being
locked during a LOCK operation is cached in the processor that is
performing the LOCK operation as write-back memory and is completely
contained in a cache line, the processor may not assert the LOCK# signal
on the bus. Instead, it will modify the memory location internally and
allow its cache coherency mechanism to ensure that the operation is
carried out atomically.
This operation is called "cache locking." The cache coherency mechanism
automatically prevents two or more processors that have cached the same
area of memory from simultaneously modifying data in that area.
###

So locking a percpu spinlock which is cached in the current cpu (which
happens most of the time) is as cheap as modifying the local cacheline.
Since the cachelines are already modified by the local cpu functions, the
write to memory is batched.

For reference, this is the pahole output for memcg_stock_pcp after my
patch. The spinlock fits in the same cacheline as the changed variables.

struct memcg_stock_pcp {
	spinlock_t           stock_lock;              /*  0  4 */
	unsigned int         nr_pages;                /*  4  4 */
	struct mem_cgroup *  cached;                  /*  8  8 */
	struct obj_cgroup *  cached_objcg;            /* 16  8 */
	struct pglist_data * cached_pgdat;            /* 24  8 */
	unsigned int         nr_bytes;                /* 32  4 */
	int                  nr_slab_reclaimable_b;   /* 36  4 */
	int                  nr_slab_unreclaimable_b; /* 40  4 */

	/* size: 48, cachelines: 1, members: 8 */
	/* padding: 4 */
	/* last cacheline: 48 bytes */
};

The only accesses that will come from a remote cpu, and thus cause the
cacheline to bounce and the lock to be more expensive, are the ones from
drain_all_stock(). OTOH, on upstream, those accesses already modify the
remote cacheline with an atomic variable (memcg_stock_pcp.flags), so we
are only dealing with potential contention costs.

> Not to mention a potential contention which is hard to predict and it
> will depend on the specific runtime very much. So not something that
> would be easy to test for other than artificial testcases. Full cpu
> pcp draining is not a very common operation and one could argue that
> the problem will be limited.

Yes, but we are exchanging an "always schedule_work_on()", which is a kind
of contention, for a "sometimes we hit spinlock contention".
For the spinlock proposal, on the local cpu side, the *worst case*
contention is:
1 - wait for the spin_unlock() of a complete drain,
2 - wait for a cache hit on the local per-cpu cacheline.

With what is currently implemented (the schedule_work_on() approach), on
the local cpu side there is *always* this contention:
1 - wait for a context switch,
2 - wait for a cache hit on its local per-cpu cacheline,
3 - wait for a complete drain,
4 - then wait for a new context switch back to the current thread.

So moving from schedule_work_on() to spinlocks will save 2 context
switches per cpu every time drain_all_stock() is called.

On the remote cpu side, my tests indicate that doing the remote draining
is faster than scheduling a local draining, so it's also a gain.

Also, IIUC the possible contention in the spinlock approach happens only
on page faults and syscalls, whereas the schedule_work_on() approach can
interrupt the user workload at any time.

In fact, not interrupting the user workload on isolated cpus is just a
bonus of using spinlocks.

> but so far I haven't really heard any strong
> arguments against the proposal to avoid scheduling the work on isolated
> cpus which is a much more simpler solution and achieves the same
> effect.

I understand the 'not scheduling work on isolated cpus' appeal: it's a
much simpler change, and will only affect cases with isolated cpus, so it
is safer than changing the whole locking structure.

I am not against it. I already have a patch implementing it for testing,
and I could gladly send it if you see fit.

But if nothing else, it introduces another specific case, and now it has
to deal differently with the local cpu, remote cpus, and isolated cpus.

With spinlocks, there is a much simpler solution:
- All cpus are dealt with the same way, which is faster on the local cpu
  (as in the upstream implementation).
- We don't have to avoid draining on isolated cpus.
- We get rid of the "work_struct" and its scheduling costs.
- We get rid of "flag", as we don't need to take care of multi-scheduling
  the local draining.
- When drain_all_stock() returns, all stocks have already been drained, so
  retrying consume_stock() may have immediate results in pre-OOM cases.

On Wed, 2023-02-01 at 13:52 +0100, Michal Hocko wrote:
[...]
> Let me be clear here. Unless there are some real concerns about not
> flushing remotely on isolated cpus then the spin lock approach is just
> much less preferable. So I would much rather focus on the trivial patch
> and investigate whether there are no new corner cases to be created by
> that.

I understand the 'not scheduling work on isolated cpus' approach should
work with low impact on current behavior. I also understand that as
Maintainers you have to consider many more factors than a regular
developer like me, and that the spinlock approach is a big change to how
the pcp memcg caches work.

On that topic, if there is any comparison test you find important to run,
or any topic you think could be better discussed, please let me know.

Thank you for reviewing,
Leo