From: Leonardo Brás
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining
Date: Tue, 07 Feb 2023 00:18:01 -0300
Message-ID: <4b232f47e038ab6fcaa0114f73c28d4bf8799f84.camel@redhat.com>
References: <9e61ab53e1419a144f774b95230b789244895424.camel@redhat.com> <0122005439ffb7895efda7a1a67992cbe41392fe.camel@redhat.com> <28e08669302ad1e7a41bdf8b9988de6a352b5fe1.camel@redhat.com>
To: Roman Gushchin
Cc: Michal Hocko, Marcelo Tosatti, Johannes Weiner, Shakeel Butt, Muchun Song, Andrew Morton, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Sun, 2023-02-05 at 11:49 -0800, Roman Gushchin wrote:
> Hi Leonardo!

Hello Roman,
Thanks a lot for replying!

> > Yes, but we are exchanging an "always schedule_work_on()", which is a
> > kind of contention, for a "sometimes we hit spinlock contention".
> >
> > For the spinlock proposal, on the local cpu side, the *worst case*
> > contention is:
> > 1 - wait for the spin_unlock() of a complete drain,
> > 2 - wait for a cache hit on the local per-cpu cacheline.
> >
> > With what is currently implemented (the schedule_work_on() approach),
> > on the local cpu side there is *always* this contention:
> > 1 - wait for a context switch,
> > 2 - wait for a cache hit on its local per-cpu cacheline,
> > 3 - wait for a complete drain,
> > 4 - then wait for a new context switch back to the current thread.
>
> I think both Michal and I are thinking of a more generic case in which
> the cpu is not exclusively consumed by 1 special process, so that the
> draining work can be executed during idle time. In this case the work is
> basically free.

Oh, that makes sense.
But in such scenarios, wouldn't the same happen with spinlocks?

I mean, most of the spinlock contention only happens if the remote cpu is
trying to drain the cache while the local cpu happens to be draining or
charging, which is quite rare given how fast the local cpu operations are.

Also, if the cpu has some idle time, spending a little of it on possible
spinlock contention should not be a problem. Right?

> And the introduction of a spin_lock() on the hot path is what we're
> concerned about. I agree that on some hardware platforms it won't be
> that expensive,

IIRC most multicore hardware platforms supported by the kernel should
behave the same way, since it's better to rely on cache coherence than on
locking the memory bus.

For instance, the other popular architectures supported by Linux use the
LR/SC strategy for atomic operations (tested on ARM, POWER, RISC-V), and
IIRC the Load-Reserve slow part waits for cacheline exclusivity, which
this per-cpu structure already has.
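To make the comparison concrete, here is a minimal sketch of the scheme I
have in mind. It is illustrative only: the struct and function names below
(stock_pcp_sketch, consume_stock_sketch, drain_all_stock_sketch) are made
up for this email, and the actual patch is more involved:

#include <linux/spinlock.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/memcontrol.h>

/*
 * Illustrative sketch, simplified from the real memcg stock code:
 * each per-cpu stock gets its own spinlock, so a remote cpu can
 * drain it directly instead of queueing work on the target cpu.
 * (Lock initialization is omitted for brevity.)
 */
struct stock_pcp_sketch {
	spinlock_t lock;		/* taken by local and remote cpus */
	struct mem_cgroup *cached;	/* memcg this stock belongs to */
	unsigned int nr_pages;		/* pre-charged pages */
};
static DEFINE_PER_CPU(struct stock_pcp_sketch, stock_sketch);

/* Local fast path: almost always uncontended. */
static bool consume_stock_sketch(struct mem_cgroup *memcg,
				 unsigned int nr_pages)
{
	struct stock_pcp_sketch *stock;
	unsigned long flags;
	bool ret = false;

	/*
	 * Even if the task migrates between this_cpu_ptr() and
	 * spin_lock_irqsave(), correctness comes from the lock,
	 * not from staying on the same cpu.
	 */
	stock = this_cpu_ptr(&stock_sketch);
	spin_lock_irqsave(&stock->lock, flags);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		ret = true;
	}
	spin_unlock_irqrestore(&stock->lock, flags);

	return ret;
}

/* Remote drain: no schedule_work_on(), no interrupting isolated cpus. */
static void drain_all_stock_sketch(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct stock_pcp_sketch *stock = per_cpu_ptr(&stock_sketch, cpu);
		unsigned long flags;

		spin_lock_irqsave(&stock->lock, flags);
		/* uncharging of stock->nr_pages would happen here */
		stock->nr_pages = 0;
		stock->cached = NULL;
		spin_unlock_irqrestore(&stock->lock, flags);
	}
}

The point being: the only extra cost on the local fast path is a (usually
uncontended) spin_lock_irqsave() on a cacheline this cpu already owns,
while the remote side gets rid of the context switches entirely.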
> but in general not having any spinlocks is so much better.

I agree that spinlocks may bring contention, which is not ideal in many
cases. In this case, though, it may not be a big issue, since remote
accesses to the structure are very rare in the usual case (a non-pre-OOM
memcg).

> > So moving from schedule_work_on() to spinlocks will save 2 context
> > switches per cpu every time drain_all_stock() is called.
> >
> > On the remote cpu side, my tests indicate that doing the remote
> > draining is faster than scheduling a local draining, so it's also a
> > gain.
> >
> > Also, IIUC the possible contention in the spinlock approach happens
> > only on page faults and syscalls, whereas the schedule_work_on()
> > approach can interrupt the user workload at any time.
> >
> > In fact, not interrupting the user workload on isolated cpus is just a
> > bonus of using spinlocks.
>
> I believe it significantly depends on the preemption model: you're right
> regarding fully preemptive kernels, but with voluntary/none preemption
> it's exactly the opposite: the draining work will be executed at some
> point later (probably with 0 cost),

So this is about the case of voluntary/none preemption with some free cpu
time.

> while the remote access from another cpu will potentially cause delays
> on the spin lock as well as a need to refill the stock.

But if there is some free cpu time, what is the problem with some
(potential) delay due to spinlock contention?

I am probably missing the whole picture, but when I think of a performance
improvement, I think of doing more with the same cputime. If we can use
free cputime to do stuff later, it's only fair to also use it in case of
contention, right?

I know some cases may need to be more predictable (mostly RT), but when I
think of memory allocation, I don't expect it to always take the same time
(as there are caches, pre-OOM handling, and so on).

Also, as previously discussed, on a busy cpu the spinlock approach will
probably allow more work to get done.

> Overall I'd expect a noticeable performance regression from the
> introduction of spin locks and remote draining. Maybe not on all
> platforms, but at least on some. That's my main concern.

I see.

On the platform I have tested (x86) I noticed better overall performance
with spinlocks than with the upstream solution. For other popular
platforms, I have briefly read some documentation on locking/atomicity,
and I think we may keep the performance gains.

But to be sure, I could rerun the tests on other platforms, such as ARM,
POWER and RISC-V, or run any extra tests you suggest.

With that info, would you feel less concerned about a possible change in
the memcg pcp cache locking scheme?

> And I don't think the problem we're aiming to solve here justifies this
> potential regression.

Strictly speaking, the isolated-cpu scheduling problem is already fixed by
the housekeeping patch (with some limitations).

At this point, I am trying to bring focus to a (possible) performance
improvement in the memcg pcp cache locking scheme.

> Thanks!

Thank you for helping me better understand your arguments and concerns.
I really appreciate it!

Best regards,
Leo