Message-ID: <28e08669302ad1e7a41bdf8b9988de6a352b5fe1.camel@redhat.com>
Subject: Re: [PATCH v2 0/5] Introduce memcg_stock_pcp remote draining
From: Leonardo Brás <leobras@redhat.com>
To: Michal Hocko, Roman Gushchin, Marcelo Tosatti
Cc: Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
 cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Sat, 04 Feb 2023 01:55:10 -0300
References: <20230125073502.743446-1-leobras@redhat.com>
 <9e61ab53e1419a144f774b95230b789244895424.camel@redhat.com>
 <0122005439ffb7895efda7a1a67992cbe41392fe.camel@redhat.com>
Michal, Roman,

I understand you have far more experience in this codebase than myself, so
please help me understand what I am missing in my argument for the spinlock
approach. I honestly want to improve, and your help is really appreciated.

On Wed, 2023-02-01 at 13:41 +0100, Michal Hocko wrote:
> On Tue 31-01-23 08:35:34, Marcelo Tosatti wrote:
> [...]
> > So it would be good to point out a specific problematic
> > testcase/scenario with using the spinlock in this particular case?
>
> Please think about it some more. The sole purpose of the pcp charge
> caching is to avoid atomics because the normal fast path (i.e. no limit
> hit) is a page counter which is an atomic counter. If you go with a spin
> lock for the pcp cache you are just losing some of the advantage of the
> caching. Sure that would be a smaller atomics use than a deeper memcg
> hierarchy but still.

I read the code that calls consume_stock(), and noticed that you are
probably referring to the atomic operations on memcg->memory->usage (and
others) used in page_counter_try_charge(). It is a single atomic variable
per memcg that is potentially accessed by every cpu. Is that correct?

I understand the cost of an atomic operation such as this is high because
of the inherent cost of bouncing the variable's cacheline between those
cpus.

The difference between the 'usage' atomic variable and the spinlock we are
proposing is the scope of the variable: the spinlock is defined on a
per-cpu structure, and most of the accesses will come from the local cpu.

According to the Intel® 64 and IA-32 Architectures Software Developer's
Manual, Vol. 3A, page 9-5:

### 9.1.4 Effects of a LOCK Operation on Internal Processor Caches
[...]
For the P6 and more recent processor families, if the area of memory being
locked during a LOCK operation is cached in the processor that is
performing the LOCK operation as write-back memory and is completely
contained in a cache line, the processor may not assert the LOCK# signal
on the bus. Instead, it will modify the memory location internally and
allow its cache coherency mechanism to ensure that the operation is
carried out atomically.
This operation is called "cache locking." The cache coherency mechanism
automatically prevents two or more processors that have cached the same
area of memory from simultaneously modifying data in that area.
###

So locking a percpu spinlock which is cached in the current cpu (which
happens most of the time) is as cheap as modifying the local cacheline.
Since the cachelines are already modified by the local cpu functions, the
write to memory is batched.

For reference, this is the pahole output for memcg_stock_pcp after my
patch. The spinlock fits in the same cacheline as the changed variables.

struct memcg_stock_pcp {
	spinlock_t           stock_lock;              /*  0  4 */
	unsigned int         nr_pages;                /*  4  4 */
	struct mem_cgroup *  cached;                  /*  8  8 */
	struct obj_cgroup *  cached_objcg;            /* 16  8 */
	struct pglist_data * cached_pgdat;            /* 24  8 */
	unsigned int         nr_bytes;                /* 32  4 */
	int                  nr_slab_reclaimable_b;   /* 36  4 */
	int                  nr_slab_unreclaimable_b; /* 40  4 */

	/* size: 48, cachelines: 1, members: 8 */
	/* padding: 4 */
	/* last cacheline: 48 bytes */
};

The only accesses that will come from a remote cpu, and thus cause the
cacheline to bounce and the lock to be more expensive, are the ones from
drain_all_stock(). OTOH, on upstream, those accesses already modify the
remote cacheline with an atomic variable (memcg_stock_pcp.flags), so we
are only dealing with potential contention costs.

> Not to mention a potential contention which is hard to predict and it
> will depend on the specific runtime very much. So not something that
> would be easy to test for other than artificial testcases. Full cpu
> pcp draining is not a very common operation and one could argue that
> the problem will be limited.

Yes, but we are exchanging an "always schedule_work_on()", which is a kind
of contention, for a "sometimes we hit spinlock contention".
For the spinlock proposal, on the local cpu side, the *worst case*
contention is:
1 - wait for the spin_unlock() of a complete drain,
2 - wait for a cache hit on the local per-cpu cacheline.

With what is currently implemented (the schedule_work_on() approach), on
the local cpu side there is *always* this contention:
1 - wait for a context switch,
2 - wait for a cache hit on its local per-cpu cacheline,
3 - wait for a complete drain,
4 - then wait for a new context switch back to the current thread.

So moving from schedule_work_on() to spinlocks will save 2 context
switches per cpu every time drain_all_stock() is called.

On the remote cpu side, my tests indicate that doing the remote draining
is faster than scheduling a local draining, so it's also a gain.

Also, IIUC the possible contention in the spinlock approach happens only
on page faults and syscalls, whereas the schedule_work_on() approach can
interrupt the user workload at any time.

In fact, not interrupting the user workload on isolated cpus is just a
bonus of using spinlocks.

> but so far I haven't really heard any strong
> arguments against the proposal to avoid scheduling the work on isolated
> cpus which is a much more simpler solution and achieves the same
> effect.

I understand the 'not scheduling work on isolated cpus' appeal: it's a
much simpler change, and will only affect cases with isolated cpus, so it
is safer than changing the whole locking structure.

I am not against it. I already have a patch implementing it for testing,
and I could gladly send it if you see fit.

But if nothing else, it introduces another specific case, and now it has
to deal differently with the local cpu, remote cpus, and isolated cpus.

With spinlocks, there is a much simpler solution:
- All cpus are dealt with the same way, which is faster on the local cpu
  (as in the upstream implementation).
- We don't have to avoid draining on isolated cpus.
- We get rid of the "work_struct" and its scheduling costs.
- We get rid of "flag", as we don't need to take care of multi-scheduling
  the local draining.
- When drain_all_stock() returns, all stocks have already been drained, so
  retrying consume_stock() may have immediate results in pre-OOM cases.

On Wed, 2023-02-01 at 13:52 +0100, Michal Hocko wrote:
[...]
> Let me be clear here. Unless there are some real concerns about not
> flushing remotely on isolated cpus then the spin lock approach is just
> much less preferable. So I would much rather focus on the trivial patch
> and investigate whether there are no new corner cases to be created by
> that.

I understand the 'not scheduling work on isolated cpus' approach should
work with low impact on current behavior. I also understand that as
Maintainers you have to consider many more factors than a regular
developer like me, and that the spinlock approach is a big change to how
the pcp memcg caches work.

On that topic, if there is any comparison test you find important to run,
or any topic you think could be better discussed, please let me know.

Thank you for reviewing,
Leo