Date: Fri, 28 Jun 2024 15:47:38 -0300
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Leonardo Bras
Cc: Boqun Feng, Vlastimil Babka, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Hyeonggon Yoo <42.hyeyoo@gmail.com>, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [RFC PATCH v1 0/4] Introduce QPW for per-cpu operations
References: <20240622035815.569665-1-leobras@redhat.com>
	<261612b9-e975-4c02-a493-7b83fa17c607@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Mon, Jun 24, 2024 at 11:57:57PM -0300, Leonardo Bras wrote:
> On Mon, Jun 24, 2024 at 03:54:14PM -0700, Boqun Feng wrote:
> > On Mon, Jun 24, 2024 at 09:31:51AM +0200, Vlastimil Babka wrote:
> > > Hi,
> > >
> > > you've included tglx, which is great, but there's also LOCKING
> > > PRIMITIVES section in MAINTAINERS so I've added folks from there in
> > > my reply.
> >
> > Thanks!
> >
> > > Link to full series:
> > > https://lore.kernel.org/all/20240622035815.569665-1-leobras@redhat.com/
> >
> > And apologies to Leonardo... I think this is a follow-up of:
> >
> > https://lpc.events/event/17/contributions/1484/
> >
> > and I did remember we had a quick chat after that where I suggested it's
> > better to change to a different name, sorry that I never found time to
> > write a proper reply to your previous series [1] as promised.
> >
> > [1]: https://lore.kernel.org/lkml/20230729083737.38699-2-leobras@redhat.com/
>
> That's correct, I commented about this at the end of the above
> presentation. Don't worry, and thanks for suggesting the per-cpu naming,
> it was very helpful in designing this solution.
>
> > > On 6/22/24 5:58 AM, Leonardo Bras wrote:
> > > > The problem:
> > > > Some places in the kernel implement a parallel programming strategy
> > > > consisting of local_locks() for most of the work, while some rare
> > > > remote operations are scheduled on the target cpu. This keeps cache
> > > > bouncing low since the cacheline tends to be mostly local, and
> > > > avoids the cost of locks in non-RT kernels, even though the very
> > > > few remote operations will be expensive due to scheduling overhead.
> > > >
> > > > On the other hand, for RT workloads this can represent a problem:
> > > > getting an important workload scheduled out to deal with remote
> > > > requests is sure to introduce unexpected deadline misses.
> > > >
> > > > The idea:
> > > > Currently with PREEMPT_RT=y, local_locks() become per-cpu
> > > > spinlocks. In this case, instead of scheduling work on a remote
> > > > cpu, it should be safe to grab that remote cpu's per-cpu spinlock
> > > > and run the required work locally. The major cost, which is
> > > > un/locking in every local function, already happens in PREEMPT_RT.
> > >
> > > I've also noticed this a while ago (likely in the context of
> > > rewriting SLUB to use local_lock) and asked about it on IRC, and
> > > IIRC tglx wasn't fond of the idea. But I forgot the details about
> > > why, so I'll let the locking experts reply...
> >
> > I think it's a good idea, especially the new name is less confusing ;-)
> > So I wonder about Thomas' thoughts as well.
>
> Thanks!
>
> > And I think a few (micro-)benchmark numbers will help.
>
> Last year I got some numbers on how replacing local_locks with
> spinlocks would impact memcontrol.c cache operations:
>
> https://lore.kernel.org/all/20230125073502.743446-1-leobras@redhat.com/
>
> tl;dr: It increased clocks spent in the most common this_cpu operations,
> while reducing clocks spent in remote operations (drain_all_stock).
>
> In the RT case, since local locks are already spinlocks, this cost is
> already paid, so we can get results like these:
>
> drain_all_stock
> cpus  Upstream      Patched       Diff (cycles)   Diff (%)
> 1     44331.10831   38978.03581   -5353.072507    -12.07520567
> 8     43992.96512   39026.76654   -4966.198572    -11.2886198
> 128   156274.6634   58053.87421   -98220.78915    -62.85138425
>
> Upstream: Clocks to schedule work on the remote CPU (performing the
> work itself not accounted).
> Patched: Clocks to grab the remote cpu's spinlock and perform the
> needed work locally.
>
> Do you have other suggestions to use as (micro-)benchmarks?
>
> Thanks!
> Leo

One improvement noted when mm/page_alloc.c was converted to spinlock +
remote drain was that it can bypass waiting for a kworker to be scheduled
(on heavily loaded CPUs):

commit 443c2accd1b6679a1320167f8f56eed6536b806e
Author: Nicolas Saenz Julienne
Date:   Fri Jun 24 13:54:22 2022 +0100

    mm/page_alloc: remotely drain per-cpu lists

    Some setups, notably NOHZ_FULL CPUs, are too busy to handle the
    per-cpu drain work queued by __drain_all_pages().  So introduce a
    new mechanism to remotely drain the per-cpu lists.
    It is made possible by remotely locking 'struct per_cpu_pages' new
    per-cpu spinlocks.  A benefit of this new scheme is that drain
    operations are now migration safe.

    There was no observed performance degradation vs. the previous
    scheme.  Both netperf and hackbench were run in parallel to
    triggering the __drain_all_pages(NULL, true) code path around ~100
    times per second.  The new scheme performs a bit better (~5%),
    although the important point here is there are no performance
    regressions vs. the previous mechanism.  Per-cpu lists draining
    happens only in slow paths.

    Minchan Kim tested an earlier version and reported;

        My workload is not NOHZ CPUs but run apps under heavy memory
        pressure so they goes to direct reclaim and be stuck on
        drain_all_pages until work on workqueue run.

        unit: nanosecond
        max(dur)        avg(dur)                count(dur)
        166713013       487511.77786438033      1283

        From traces, system encountered the drain_all_pages 1283 times
        and worst case was 166ms and avg was 487us.

        The other problem was alloc_contig_range in CMA.  The PCP
        draining takes several hundred millisecond sometimes though
        there is no memory pressure or a few of pages to be migrated
        out but CPU were fully booked.

        Your patch perfectly removed those wasted time.