Date: Mon, 21 Aug 2023 14:07:28 -0700
From: Dennis Zhou
To: Mateusz Guzik
Cc: linux-kernel@vger.kernel.org, tj@kernel.org, cl@linux.com,
	akpm@linux-foundation.org, shakeelb@google.com, linux-mm@kvack.org
Subject: Re: [PATCH 0/2] execve scalability issues, part 1
In-Reply-To: <20230821202829.2163744-1-mjguzik@gmail.com>

Hello,

On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> To start I figured I'm going to bench about as friendly a case as it
> gets -- statically linked *separate* binaries all doing execve in a
> loop.
> 
> I borrowed the bench found here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
> 
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
> 
> It prints a result every second (warning: first line is garbage).
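For anyone following along without clicking through, the guts of such a
bench are roughly the below -- my own sketch, not the actual doexec.c
from the link above:

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
	char *args[] = { (char *)"/bin/true", NULL };
	unsigned long ops = 0;
	time_t start = time(NULL);

	/* One of these loops per CPU, per the $(nproc) argument above. */
	for (;;) {
		pid_t pid = fork();

		if (pid == 0) {
			/*
			 * /bin/true is a stand-in here; the real bench execs
			 * statically linked binaries (hence -static above),
			 * keeping the dynamic loader out of the picture.
			 */
			execve("/bin/true", args, NULL);
			_exit(127);
		}
		waitpid(pid, NULL, 0);
		ops++;

		if (time(NULL) != start) {	/* report once a second */
			printf("%lu execs/sec\n", ops);
			ops = 0;
			start = time(NULL);
		}
	}
}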
> 
> My test box is temporarily only 26 cores and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
> 
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
> 

Unfortunately it's taken me longer than I'd like to get back to this,
and I'm not super happy with the results: I wrote a batch percpu
allocation API. It's better than the starting place, but I'm falling
short on the free path. I am/was also wrestling with the lifetime
question: should these lifetimes be percpu's problem, or call-site
bound like you've done?

What I like about this approach is that you also group the
percpu_counter lock acquisition, which batching percpu allocations
alone wouldn't be able to solve no matter how well I do (rough sketch
of what I'm picturing at the bottom of this mail). I'll review this
more closely today.

> Even if this code could be patched to dodge these counters, I would
> argue a high-traffic alloc/free consumer is only a matter of time so it
> makes sense to facilitate it.
> 
> With the fix I get an ok win, to quote from the commit:
> > Even at a very modest scale of 26 cores (ops/s):
> > before:  133543.63
> > after:   186061.81 (+39%)
> 
> > While with the patch these allocations remain a significant problem,
> > the primary bottleneck shifts to:
> > 
> > __pv_queued_spin_lock_slowpath+1
> > _raw_spin_lock_irqsave+57
> > folio_lruvec_lock_irqsave+91
> > release_pages+590
> > tlb_batch_pages_flush+61
> > tlb_finish_mmu+101
> > exit_mmap+327
> > __mmput+61
> > begin_new_exec+1245
> > load_elf_binary+712
> > bprm_execve+644
> > do_execveat_common.isra.0+429
> > __x64_sys_execve+50
> > do_syscall_64+46
> > entry_SYSCALL_64_after_hwframe+110
> 
> I intend to do more work on the area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)
> 
> With this out of the way I'll be looking at some form of caching to
> eliminate these allocs as a problem.
> 

I'm not against caching, this is just my first thought. Caching will
have an impact on the backing pages of percpu: all it takes is one
live allocation on a page for the current allocator to pin n pages of
memory. A few years ago percpu depopulation was implemented, which
limits the amount of resident backing pages.

Maybe the right thing to do is to preallocate pools of common-sized
allocations so they can be recycled, and we don't have to think too
hard about the fragmentation that can occur if we populate these pools
over time?

Also, as you've pointed out, it wasn't just the percpu allocation being
the bottleneck, but percpu_counter's global lock (there for hotplug
support) too. I'm hazarding a guess that most users of percpu have
additional locking requirements on top, as percpu_counter does.

Thanks,
Dennis

> Thoughts?
> 
> Mateusz Guzik (2):
>   pcpcntr: add group allocation/free
>   fork: group allocation of per-cpu counters for mm struct
> 
>  include/linux/percpu_counter.h | 19 ++++++++---
>  kernel/fork.c                  | 13 ++------
>  lib/percpu_counter.c           | 61 ++++++++++++++++++++++++----------
>  3 files changed, 60 insertions(+), 33 deletions(-)
> 
> -- 
> 2.39.2
> 
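P.S. The shape I'm picturing for a grouped counter init is roughly the
below -- purely a sketch with a made-up name to anchor the discussion,
not meant to match your actual patch, and assuming CONFIG_HOTPLUG_CPU.
The point is to pay for the percpu allocation and the trip through the
global hotplug-protection lock once per group instead of once per
counter:

/*
 * Sketch only (hypothetical helper), as it might sit in
 * lib/percpu_counter.c next to the existing percpu_counters_lock and
 * percpu_counters list.
 */
int percpu_counter_init_group(struct percpu_counter *fbc, u32 nr,
			      s64 amount, gfp_t gfp)
{
	s32 __percpu *counters;
	unsigned long flags;
	u32 i;

	/* One percpu allocation backing all nr counters. */
	counters = __alloc_percpu_gfp(nr * sizeof(*counters),
				      __alignof__(*counters), gfp);
	if (!counters)
		return -ENOMEM;

	for (i = 0; i < nr; i++) {
		raw_spin_lock_init(&fbc[i].lock);
		fbc[i].count = amount;
		fbc[i].counters = &counters[i];
	}

	/* One acquisition of the global (hotplug) lock for the whole group. */
	spin_lock_irqsave(&percpu_counters_lock, flags);
	for (i = 0; i < nr; i++)
		list_add(&fbc[i].list, &percpu_counters);
	spin_unlock_irqrestore(&percpu_counters_lock, flags);

	return 0;
}

The destroy side would mirror it: unlink the whole group under a single
lock acquisition, then free the one percpu area.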