From: Uladzislau Rezki
Date: Thu, 18 Jan 2024 19:23:47 +0100
To: Dave Chinner
Cc: Uladzislau Rezki, linux-mm@kvack.org, Andrew Morton, LKML,
    Baoquan He, Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
    "Liam R. Howlett", "Paul E. McKenney", Joel Fernandes,
    Oleksiy Avramchenko
Subject: Re: [PATCH v3 10/11] mm: vmalloc: Set nr_nodes based on CPUs in a system
References: <20240102184633.748113-1-urezki@gmail.com> <20240102184633.748113-11-urezki@gmail.com>

On Wed, Jan 17, 2024 at 09:06:02AM +1100, Dave Chinner wrote:
> On Mon, Jan 15, 2024 at 08:09:29PM +0100, Uladzislau Rezki wrote:
> > > On Tue, Jan 02, 2024 at 07:46:32PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > A number of nodes which are used in the alloc/free paths is
> > > > set based on num_possible_cpus() in a system. Please note a
> > > > high limit threshold though is fixed and corresponds to 128
> > > > nodes.
> > >
> > > Large CPU count machines are NUMA machines. All of the allocation
> > > and reclaim is NUMA node based i.e. a pgdat per NUMA node.
> > >
> > > Shrinkers are also able to be run in a NUMA aware mode so that
> > > per-node structures can be reclaimed similar to how per-node LRU
> > > lists are scanned for reclaim.
> > >
> > > Hence I'm left to wonder if it would be better to have a vmalloc
> > > area per pgdat (or sub-node cluster) rather than just base the
> > > number on CPU count and then have an arbitrary maximum number when
> > > we get to 128 CPU cores. We can have 128 CPU cores in a
> > > single socket these days, so not being able to scale the vmalloc
> > > areas beyond a single socket seems like a bit of a limitation.
> > >
> > > Scaling out the vmalloc areas in a NUMA aware fashion allows the
> > > shrinker to be run in numa aware mode, which gets rid of the need
> > > for the global shrinker to loop over every single vmap area in every
> > > shrinker invocation. Only the vm areas on the node that has a memory
> > > shortage need to be scanned and reclaimed, it doesn't need to reclaim
> > > everything globally when a single node runs out of memory.
> > >
> > > Yes, this may not give quite as good microbenchmark scalability
> > > results, but being able to locate each vm area in node local memory
> > > and have operation on them largely isolated to node-local tasks and
> > > vmalloc area reclaim will work much better on large multi-socket
> > > NUMA machines.
> > >
> > Currently i fix the max nodes number to 128. This is because i do not
> > have an access to such big NUMA systems whereas i do have an access to
> > around ~128 ones. That is why i have decided to stop on that number as
> > of now.
>
> I suspect you are confusing number of CPUs with number of NUMA nodes.
>
I do not think so :)

>
> A NUMA system with 128 nodes is a large NUMA system that will have
> thousands of CPU cores, whilst above you talk about basing the
> count on CPU cores and that a single socket can have 128 cores?
>
> > We can easily set nr_nodes to num_possible_cpus() and let it scale for
> > anyone. But before doing this, i would like to give it a try as a first
> > step because i have not tested it well on really big NUMA systems.
>
> I don't think you need to have large NUMA systems to test it. We
> have the "fakenuma" feature for a reason. Essentially, once you
> have enough CPU cores that catastrophic lock contention can be
> generated in a fast path (can take as few as 4-5 CPU cores), then
> you can effectively test NUMA scalability with fakenuma by creating
> nodes with >=8 CPUs each.
>
> This is how I've done testing of numa aware algorithms (like
> shrinkers!) for the past decade - I haven't had direct access to a
> big NUMA machine since 2008, yet it's relatively trivial to test
> NUMA based scalability algorithms without them these days.
>
I see your point. NUMA-aware scalability requires rework: adding an
extra layer that allows such scaling. If a socket has 256 CPUs, how do
we scale VAs inside that node among those CPUs?

--
Uladzislau Rezki
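
For readers following the thread, the sizing rule under discussion (node
count derived from the CPU count, capped at 128) can be sketched roughly
as below. This is only an illustration in plain C, not the code from the
patch: the names sketch_nr_nodes(), sketch_cpu_to_node() and
SKETCH_MAX_NODES are hypothetical, and the round-robin CPU-to-node
mapping is an assumption made for the example.

```c
/* Hypothetical cap, taken from the "128 nodes" limit quoted in the thread. */
#define SKETCH_MAX_NODES 128u

/*
 * Derive the number of nodes from the number of possible CPUs,
 * clamped to the range [1, SKETCH_MAX_NODES].
 */
unsigned int sketch_nr_nodes(unsigned int possible_cpus)
{
	if (possible_cpus < 1)
		return 1;
	if (possible_cpus > SKETCH_MAX_NODES)
		return SKETCH_MAX_NODES;
	return possible_cpus;
}

/*
 * Assumed mapping for illustration only: CPUs are spread over the
 * nodes round-robin by CPU id, so CPUs beyond the cap share nodes.
 */
unsigned int sketch_cpu_to_node(unsigned int cpu, unsigned int nr_nodes)
{
	return cpu % nr_nodes;
}
```

With such a cap, a 256-CPU system would end up with 128 nodes and two
CPUs per node, which is the kind of sharing the closing question about
scaling inside one socket is concerned with.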