Message-ID: <6857912b-4afd-7fb5-b11b-ebe0e32298c2@arm.com>
Date: Wed, 26 Apr 2023 11:41:48 +0100
Subject: Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory
From: Ryan Roberts <ryan.roberts@arm.com>
To: David Hildenbrand, Andrew Morton, "Matthew Wilcox (Oracle)", Yu Zhao, "Yin, Fengwei"
Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org
References: <20230414130303.2345383-1-ryan.roberts@arm.com> <13969045-4e47-ae5d-73f4-dad40fe631be@arm.com> <568b5b73-f0e9-c385-f628-93e45825fb7b@redhat.com>
Hi David,

On 17/04/2023 16:44, David Hildenbrand wrote:
>>>>> So what should be safe is replacing all sub-pages of a folio that are
>>>>> marked "maybe shared" by a new folio under PT lock. However, I wonder
>>>>> if it's really worth the complexity. For THP we were happy so far to
>>>>> *not* optimize this, implying that maybe we shouldn't worry about
>>>>> optimizing the fork() case for now that heavily.
>>>>
>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember
>>>> enabling large copies was contributing a measurable amount to the
>>>> performance improvement. (Certainly, the zero-page copy case is
>>>> definitely a big contributor.)
>>>> I don't have access to the HW at the moment but can rerun later with
>>>> and without to double check.
>>>
>>> In which test exactly? Some micro-benchmark?
>>
>> The kernel compile benchmark that I quoted numbers for in the cover
>> letter. I have some trace points (not part of the submitted series) that
>> tell me how many mappings of each order we get for each code path. I'm
>> pretty sure I remember all of these 4 code paths contributing
>> non-negligible amounts.
>
> Interesting! It would be great to see if there is an actual difference
> after patch #10 was applied without the other COW replacement.

Sorry about the delay. I now have some numbers for this...

I rearranged the patch order so that all the "utility" stuff (new rmap
functions, etc.) comes first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by
a couple of general improvements (7, 17), which should be dormant until we
have the final patches, then finally (6, 10, 14, 15), which implement large
anon folios for the allocate, reuse, copy-non-zero and copy-zero paths
respectively. I've dropped patch 16 and fixed the copy-exclusive bug you
spotted (by ensuring we never replace an exclusive page).

I've measured performance at the following locations in the patch set:

- baseline: none of my patches applied
- utility: has the utility and general improvement patches applied
- alloc: utility + 6
- reuse: utility + 6 + 10
- copy: utility + 6 + 10 + 14
- zero-alloc: utility + 6 + 10 + 14 + 15

The test is `make defconfig && time make -jN Image` for a clean checkout of
v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw
some per-boot variance (probably down to kaslr, etc.), so I booted each
kernel 7 times for a total of 3x7=21 samples per kernel.
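For clarity, each percentage in the tables that follow is the change in the
per-config mean relative to the baseline mean. A quick sketch of that
calculation (the sample values here are made up for illustration, not the
real measurements):

```python
from statistics import mean

def rel_delta(samples, baseline_samples):
    """Percentage change of mean(samples) relative to the baseline mean."""
    b = mean(baseline_samples)
    return 100.0 * (mean(samples) - b) / b

# Hypothetical real-time samples in seconds (the real runs used 21 each).
baseline = [100.2, 99.8, 100.0]
reuse = [90.6, 90.4, 90.5]
print(f"{rel_delta(reuse, baseline):+.1f}%")  # prints -9.5%
```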
Then I've taken the mean:

jobs=8:

| label      |   real |   user |   kernel |
|:-----------|-------:|-------:|---------:|
| baseline   |   0.0% |   0.0% |     0.0% |
| utility    |  -2.7% |  -2.8% |    -3.1% |
| alloc      |  -6.0% |  -2.3% |   -24.1% |
| reuse      |  -9.5% |  -5.8% |   -28.5% |
| copy       | -10.6% |  -6.9% |   -29.4% |
| zero-alloc |  -9.2% |  -5.1% |   -29.8% |

jobs=160:

| label      |   real |   user |   kernel |
|:-----------|-------:|-------:|---------:|
| baseline   |   0.0% |   0.0% |     0.0% |
| utility    |  -1.8% |  -0.0% |    -7.7% |
| alloc      |  -6.0% |   1.8% |   -20.9% |
| reuse      |  -7.8% |  -1.6% |   -24.1% |
| copy       |  -7.8% |  -2.5% |   -26.8% |
| zero-alloc |  -7.7% |   1.5% |   -29.4% |

So it looks like patch 10 (reuse) is making a difference, but copy and
zero-alloc are not adding a huge amount, as you hypothesized. Personally I
would prefer not to drop those patches though, as it will all help towards
utilization of contiguous PTEs on arm64, which is the second part of the
change that I'm now working on.

For the final config ("zero-alloc") I also collected stats on how many
operations each of the 4 paths was performing, using ftrace and histograms.
"pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of pages in the source folio): do_anonymous_page: { pnr: 1 } hitcount: 2749722 { pnr: 4 } hitcount: 387832 { pnr: 8 } hitcount: 409628 { pnr: 16 } hitcount: 4296115 pages: 76315914 faults: 7843297 pages per fault: 9.7 wp_page_reuse (anon): { pnr: 1, fnr: 1 } hitcount: 47887 { pnr: 3, fnr: 4 } hitcount: 2 { pnr: 4, fnr: 4 } hitcount: 6131 { pnr: 6, fnr: 8 } hitcount: 1 { pnr: 7, fnr: 8 } hitcount: 10 { pnr: 8, fnr: 8 } hitcount: 3794 { pnr: 1, fnr: 16 } hitcount: 36 { pnr: 2, fnr: 16 } hitcount: 23 { pnr: 3, fnr: 16 } hitcount: 5 { pnr: 4, fnr: 16 } hitcount: 9 { pnr: 5, fnr: 16 } hitcount: 8 { pnr: 6, fnr: 16 } hitcount: 9 { pnr: 7, fnr: 16 } hitcount: 3 { pnr: 8, fnr: 16 } hitcount: 24 { pnr: 9, fnr: 16 } hitcount: 2 { pnr: 10, fnr: 16 } hitcount: 1 { pnr: 11, fnr: 16 } hitcount: 9 { pnr: 12, fnr: 16 } hitcount: 2 { pnr: 13, fnr: 16 } hitcount: 27 { pnr: 14, fnr: 16 } hitcount: 2 { pnr: 15, fnr: 16 } hitcount: 54 { pnr: 16, fnr: 16 } hitcount: 6673 pages: 211393 faults: 64712 pages per fault: 3.3 wp_page_copy (anon): { pnr: 1, fnr: 1 } hitcount: 81242 { pnr: 4, fnr: 4 } hitcount: 5974 { pnr: 1, fnr: 8 } hitcount: 1 { pnr: 4, fnr: 8 } hitcount: 1 { pnr: 8, fnr: 8 } hitcount: 12933 { pnr: 1, fnr: 16 } hitcount: 19 { pnr: 4, fnr: 16 } hitcount: 3 { pnr: 8, fnr: 16 } hitcount: 7 { pnr: 16, fnr: 16 } hitcount: 4106 pages: 274390 faults: 104286 pages per fault: 2.6 wp_page_copy (zero): { pnr: 1 } hitcount: 178699 { pnr: 4 } hitcount: 14498 { pnr: 8 } hitcount: 23644 { pnr: 16 } hitcount: 257940 pages: 4552883 faults: 474781 pages per fault: 9.6 Thanks, Ryan