From: Nick Chan
Date: Tue, 22 Oct 2024 23:03:11 +0800
Subject: Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
To: Neal Gompa, Ryan Roberts
Cc: Eric Curtin, Andrew Morton, Anshuman Khandual, Ard Biesheuvel,
 Catalin Marinas, David Hildenbrand, Greg Marsden, Ivan Ivanov,
 Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
 Miroslav Benes, Will Deacon, Hector Martin,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, asahi@lists.linux.dev
Message-ID: <2174ff43-3ab6-409b-a8a8-bd319a134d86@gmail.com>
References: <20241014105514.3206191-1-ryan.roberts@arm.com>
 <4623805.lGaqSPkdTl@skuld-framework>
 <09e480d7-3ef6-4352-a484-91733ad7d231@arm.com>
 <649d7aa6-4163-4969-ba14-777f0e9cddb1@arm.com>
 <872f1c9c-9fb2-4372-810d-abe5419c4bd8@arm.com>

On 2024/10/22 5:33 PM, Neal Gompa wrote:
> On Mon, Oct 21, 2024 at 11:02 AM Ryan Roberts wrote:
>>
>> On 21/10/2024 14:49, Neal Gompa wrote:
>>> On Mon, Oct 21, 2024 at 7:51 AM Ryan Roberts wrote:
>>>>
>>>> On 21/10/2024 12:32, Eric Curtin wrote:
>>>>> On Mon, 21 Oct 2024 at 12:09, Ryan Roberts wrote:
>>>>>>
>>>>>> On 19/10/2024 16:47, Neal Gompa wrote:
>>>>>>> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Patch bomb incoming... This covers many subsystems, so I've included a core
>>>>>>>> set of people on the full series and additionally included maintainers on
>>>>>>>> relevant patches. I haven't included those maintainers on this cover letter
>>>>>>>> since the numbers were far too big for it to work. But I've included a link
>>>>>>>> to this cover letter on each patch, so they can hopefully find their way
>>>>>>>> here. For follow up submissions I'll break it up by subsystem, but for now
>>>>>>>> thought it was important to show the full picture.
>>>>>>>>
>>>>>>>> This RFC series implements support for boot-time page size selection within
>>>>>>>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
>>>>>>>> date, page size has been selected at compile-time, meaning the size is
>>>>>>>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
>>>>>>>> more prevalent this starts to present a problem for distributions.
>>>>>>>> Boot-time page size selection enables the creation of a single kernel
>>>>>>>> image, which can be told which page size to use on the kernel command line.
>>>>>>>>
>>>>>>>> Why is having an image-per-page size problematic?
>>>>>>>> =================================================
>>>>>>>>
>>>>>>>> Many traditional distros are now supporting both 4K and 64K. And this means
>>>>>>>> managing 2 kernel packages, along with drivers for each.
>>>>>>>> For some, it means
>>>>>>>> multiple installer flavours and multiple ISOs. All of this adds up to a
>>>>>>>> less-than-ideal level of complexity. Additionally, Android now supports 4K
>>>>>>>> and 16K kernels. I'm told having to explicitly manage their KABI for each
>>>>>>>> kernel is painful, and the extra flash space required for both kernel
>>>>>>>> images and the duplicated modules has been problematic. Boot-time page size
>>>>>>>> selection solves all of this.
>>>>>>>>
>>>>>>>> Additionally, in starting to think about the longer term deployment story
>>>>>>>> for D128 page tables, which the Arm architecture now supports, a lot of the
>>>>>>>> same problems need to be solved, so this work sets us up nicely for that.
>>>>>>>>
>>>>>>>> So what's the down side?
>>>>>>>> ========================
>>>>>>>>
>>>>>>>> Well, nothing's free; various static allocations in the kernel image must be
>>>>>>>> sized for the worst case (largest supported page size), so image size is in
>>>>>>>> line with the size of a 64K compile-time image. So if you're interested in 4K or
>>>>>>>> 16K, there is a slight increase to the image size. But I expect that
>>>>>>>> problem goes away if you're compressing the image - it's just some extra
>>>>>>>> zeros. At boot-time, I expect we could free the unused static storage once
>>>>>>>> we know the page size - although that would be a follow up enhancement.
>>>>>>>>
>>>>>>>> And then there is performance. Since PAGE_SIZE and friends are no longer
>>>>>>>> compile-time constants, we must look up their values and do arithmetic at
>>>>>>>> runtime instead of compile-time. My early perf testing suggests this is
>>>>>>>> imperceptible for real-world workloads, and only has a small impact on
>>>>>>>> microbenchmarks - more on this below.
>>>>>>>>
>>>>>>>> Approach
>>>>>>>> ========
>>>>>>>>
>>>>>>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
>>>>>>>> friends are compile-time constant, but in a way that allows the compiler to
>>>>>>>> perform the same optimizations as was previously being done if they do turn
>>>>>>>> out to be compile-time constant. Where constants are required, we use
>>>>>>>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
>>>>>>>> description of all the classes of problems to solve.
>>>>>>>>
>>>>>>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
>>>>>>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
>>>>>>>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
>>>>>>>> Kconfig, which is an alternative to selecting a compile-time page size.
>>>>>>>>
>>>>>>>> When boot-time page size is active, the arch pgtable geometry macro
>>>>>>>> definitions resolve to something that can be configured at boot. The arm64
>>>>>>>> implementation in this series mainly uses global, __ro_after_init
>>>>>>>> variables. I've tried using alternatives patching, but that performs worse
>>>>>>>> than loading from memory; I think due to code size bloat.
>>>>>>>>
>>>>>>>> Status
>>>>>>>> ======
>>>>>>>>
>>>>>>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
>>>>>>>> enough to compile the kernel image itself with defconfig (and a few other
>>>>>>>> bits and pieces). This is enough to build a kernel that can boot under QEMU
>>>>>>>> or FVP. I'll happily do the rest of the work to enable all the extra
>>>>>>>> drivers, but wanted to get feedback on the shape of this effort first. If
>>>>>>>> anyone wants to do any testing, and has a must-have config, let me know and
>>>>>>>> I'll prioritize enabling it first.
>>>>>>>>
>>>>>>>> The series is arranged as follows:
>>>>>>>>
>>>>>>>> - patch 1: Add macros required for converting non-arch code to support
>>>>>>>>   boot-time page size selection
>>>>>>>> - patches 2-36: Remove PAGE_SIZE compile-time constant assumption from
>>>>>>>>   all non-arch code
>>>>>>>> - patches 37-38: Some arm64 tidy ups
>>>>>>>> - patch 39: Add macros required for converting arm64 code to support
>>>>>>>>   boot-time page size selection
>>>>>>>> - patches 40-56: arm64 changes to support boot-time page size selection
>>>>>>>> - patch 57: Add arm64 Kconfig option to enable boot-time page size
>>>>>>>>   selection
>>>>>>>>
>>>>>>>> Ideally, I'd like to get the basics merged (something like this series),
>>>>>>>> then incrementally improve it over a handful of kernel releases until we
>>>>>>>> can demonstrate that we have feature parity with the compile-time build and
>>>>>>>> no performance blockers. Once at that point, ideally the compile-time build
>>>>>>>> options would be removed and the code could be cleaned up further.
>>>>>>>>
>>>>>>>> One of the bigger pieces that I'd propose to add as a follow up is to make
>>>>>>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
>>>>>>>> handling.
>>>>>>>>
>>>>>>>> Assuming people are amenable to the rough shape, how would I go about
>>>>>>>> getting the non-arch changes merged? Since they cover many subsystems, will
>>>>>>>> each piece need to go independently to each relevant maintainer or could it
>>>>>>>> all be merged together through the arm64 tree?
>>>>>>>>
>>>>>>>> Image Size
>>>>>>>> ==========
>>>>>>>>
>>>>>>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
>>>>>>>> kernel image on disk for base (before any changes applied), compile (with
>>>>>>>> changes, configured for compile-time page size) and boot (with changes,
>>>>>>>> configured for boot-time page size).
>>>>>>>>
>>>>>>>> You can see that the compile-16k and 64k configs are actually slightly
>>>>>>>> smaller than the baselines; that's due to optimizing some buffer sizes
>>>>>>>> which didn't need to depend on page size during the series. The boot-time
>>>>>>>> image is ~1% bigger than the 64k compile-time image. I believe there is
>>>>>>>> scope to improve this to make it equal to compile-64k if required:
>>>>>>>>
>>>>>>>> | config      | size/KB | diff/KB | diff/%  |
>>>>>>>> |-------------|---------|---------|---------|
>>>>>>>> | base-4k     |   54895 |       0 |    0.0% |
>>>>>>>> | base-16k    |   55161 |     266 |    0.5% |
>>>>>>>> | base-64k    |   56775 |    1880 |    3.4% |
>>>>>>>> | compile-4k  |   54895 |       0 |    0.0% |
>>>>>>>> | compile-16k |   55097 |     202 |    0.4% |
>>>>>>>> | compile-64k |   56391 |    1496 |    2.7% |
>>>>>>>> | boot-4K     |   57045 |    2150 |    3.9% |
>>>>>>>>
>>>>>>>> And below shows the size of the image in memory at run-time, separated for
>>>>>>>> text and data costs. The boot image has ~1% text cost; most likely due to
>>>>>>>> the fact that PAGE_SIZE and friends are not compile-time constants so need
>>>>>>>> instructions to load the values and do arithmetic.
>>>>>>>> I believe we could
>>>>>>>> eventually get the data cost to match the cost for the compile image for
>>>>>>>> the chosen page size by freeing the ends of the static buffers not needed
>>>>>>>> for the selected page size:
>>>>>>>>
>>>>>>>> |             | text    | text    | text    | data    | data    | data    |
>>>>>>>> | config      | size/KB | diff/KB | diff/%  | size/KB | diff/KB | diff/%  |
>>>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>>>>>> | base-4k     |   20561 |       0 |    0.0% |   14314 |       0 |    0.0% |
>>>>>>>> | base-16k    |   20439 |    -122 |   -0.6% |   14625 |     311 |    2.2% |
>>>>>>>> | base-64k    |   20435 |    -126 |   -0.6% |   15673 |    1359 |    9.5% |
>>>>>>>> | compile-4k  |   20565 |       4 |    0.0% |   14315 |       1 |    0.0% |
>>>>>>>> | compile-16k |   20443 |    -118 |   -0.6% |   14517 |     204 |    1.4% |
>>>>>>>> | compile-64k |   20439 |    -122 |   -0.6% |   15134 |     820 |    5.7% |
>>>>>>>> | boot-4K     |   20811 |     250 |    1.2% |   15287 |     973 |    6.8% |
>>>>>>>>
>>>>>>>> Functional Testing
>>>>>>>> ==================
>>>>>>>>
>>>>>>>> I've build-tested defconfig for all arches supported by tuxmake (which is
>>>>>>>> most) without issue.
>>>>>>>>
>>>>>>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
>>>>>>>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
>>>>>>>> with no regressions observed vs the equivalent compile-time page size build
>>>>>>>> (although the mm-selftests have a few existing failures when run against
>>>>>>>> 16K and 64K kernels - those should really be investigated and fixed
>>>>>>>> independently).
>>>>>>>>
>>>>>>>> Test coverage is lacking for many of the drivers that I've touched, but in
>>>>>>>> many cases, I'm hoping the changes are simple enough that review might
>>>>>>>> suffice?
>>>>>>>>
>>>>>>>> Performance Testing
>>>>>>>> ===================
>>>>>>>>
>>>>>>>> I've run some limited performance benchmarks:
>>>>>>>>
>>>>>>>> First, a real-world benchmark that causes a lot of page table manipulation
>>>>>>>> (and therefore we would expect to see regression here if we are going to
>>>>>>>> see it anywhere); kernel compilation. It barely registers a change. Values
>>>>>>>> are times, so smaller is better. All relative to base-4k:
>>>>>>>>
>>>>>>>> |             | kern    | kern    | user    | user    | real    | real    |
>>>>>>>> | config      | mean    | stdev   | mean    | stdev   | mean    | stdev   |
>>>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
>>>>>>>> | base-4k     |    0.0% |    1.1% |    0.0% |    0.3% |    0.0% |    0.3% |
>>>>>>>> | compile-4k  |   -0.2% |    1.1% |   -0.2% |    0.3% |   -0.1% |    0.3% |
>>>>>>>> | boot-4k     |    0.1% |    1.0% |   -0.3% |    0.2% |   -0.2% |    0.2% |
>>>>>>>>
>>>>>>>> The Speedometer JavaScript benchmark also shows no change. Values are runs
>>>>>>>> per min, so bigger is better. All relative to base-4k:
>>>>>>>>
>>>>>>>> | config      | mean    | stdev   |
>>>>>>>> |-------------|---------|---------|
>>>>>>>> | base-4k     |    0.0% |    0.8% |
>>>>>>>> | compile-4k  |    0.4% |    0.8% |
>>>>>>>> | boot-4k     |    0.0% |    0.9% |
>>>>>>>>
>>>>>>>> Finally, I've run some microbenchmarks known to stress page table
>>>>>>>> manipulations (originally from David Hildenbrand). The fork test
>>>>>>>> maps/allocs 1G of anon memory, then measures the cost of fork(). The munmap
>>>>>>>> test maps/allocs 1G of anon memory then measures the cost of munmap()ing
>>>>>>>> it. The fork test is known to be extremely sensitive to any changes that
>>>>>>>> cause instructions to be aligned differently in cachelines. When using this
>>>>>>>> test for other changes, I've seen double digit regressions for the
>>>>>>>> slightest thing, so 12% regression on this test is actually fairly good.
>>>>>>>> This likely represents the extreme worst case for regressions that will be
>>>>>>>> observed across other microbenchmarks (famous last words). Values are
>>>>>>>> times, so smaller is better. All relative to base-4k:
>>>>>>>>
>>>>>>>> |             | fork    | fork    | munmap  | munmap  |
>>>>>>>> | config      | mean    | stdev   | mean    | stdev   |
>>>>>>>> |-------------|---------|---------|---------|---------|
>>>>>>>> | base-4k     |    0.0% |    1.3% |    0.0% |    0.3% |
>>>>>>>> | compile-4k  |    0.1% |    1.3% |   -0.9% |    0.1% |
>>>>>>>> | boot-4k     |   12.8% |    1.2% |    3.8% |    1.0% |
>>>>>>>>
>>>>>>>> NOTE: The series applies on top of v6.11.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>
>>>>>>>> Ryan Roberts (57):
>>>>>>>>   mm: Add macros ahead of supporting boot-time page size selection
>>>>>>>>   vmlinux: Align to PAGE_SIZE_MAX
>>>>>>>>   mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE is large
>>>>>>>>   mm/page_alloc: Make page_frag_cache boot-time page size compatible
>>>>>>>>   mm: Avoid split pmd ptl if pmd level is run-time folded
>>>>>>>>   mm: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing
>>>>>>>>   fs: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   fs/nfs: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   fs/ext4: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   fork: Permit boot-time THREAD_SIZE determination
>>>>>>>>   cgroup: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   bpf: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   pm/hibernate: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   stackdepot: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   perf: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   kvm: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   trace: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   crash: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   crypto: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   sunrpc: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   sound: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   net: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   net: fec: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   net: marvell: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   net: hns3: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   net: e1000: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   net: igbvf: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   net: igb: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   drivers/base: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   edac: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   optee: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   random: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   sata_sil24: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   virtio: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   xen: Remove PAGE_SIZE compile-time constant assumption
>>>>>>>>   arm64: Fix macros to work in C code in addition to the linker script
>>>>>>>>   arm64: Track early pgtable allocation limit
>>>>>>>>   arm64: Introduce macros required for boot-time page selection
>>>>>>>>   arm64: Refactor early pgtable size calculation macros
>>>>>>>>   arm64: Pass desired page size on command line
>>>>>>>>   arm64: Divorce early init from PAGE_SIZE
>>>>>>>>   arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES
>>>>>>>>   arm64: Align sections to PAGE_SIZE_MAX
>>>>>>>>   arm64: Rework trampoline rodata mapping
>>>>>>>>   arm64: Generalize fixmap for boot-time page size
>>>>>>>>   arm64: Statically allocate and align for worst-case page size
>>>>>>>>   arm64: Convert switch to if for non-const comparison values
>>>>>>>>   arm64: Convert BUILD_BUG_ON to VM_BUG_ON
>>>>>>>>   arm64: Remove PAGE_SZ asm-offset
>>>>>>>>   arm64: Introduce cpu features for page sizes
>>>>>>>>   arm64: Remove PAGE_SIZE from assembly code
>>>>>>>>   arm64: Runtime-fold pmd level
>>>>>>>>   arm64: Support runtime folding in idmap_kpti_install_ng_mappings
>>>>>>>>   arm64: TRAMP_VALIAS is no longer compile-time constant
>>>>>>>>   arm64: Determine THREAD_SIZE at boot-time
>>>>>>>>   arm64: Enable boot-time page size selection
>>>>>>>>
>>>>>>>>  arch/alpha/include/asm/page.h | 1 +
>>>>>>>>  arch/arc/include/asm/page.h | 1 +
>>>>>>>>  arch/arm/include/asm/page.h | 1 +
>>>>>>>>  arch/arm64/Kconfig | 26 ++-
>>>>>>>>  arch/arm64/include/asm/assembler.h | 78 ++++++-
>>>>>>>>  arch/arm64/include/asm/cpufeature.h | 44 +++-
>>>>>>>>  arch/arm64/include/asm/efi.h | 2 +-
>>>>>>>>  arch/arm64/include/asm/fixmap.h | 28 ++-
>>>>>>>>  arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++----
>>>>>>>>  arch/arm64/include/asm/kvm_arm.h | 21 +-
>>>>>>>>  arch/arm64/include/asm/kvm_hyp.h | 11 +
>>>>>>>>  arch/arm64/include/asm/kvm_pgtable.h | 6 +-
>>>>>>>>  arch/arm64/include/asm/memory.h | 62 ++++--
>>>>>>>>  arch/arm64/include/asm/page-def.h | 3 +-
>>>>>>>>  arch/arm64/include/asm/pgalloc.h | 16 +-
>>>>>>>>  arch/arm64/include/asm/pgtable-geometry.h | 46 ++++
>>>>>>>>  arch/arm64/include/asm/pgtable-hwdef.h | 28 ++-
>>>>>>>>  arch/arm64/include/asm/pgtable-prot.h | 2 +-
>>>>>>>>  arch/arm64/include/asm/pgtable.h | 133 +++++++++---
>>>>>>>>  arch/arm64/include/asm/processor.h | 10 +-
>>>>>>>>  arch/arm64/include/asm/sections.h | 1 +
>>>>>>>>  arch/arm64/include/asm/smp.h | 1 +
>>>>>>>>  arch/arm64/include/asm/sparsemem.h | 15 +-
>>>>>>>>  arch/arm64/include/asm/sysreg.h | 54 +++--
>>>>>>>>  arch/arm64/include/asm/tlb.h | 3 +
>>>>>>>>  arch/arm64/kernel/asm-offsets.c | 4 +-
>>>>>>>>  arch/arm64/kernel/cpufeature.c | 93 ++++++--
>>>>>>>>  arch/arm64/kernel/efi.c | 2 +-
>>>>>>>>  arch/arm64/kernel/entry.S | 60 +++++-
>>>>>>>>  arch/arm64/kernel/head.S | 46 +++-
>>>>>>>>  arch/arm64/kernel/hibernate-asm.S | 6 +-
>>>>>>>>  arch/arm64/kernel/image-vars.h | 14 ++
>>>>>>>>  arch/arm64/kernel/image.h | 4 +
>>>>>>>>  arch/arm64/kernel/pi/idreg-override.c | 68 +++++-
>>>>>>>>  arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++----
>>>>>>>>  arch/arm64/kernel/pi/map_range.c | 201 ++++++++++++++++--
>>>>>>>>  arch/arm64/kernel/pi/pi.h | 63 +++++-
>>>>>>>>  arch/arm64/kernel/relocate_kernel.S | 10 +-
>>>>>>>>  arch/arm64/kernel/vdso-wrap.S | 4 +-
>>>>>>>>  arch/arm64/kernel/vdso.c | 7 +-
>>>>>>>>  arch/arm64/kernel/vdso/vdso.lds.S | 4 +-
>>>>>>>>  arch/arm64/kernel/vdso32-wrap.S | 4 +-
>>>>>>>>  arch/arm64/kernel/vdso32/vdso.lds.S | 4 +-
>>>>>>>>  arch/arm64/kernel/vmlinux.lds.S | 48 +++--
>>>>>>>>  arch/arm64/kvm/arm.c | 10 +
>>>>>>>>  arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
>>>>>>>>  arch/arm64/kvm/hyp/nvhe/host.S | 10 +-
>>>>>>>>  arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +-
>>>>>>>>  arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++
>>>>>>>>  arch/arm64/kvm/mmu.c | 39 ++--
>>>>>>>>  arch/arm64/lib/clear_page.S | 7 +-
>>>>>>>>  arch/arm64/lib/copy_page.S | 33 ++-
>>>>>>>>  arch/arm64/lib/mte.S | 27 ++-
>>>>>>>>  arch/arm64/mm/Makefile | 1 +
>>>>>>>>  arch/arm64/mm/fixmap.c | 38 ++--
>>>>>>>>  arch/arm64/mm/hugetlbpage.c | 40 +---
>>>>>>>>  arch/arm64/mm/init.c | 26 +--
>>>>>>>>  arch/arm64/mm/kasan_init.c | 8 +-
>>>>>>>>  arch/arm64/mm/mmu.c | 53 +++--
>>>>>>>>  arch/arm64/mm/pgd.c | 12 +-
>>>>>>>>  arch/arm64/mm/pgtable-geometry.c | 24 +++
>>>>>>>>  arch/arm64/mm/proc.S | 128 ++++++++---
>>>>>>>>  arch/arm64/mm/ptdump.c | 3 +-
>>>>>>>>  arch/arm64/tools/cpucaps | 3 +
>>>>>>>>  arch/csky/include/asm/page.h | 3 +
>>>>>>>>  arch/hexagon/include/asm/page.h | 2 +
>>>>>>>>  arch/loongarch/include/asm/page.h | 2 +
>>>>>>>>  arch/m68k/include/asm/page.h | 1 +
>>>>>>>>  arch/microblaze/include/asm/page.h | 1 +
>>>>>>>>  arch/mips/include/asm/page.h | 1 +
>>>>>>>>  arch/nios2/include/asm/page.h | 2 +
>>>>>>>>  arch/openrisc/include/asm/page.h | 1 +
>>>>>>>>  arch/parisc/include/asm/page.h | 1 +
>>>>>>>>  arch/powerpc/include/asm/page.h | 2 +
>>>>>>>>  arch/riscv/include/asm/page.h | 1 +
>>>>>>>>  arch/s390/include/asm/page.h | 1 +
>>>>>>>>  arch/sh/include/asm/page.h | 1 +
>>>>>>>>  arch/sparc/include/asm/page.h | 3 +
>>>>>>>>  arch/um/include/asm/page.h | 2 +
>>>>>>>>  arch/x86/include/asm/page_types.h | 2 +
>>>>>>>>  arch/xtensa/include/asm/page.h | 1 +
>>>>>>>>  crypto/lskcipher.c | 4 +-
>>>>>>>>  drivers/ata/sata_sil24.c | 46 ++--
>>>>>>>>  drivers/base/node.c | 6 +-
>>>>>>>>  drivers/base/topology.c | 32 +--
>>>>>>>>  drivers/block/virtio_blk.c | 2 +-
>>>>>>>>  drivers/char/random.c | 4 +-
>>>>>>>>  drivers/edac/edac_mc.h | 13 +-
>>>>>>>>  drivers/firmware/efi/libstub/arm64.c | 3 +-
>>>>>>>>  drivers/irqchip/irq-gic-v3-its.c | 2 +-
>>>>>>>>  drivers/mtd/mtdswap.c | 4 +-
>>>>>>>>  drivers/net/ethernet/freescale/fec.h | 3 +-
>>>>>>>>  drivers/net/ethernet/freescale/fec_main.c | 5 +-
>>>>>>>>  .../net/ethernet/hisilicon/hns3/hns3_enet.h | 4 +-
>>>>>>>>  drivers/net/ethernet/intel/e1000/e1000_main.c | 6 +-
>>>>>>>>  drivers/net/ethernet/intel/igb/igb.h | 25 +--
>>>>>>>>  drivers/net/ethernet/intel/igb/igb_main.c | 149 +++++++------
>>>>>>>>  drivers/net/ethernet/intel/igbvf/netdev.c | 6 +-
>>>>>>>>  drivers/net/ethernet/marvell/mvneta.c | 9 +-
>>>>>>>>  drivers/net/ethernet/marvell/sky2.h | 2 +-
>>>>>>>>  drivers/tee/optee/call.c | 7 +-
>>>>>>>>  drivers/tee/optee/smc_abi.c | 2 +-
>>>>>>>>  drivers/virtio/virtio_balloon.c | 10 +-
>>>>>>>>  drivers/xen/balloon.c | 11 +-
>>>>>>>>  drivers/xen/biomerge.c | 12 +-
>>>>>>>>  drivers/xen/privcmd.c | 2 +-
>>>>>>>>  drivers/xen/xenbus/xenbus_client.c | 5 +-
>>>>>>>>  drivers/xen/xlate_mmu.c | 6 +-
>>>>>>>>  fs/binfmt_elf.c | 11 +-
>>>>>>>>  fs/buffer.c | 2 +-
>>>>>>>>  fs/coredump.c | 8 +-
>>>>>>>>  fs/ext4/ext4.h | 36 ++--
>>>>>>>>  fs/ext4/move_extent.c | 2 +-
>>>>>>>>  fs/ext4/readpage.c | 2 +-
>>>>>>>>  fs/fat/dir.c | 4 +-
>>>>>>>>  fs/fat/fatent.c | 4 +-
>>>>>>>>  fs/nfs/nfs42proc.c | 2 +-
>>>>>>>>  fs/nfs/nfs42xattr.c | 2 +-
>>>>>>>>  fs/nfs/nfs4proc.c | 2 +-
>>>>>>>>  include/asm-generic/pgtable-geometry.h | 71 +++++++
>>>>>>>>  include/asm-generic/vmlinux.lds.h | 38 ++--
>>>>>>>>  include/linux/buffer_head.h | 1 +
>>>>>>>>  include/linux/cpumask.h | 5 +
>>>>>>>>  include/linux/linkage.h | 4 +-
>>>>>>>>  include/linux/mm.h | 17 +-
>>>>>>>>  include/linux/mm_types.h | 15 +-
>>>>>>>>  include/linux/mm_types_task.h | 2 +-
>>>>>>>>  include/linux/mmzone.h | 3 +-
>>>>>>>>  include/linux/netlink.h | 6 +-
>>>>>>>>  include/linux/percpu-defs.h | 4 +-
>>>>>>>>  include/linux/perf_event.h | 2 +-
>>>>>>>>  include/linux/sched.h | 4 +-
>>>>>>>>  include/linux/slab.h | 7 +-
>>>>>>>>  include/linux/stackdepot.h | 6 +-
>>>>>>>>  include/linux/sunrpc/svc.h | 8 +-
>>>>>>>>  include/linux/sunrpc/svc_rdma.h | 4 +-
>>>>>>>>  include/linux/sunrpc/svcsock.h | 2 +-
>>>>>>>>  include/linux/swap.h | 17 +-
>>>>>>>>  include/linux/swapops.h | 6 +-
>>>>>>>>  include/linux/thread_info.h | 10 +-
>>>>>>>>  include/xen/page.h | 2 +
>>>>>>>>  init/main.c | 7 +-
>>>>>>>>  kernel/bpf/core.c | 9 +-
>>>>>>>>  kernel/bpf/ringbuf.c | 54 ++---
>>>>>>>>  kernel/cgroup/cgroup.c | 8 +-
>>>>>>>>  kernel/crash_core.c | 2 +-
>>>>>>>>  kernel/events/core.c | 2 +-
>>>>>>>>  kernel/fork.c | 71 +++----
>>>>>>>>  kernel/power/power.h | 2 +-
>>>>>>>>  kernel/power/snapshot.c | 2 +-
>>>>>>>>  kernel/power/swap.c | 129 +++++++++--
>>>>>>>>  kernel/trace/fgraph.c | 2 +-
>>>>>>>>  kernel/trace/trace.c | 2 +-
>>>>>>>>  lib/stackdepot.c | 6 +-
>>>>>>>>  mm/kasan/report.c | 3 +-
>>>>>>>>  mm/memcontrol.c | 11 +-
>>>>>>>>  mm/memory.c | 4 +-
>>>>>>>>  mm/mmap.c | 2 +-
>>>>>>>>  mm/page-writeback.c | 2 +-
>>>>>>>>  mm/page_alloc.c | 31 +--
>>>>>>>>  mm/slub.c | 2 +-
>>>>>>>>  mm/sparse.c | 2 +-
>>>>>>>>  mm/swapfile.c | 2 +-
>>>>>>>>  mm/vmalloc.c | 7 +-
>>>>>>>>  net/9p/trans_virtio.c | 4 +-
>>>>>>>>  net/core/hotdata.c | 4 +-
>>>>>>>>  net/core/skbuff.c | 4 +-
>>>>>>>>  net/core/sysctl_net_core.c | 2 +-
>>>>>>>>  net/sunrpc/cache.c | 3 +-
>>>>>>>>  net/unix/af_unix.c | 2 +-
>>>>>>>>  sound/soc/soc-utils.c | 4 +-
>>>>>>>>  virt/kvm/kvm_main.c | 2 +-
>>>>>>>>  172 files changed, 2185 insertions(+), 951 deletions(-)
>>>>>>>>  create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
>>>>>>>>  create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
>>>>>>>>  create mode 100644 arch/arm64/mm/pgtable-geometry.c
>>>>>>>>  create mode 100644 include/asm-generic/pgtable-geometry.h
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.43.0
>>>>>>>
>>>>>>> This is a generally very exciting patch set! I'm looking forward to seeing it
>>>>>>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
>>>>>>>
>>>>>>> That said, I have a couple of questions:
>>>>>>>
>>>>>>> * Going forward, how would we handle drivers/modules that require a particular
>>>>>>> page size? For example, the Apple Silicon IOMMU driver code requires the
>>>>>>> kernel to operate in 16k page size mode, and it would need to be disabled in
>>>>>>> other page sizes.
>>>>>>
>>>>>> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
>>>>>> unsupported page size is in use. Do you see any issue with that?
>>>>>>
>>>>>>>
>>>>>>> * How would we handle an invalid selection at boot?
>>>>>>
>>>>>> What do you mean by invalid here? The current policy validates that the
>>>>>> requested page size is supported by the HW by checking mmfr0. If no page size is
>>>>>> passed on the command line, or the passed value is not supported by the HW, then
>>>>>> we default to the largest page size supported by the HW (so for Apple
>>>>>> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
>>>>>> may be better to change that policy to use the smallest page size in this case;
>>>>>> 4k is the safer bet for compat and will waste much less memory than 64k.
>>>>>>
>>>>>>> Can we program in a
>>>>>>> fallback when the "wrong" mode is selected for a chip or something similar?
>>>>>>
>>>>>> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
>>>>>> Silicon?
>>>>>> The trouble is that we need to select the page size very early in
>>>>>> boot, before start_kernel() is called, so we really only have generic arch code
>>>>>> and the command line with which to make the decision.
>>>>>
>>>>> Yes... I think a build-time CONFIG for default page size, which can be
>>>>> overridden by a karg makes sense... Even on platforms like Apple
>>>>> Silicon you may want to test very specific things in 4k by overriding
>>>>> with a karg.
>>>>
>>>> Ahh, yes, that would certainly work. I'll work it into the next version.
>>>>
>>>
>>> Could we maybe extend to have some kind of way to include a table of
>>> SoC IDs for which certain modes are disabled (e.g. 64k on Apple Silicon)
>>
>> 64k is already disabled on Apple Silicon because mmfr0 reports that 64k is not
>> supported.
>>
>>> and preferred modes when no arg is set (16k for Apple Silicon)? That
>>
>> And it's not obvious that we should hard-code a page size preference to a SoC
>> ID. If the CPU can support multiple page sizes, it should be up to the SW stack
>> to decide, not the SoC.
>>
>> I'm guessing your desire is to have a single kernel build that will boot 16k by
>> default on Apple Silicon and 4k by default on other systems, all without needing
>> to modify the command line? Personally I think it's cleaner to just require
>> setting the page size on the command line in these cases.
>>
>>> way it'd work something like this:
>>>
>>> 1. Table identification of 4/16/64 depending on identified SoC
>> So I'd prefer not to have this
>>
>>> 2. Unidentified ones follow build-time default
>>> 3. karg forces a mode regardless
>> But keep these 2.
>>
>

Since we are talking about Apple Silicon and page size, I would like to add
that on the Apple Silicon SoCs I am working on, the situation is like this:

Apple A7 (s5l8960x), A8 (T7000), A8X (T7001): CPU MMU supports 4K and 64K
page sizes.
Apple A9 (s8000/s8003), A9X (s8001), A10 (t8010), A10X (t8011), A11 (t8015):
CPU MMU supports 16K and 64K page sizes.
However, all of them have 4K page DART IOMMUs.

> I think it makes sense to have it, because it's not just Apple Silicon
> where such a preference/requirement may be necessary. Apple Silicon
> technically works at 4k, but is completely broken at 4k because Linux
> cannot do 16k IOMMU with 4k everything else, so being able to at least
> prefer 16k out of the box is important. And SoCs like the NVIDIA Grace
> Hopper platform prefer 64k over other options (though I am unaware of
> a gross incompatibility that effectively requires it like Apple
> Silicon has).
>
> When we're trying to get to "single generic image that works
> everywhere", stuff like this matters and I would really like you to
> consider it from the lens of "we want things to work as automagic as
> they do on x86".

For me, in order to get to this level of automagic, there does need to be a
table of which SoC should use which page size.

>
>
Nick Chan
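
To make the ordering proposed in this thread concrete (the karg wins, then a
per-SoC preference table, then the build-time default), here is a rough
userspace C sketch. It is only an illustration under stated assumptions: the
PS_* flags, the soc_table entries and their ID strings, and
select_page_size() are all hypothetical and are not taken from the patch set
or from any kernel API. The last-resort fallback to the smallest supported
size follows the policy change floated earlier in the thread, not the current
behaviour of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical bit flags for the three granule sizes (not kernel macros). */
#define PS_4K  (1u << 0)
#define PS_16K (1u << 1)
#define PS_64K (1u << 2)

/* Hypothetical per-SoC preference entry; the ID strings are made up. */
struct soc_pref {
	const char *soc_id;
	unsigned int preferred;	/* exactly one PS_* bit */
};

static const struct soc_pref soc_table[] = {
	{ "example-apple-soc", PS_16K },
	{ "example-grace-soc", PS_64K },
};

/*
 * hw_mask:       PS_* bits the hardware reports as supported (non-zero)
 * karg:          one PS_* bit from the command line, or 0 if not given
 * build_default: one PS_* bit chosen at build time, or 0 if none
 */
unsigned int select_page_size(const char *soc_id, unsigned int hw_mask,
			      unsigned int karg, unsigned int build_default)
{
	size_t i;

	assert(hw_mask != 0);	/* at least one size must be supported */

	/* 3. karg forces a mode, as long as the hardware supports it. */
	if (karg & hw_mask)
		return karg;

	/* 1. Table lookup for an identified SoC. */
	if (soc_id) {
		for (i = 0; i < sizeof(soc_table) / sizeof(soc_table[0]); i++) {
			if (strcmp(soc_id, soc_table[i].soc_id) == 0 &&
			    (soc_table[i].preferred & hw_mask))
				return soc_table[i].preferred;
		}
	}

	/* 2. Unidentified SoCs follow the build-time default. */
	if (build_default & hw_mask)
		return build_default;

	/* Assumed last resort: smallest supported size (lowest set bit). */
	return hw_mask & (~hw_mask + 1u);
}
```

With this ordering, a generic image booted on the hypothetical
"example-apple-soc" would pick 16k without any command-line change, while a
4k karg would still override it for testing. Dropping step 1 and keeping only
steps 2 and 3 amounts to deleting the table loop.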