From: Neal Gompa
Date: Tue, 22 Oct 2024 13:30:32 -0400
Subject: Re: [RFC PATCH v1 00/57] Boot-time page size selection for arm64
To: Ryan Roberts
Cc: Nick Chan, Eric Curtin, Andrew Morton, Anshuman Khandual,
 Ard Biesheuvel, Catalin Marinas, David Hildenbrand, Greg Marsden,
 Ivan Ivanov, Kalesh Singh, Marc Zyngier, Mark Rutland, Matthias Brugger,
 Miroslav Benes, Will Deacon, Hector Martin,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, asahi@lists.linux.dev
In-Reply-To: <997f1826-ec45-4d47-ad94-33c0d194b5a4@arm.com>
References: <20241014105514.3206191-1-ryan.roberts@arm.com>
 <4623805.lGaqSPkdTl@skuld-framework>
 <09e480d7-3ef6-4352-a484-91733ad7d231@arm.com>
 <649d7aa6-4163-4969-ba14-777f0e9cddb1@arm.com>
 <872f1c9c-9fb2-4372-810d-abe5419c4bd8@arm.com>
 <2174ff43-3ab6-409b-a8a8-bd319a134d86@gmail.com>
 <997f1826-ec45-4d47-ad94-33c0d194b5a4@arm.com>

On Tue, Oct 22, 2024 at 11:12 AM Ryan Roberts wrote:
>
> On 22/10/2024 16:03, Nick Chan wrote:
> >
> >
> > Neal Gompa wrote on 2024/10/22 at 5:33 PM:
> >> On Mon, Oct 21, 2024 at 11:02 AM Ryan Roberts wrote:
> >>>
> >>> On 21/10/2024 14:49, Neal Gompa wrote:
> >>>> On Mon, Oct 21, 2024 at 7:51 AM Ryan Roberts wrote:
> >>>>>
> >>>>> On 21/10/2024 12:32, Eric Curtin wrote:
> >>>>>> On Mon, 21 Oct 2024 at 12:09, Ryan Roberts wrote:
> >>>>>>>
> >>>>>>> On 19/10/2024 16:47, Neal Gompa wrote:
> >>>>>>>> On Monday, October 14, 2024 6:55:11 AM EDT Ryan Roberts wrote:
> >>>>>>>>> Hi All,
> >>>>>>>>>
> >>>>>>>>> Patch bomb incoming... This covers many subsystems, so I've included a core
> >>>>>>>>> set of people on the full series and additionally included maintainers on
> >>>>>>>>> relevant patches. I haven't included those maintainers on this cover letter
> >>>>>>>>> since the numbers were far too big for it to work. But I've included a link
> >>>>>>>>> to this cover letter on each patch, so they can hopefully find their way
> >>>>>>>>> here. For follow up submissions I'll break it up by subsystem, but for now
> >>>>>>>>> thought it was important to show the full picture.
> >>>>>>>>>
> >>>>>>>>> This RFC series implements support for boot-time page size selection within
> >>>>>>>>> the arm64 kernel. arm64 supports 3 base page sizes (4K, 16K, 64K), but to
> >>>>>>>>> date, page size has been selected at compile-time, meaning the size is
> >>>>>>>>> baked into a given kernel image. As use of larger-than-4K page sizes becomes
> >>>>>>>>> more prevalent this starts to present a problem for distributions.
> >>>>>>>>> Boot-time page size selection enables the creation of a single kernel
> >>>>>>>>> image, which can be told which page size to use on the kernel command line.
> >>>>>>>>>
> >>>>>>>>> Why is having an image-per-page size problematic?
> >>>>>>>>> =================================================
> >>>>>>>>>
> >>>>>>>>> Many traditional distros are now supporting both 4K and 64K. And this means
> >>>>>>>>> managing 2 kernel packages, along with drivers for each. For some, it means
> >>>>>>>>> multiple installer flavours and multiple ISOs. All of this adds up to a
> >>>>>>>>> less-than-ideal level of complexity. Additionally, Android now supports 4K
> >>>>>>>>> and 16K kernels. I'm told having to explicitly manage their KABI for each
> >>>>>>>>> kernel is painful, and the extra flash space required for both kernel
> >>>>>>>>> images and the duplicated modules has been problematic. Boot-time page size
> >>>>>>>>> selection solves all of this.
> >>>>>>>>>
> >>>>>>>>> Additionally, in starting to think about the longer term deployment story
> >>>>>>>>> for D128 page tables, which Arm architecture now supports, a lot of the
> >>>>>>>>> same problems need to be solved, so this work sets us up nicely for that.
> >>>>>>>>>
> >>>>>>>>> So what's the down side?
> >>>>>>>>> ========================
> >>>>>>>>>
> >>>>>>>>> Well, nothing's free; various static allocations in the kernel image must be
> >>>>>>>>> sized for the worst case (largest supported page size), so image size is in
> >>>>>>>>> line with size of 64K compile-time image. So if you're interested in 4K or
> >>>>>>>>> 16K, there is a slight increase to the image size. But I expect that
> >>>>>>>>> problem goes away if you're compressing the image - it's just some extra
> >>>>>>>>> zeros. At boot-time, I expect we could free the unused static storage once
> >>>>>>>>> we know the page size - although that would be a follow up enhancement.
> >>>>>>>>>
> >>>>>>>>> And then there is performance. Since PAGE_SIZE and friends are no longer
> >>>>>>>>> compile-time constants, we must look up their values and do arithmetic at
> >>>>>>>>> runtime instead of compile-time. My early perf testing suggests this is
> >>>>>>>>> imperceptible for real-world workloads, and only has a small impact on
> >>>>>>>>> microbenchmarks - more on this below.
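
To make the runtime-arithmetic cost quoted above concrete, here is a minimal user-space sketch. It is only an illustration, not code from the series: `boot_page_shift`, `set_page_shift()`, `page_size()`, `page_mask()`, `page_align()` and `pages_for()` are hypothetical names chosen for this example. With a compile-time page size the compiler can fold every one of these expressions into a constant; with a boot-time page size each use has to load the stored value first.

```c
#include <assert.h>

/* Hypothetical boot-time state: this sketch stores only the page shift. */
unsigned int boot_page_shift = 12;   /* 12 -> 4K, 14 -> 16K, 16 -> 64K */

/* Record the page size chosen at boot (hypothetical helper). */
void set_page_shift(unsigned int shift)
{
    /* Only the three arm64 base page sizes are accepted here. */
    assert(shift == 12 || shift == 14 || shift == 16);
    boot_page_shift = shift;
}

/* Each helper loads boot_page_shift and computes its value at runtime. */
unsigned long page_size(void)
{
    return 1UL << boot_page_shift;
}

unsigned long page_mask(void)
{
    return ~(page_size() - 1);
}

/* Round addr up to the next page boundary. */
unsigned long page_align(unsigned long addr)
{
    return (addr + page_size() - 1) & page_mask();
}

/* Number of pages needed to hold len bytes. */
unsigned long pages_for(unsigned long len)
{
    return (len + page_size() - 1) >> boot_page_shift;
}
```

The extra load and shift per use is the kind of overhead that could plausibly show up in the text-size and microbenchmark numbers further down, although the sketch says nothing about how large that effect is.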
> >>>>>>>>>
> >>>>>>>>> Approach
> >>>>>>>>> ========
> >>>>>>>>>
> >>>>>>>>> The basic idea is to rid the source of any assumptions that PAGE_SIZE and
> >>>>>>>>> friends are compile-time constant, but in a way that allows the compiler to
> >>>>>>>>> perform the same optimizations as were previously being done if they do turn
> >>>>>>>>> out to be compile-time constant. Where constants are required, we use
> >>>>>>>>> limits; PAGE_SIZE_MIN and PAGE_SIZE_MAX. See commit log in patch 1 for full
> >>>>>>>>> description of all the classes of problems to solve.
> >>>>>>>>>
> >>>>>>>>> By default PAGE_SIZE_MIN=PAGE_SIZE_MAX=PAGE_SIZE. But an arch may opt-in to
> >>>>>>>>> boot-time page size selection by defining PAGE_SIZE_MIN & PAGE_SIZE_MAX.
> >>>>>>>>> arm64 does this if the user selects the CONFIG_ARM64_BOOT_TIME_PAGE_SIZE
> >>>>>>>>> Kconfig, which is an alternative to selecting a compile-time page size.
> >>>>>>>>>
> >>>>>>>>> When boot-time page size is active, the arch pgtable geometry macro
> >>>>>>>>> definitions resolve to something that can be configured at boot. The arm64
> >>>>>>>>> implementation in this series mainly uses global, __ro_after_init
> >>>>>>>>> variables. I've tried using alternatives patching, but that performs worse
> >>>>>>>>> than loading from memory; I think due to code size bloat.
> >>>>>>>>>
> >>>>>>>>> Status
> >>>>>>>>> ======
> >>>>>>>>>
> >>>>>>>>> When CONFIG_ARM64_BOOT_TIME_PAGE_SIZE is selected, I've only implemented
> >>>>>>>>> enough to compile the kernel image itself with defconfig (and a few other
> >>>>>>>>> bits and pieces). This is enough to build a kernel that can boot under QEMU
> >>>>>>>>> or FVP. I'll happily do the rest of the work to enable all the extra
> >>>>>>>>> drivers, but wanted to get feedback on the shape of this effort first. If
> >>>>>>>>> anyone wants to do any testing, and has a must-have config, let me know and
> >>>>>>>>> I'll prioritize enabling it first.
> >>>>>>>>>
> >>>>>>>>> The series is arranged as follows:
> >>>>>>>>>
> >>>>>>>>>   - patch 1:       Add macros required for converting non-arch code to support
> >>>>>>>>>                    boot-time page size selection
> >>>>>>>>>   - patches 2-36:  Remove PAGE_SIZE compile-time constant assumption from
> >>>>>>>>>                    all non-arch code
> >>>>>>>>>   - patches 37-38: Some arm64 tidy ups
> >>>>>>>>>   - patch 39:      Add macros required for converting arm64 code to support
> >>>>>>>>>                    boot-time page size selection
> >>>>>>>>>   - patches 40-56: arm64 changes to support boot-time page size selection
> >>>>>>>>>   - patch 57:      Add arm64 Kconfig option to enable boot-time page size
> >>>>>>>>>                    selection
> >>>>>>>>>
> >>>>>>>>> Ideally, I'd like to get the basics merged (something like this series),
> >>>>>>>>> then incrementally improve it over a handful of kernel releases until we
> >>>>>>>>> can demonstrate that we have feature parity with the compile-time build and
> >>>>>>>>> no performance blockers. Once at that point, ideally the compile-time build
> >>>>>>>>> options would be removed and the code could be cleaned up further.
> >>>>>>>>>
> >>>>>>>>> One of the bigger pieces that I'd propose to add as a follow up is to make
> >>>>>>>>> va-size boot-time selectable too. That will greatly simplify LPA2 fallback
> >>>>>>>>> handling.
> >>>>>>>>>
> >>>>>>>>> Assuming people are amenable to the rough shape, how would I go about
> >>>>>>>>> getting the non-arch changes merged? Since they cover many subsystems, will
> >>>>>>>>> each piece need to go independently to each relevant maintainer or could it
> >>>>>>>>> all be merged together through the arm64 tree?
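
The "limits" idea from the Approach section quoted above can be sketched in a few lines of plain C. This is a hedged illustration rather than the series' actual headers: only the names PAGE_SIZE, PAGE_SIZE_MIN and PAGE_SIZE_MAX come from the cover letter, while `boot_page_size`, `struct record`, `MAX_RECORDS_PER_PAGE`, `record_in_use` and `records_per_page()` are hypothetical. Wherever the language needs a constant (a static assertion, an array bound), the code uses a limit; everything else uses the runtime value.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical arch opt-in: PAGE_SIZE resolves to a variable that is set
 * once, early in boot, and the arch supplies compile-time limits.
 */
unsigned long boot_page_size = 4096;
#define PAGE_SIZE      boot_page_size
#define PAGE_SIZE_MIN  4096UL
#define PAGE_SIZE_MAX  65536UL

/* Generic default for arches that do not opt in: both limits are PAGE_SIZE. */
#ifndef PAGE_SIZE_MIN
#define PAGE_SIZE_MIN  PAGE_SIZE
#define PAGE_SIZE_MAX  PAGE_SIZE
#endif

struct record {
    unsigned char payload[64];
};

/* A compile-time check can no longer reference PAGE_SIZE, so use a limit. */
static_assert(sizeof(struct record) <= PAGE_SIZE_MIN,
              "record must fit in the smallest supported page");

/* Array bounds need a constant, so size for the largest supported page. */
#define MAX_RECORDS_PER_PAGE (PAGE_SIZE_MAX / sizeof(struct record))
unsigned char record_in_use[MAX_RECORDS_PER_PAGE];

/* Runtime arithmetic uses the page size that was actually selected. */
size_t records_per_page(void)
{
    return PAGE_SIZE / sizeof(struct record);
}
```

On an arch that does not opt in, both limits collapse to the constant PAGE_SIZE, so the same source compiles to what it did before; that is the property the cover letter relies on to keep the existing optimizations.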
> >>>>>>>>>
> >>>>>>>>> Image Size
> >>>>>>>>> ==========
> >>>>>>>>>
> >>>>>>>>> The below shows the size of a defconfig (+ xfs, squashfs, ftrace, kprobes)
> >>>>>>>>> kernel image on disk for base (before any changes applied), compile (with
> >>>>>>>>> changes, configured for compile-time page size) and boot (with changes,
> >>>>>>>>> configured for boot-time page size).
> >>>>>>>>>
> >>>>>>>>> You can see that the compile-16k and compile-64k configs are actually slightly
> >>>>>>>>> smaller than the baselines; that's due to optimizing some buffer sizes
> >>>>>>>>> which didn't need to depend on page size during the series. The boot-time
> >>>>>>>>> image is ~1% bigger than the 64k compile-time image. I believe there is
> >>>>>>>>> scope to improve this to make it equal to compile-64k if required:
> >>>>>>>>>
> >>>>>>>>> | config      | size/KB | diff/KB |  diff/% |
> >>>>>>>>> |-------------|---------|---------|---------|
> >>>>>>>>> | base-4k     |   54895 |       0 |    0.0% |
> >>>>>>>>> | base-16k    |   55161 |     266 |    0.5% |
> >>>>>>>>> | base-64k    |   56775 |    1880 |    3.4% |
> >>>>>>>>> | compile-4k  |   54895 |       0 |    0.0% |
> >>>>>>>>> | compile-16k |   55097 |     202 |    0.4% |
> >>>>>>>>> | compile-64k |   56391 |    1496 |    2.7% |
> >>>>>>>>> | boot-4K     |   57045 |    2150 |    3.9% |
> >>>>>>>>>
> >>>>>>>>> And below shows the size of the image in memory at run-time, separated for
> >>>>>>>>> text and data costs. The boot image has ~1% text cost; most likely due to
> >>>>>>>>> the fact that PAGE_SIZE and friends are not compile-time constants so need
> >>>>>>>>> instructions to load the values and do arithmetic. I believe we could
> >>>>>>>>> eventually get the data cost to match the cost for the compile image for
> >>>>>>>>> the chosen page size by freeing the ends of the static buffers not needed
> >>>>>>>>> for the selected page size:
> >>>>>>>>>
> >>>>>>>>> |             |    text |    text |    text |    data |    data |    data |
> >>>>>>>>> | config      | size/KB | diff/KB |  diff/% | size/KB | diff/KB |  diff/% |
> >>>>>>>>> |-------------|---------|---------|---------|---------|---------|---------|
> >>>>>>>>> | base-4k     |   20561 |       0 |    0.0% |   14314 |       0 |    0.0% |
> >>>>>>>>> | base-16k    |   20439 |    -122 |   -0.6% |   14625 |     311 |    2.2% |
> >>>>>>>>> | base-64k    |   20435 |    -126 |   -0.6% |   15673 |    1359 |    9.5% |
> >>>>>>>>> | compile-4k  |   20565 |       4 |    0.0% |   14315 |       1 |    0.0% |
> >>>>>>>>> | compile-16k |   20443 |    -118 |   -0.6% |   14517 |     204 |    1.4% |
> >>>>>>>>> | compile-64k |   20439 |    -122 |   -0.6% |   15134 |     820 |    5.7% |
> >>>>>>>>> | boot-4K     |   20811 |     250 |    1.2% |   15287 |     973 |    6.8% |
> >>>>>>>>>
> >>>>>>>>> Functional Testing
> >>>>>>>>> ==================
> >>>>>>>>>
> >>>>>>>>> I've build-tested defconfig for all arches supported by tuxmake (which is
> >>>>>>>>> most) without issue.
> >>>>>>>>>
> >>>>>>>>> I've boot-tested arm64 with CONFIG_ARM64_BOOT_TIME_PAGE_SIZE for all page
> >>>>>>>>> sizes and a few va-sizes, and additionally have run all the mm-selftests,
> >>>>>>>>> with no regressions observed vs the equivalent compile-time page size build
> >>>>>>>>> (although the mm-selftests have a few existing failures when run against
> >>>>>>>>> 16K and 64K kernels - those should really be investigated and fixed
> >>>>>>>>> independently).
> >>>>>>>>>
> >>>>>>>>> Test coverage is lacking for many of the drivers that I've touched, but in
> >>>>>>>>> many cases, I'm hoping the changes are simple enough that review might
> >>>>>>>>> suffice?
> >>>>>>>>> > >>>>>>>>> Performance Testing > >>>>>>>>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > >>>>>>>>> > >>>>>>>>> I've run some limited performance benchmarks: > >>>>>>>>> > >>>>>>>>> First, a real-world benchmark that causes a lot of page table m= anipulation > >>>>>>>>> (and therefore we would expect to see regression here if we are= going to > >>>>>>>>> see it anywhere); kernel compilation. It barely registers a cha= nge. Values > >>>>>>>>> are times, > >>>>>>>>> so smaller is better. All relative to base-4k: > >>>>>>>>> | | kern | kern | user | user | real= | real | > >>>>>>>>> | > >>>>>>>>> | config | mean | stdev | mean | stdev | mean= | stdev | > >>>>>>>>> | > >>>>>>>>> |-------------|---------|---------|---------|---------|--------= -|---------| > >>>>>>>>> | > >>>>>>>>> | base-4k | 0.0% | 1.1% | 0.0% | 0.3% | 0.0%= | 0.3% | > >>>>>>>>> | compile-4k | -0.2% | 1.1% | -0.2% | 0.3% | -0.1%= | 0.3% | > >>>>>>>>> | boot-4k | 0.1% | 1.0% | -0.3% | 0.2% | -0.2%= | 0.2% | > >>>>>>>>> > >>>>>>>>> The Speedometer JavaScript benchmark also shows no change. Valu= es are runs > >>>>>>>>> per > >>>>>>>>> min, so bigger is better. All relative to base-4k: > >>>>>>>>> | config | mean | stdev | > >>>>>>>>> | > >>>>>>>>> |-------------|---------|---------| > >>>>>>>>> | > >>>>>>>>> | base-4k | 0.0% | 0.8% | > >>>>>>>>> | compile-4k | 0.4% | 0.8% | > >>>>>>>>> | boot-4k | 0.0% | 0.9% | > >>>>>>>>> > >>>>>>>>> Finally, I've run some microbenchmarks known to stress page tab= le > >>>>>>>>> manipulations (originally from David Hildenbrand). The fork tes= t > >>>>>>>>> maps/allocs 1G of anon memory, then measures the cost of fork()= . The munmap > >>>>>>>>> test maps/allocs 1G of anon memory then measures the cost of mu= nmap()ing > >>>>>>>>> it. The fork test is known to be extremely sensitive to any cha= nges that > >>>>>>>>> cause instructions to be aligned differently in cachelines. 
Whe= n using this > >>>>>>>>> test for other changes, I've seen double digit regressions for = the > >>>>>>>>> slightest thing, so 12% regression on this test is actually fai= rly good. > >>>>>>>>> This likely represents the extreme worst case for regressions t= hat will be > >>>>>>>>> observed across other microbenchmarks (famous last > >>>>>>>>> words). Values are times, so smaller is better. All relative to= base-4k: > >>>>>>>>> | | fork | fork | munmap | munmap | > >>>>>>>>> | > >>>>>>>>> | config | mean | stdev | stdev | stdev | > >>>>>>>>> | > >>>>>>>>> |-------------|---------|---------|---------|---------| > >>>>>>>>> | > >>>>>>>>> | base-4k | 0.0% | 1.3% | 0.0% | 0.3% | > >>>>>>>>> | compile-4k | 0.1% | 1.3% | -0.9% | 0.1% | > >>>>>>>>> | boot-4k | 12.8% | 1.2% | 3.8% | 1.0% | > >>>>>>>>> > >>>>>>>>> NOTE: The series applies on top of v6.11. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Ryan > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Ryan Roberts (57): > >>>>>>>>> mm: Add macros ahead of supporting boot-time page size select= ion > >>>>>>>>> vmlinux: Align to PAGE_SIZE_MAX > >>>>>>>>> mm/memcontrol: Fix seq_buf size to save memory when PAGE_SIZE= is large > >>>>>>>>> mm/page_alloc: Make page_frag_cache boot-time page size compa= tible > >>>>>>>>> mm: Avoid split pmd ptl if pmd level is run-time folded > >>>>>>>>> mm: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> fs: Introduce MAX_BUF_PER_PAGE_SIZE_MAX for array sizing > >>>>>>>>> fs: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> fs/nfs: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> fs/ext4: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> fork: Permit boot-time THREAD_SIZE determination > >>>>>>>>> cgroup: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> bpf: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> pm/hibernate: Remove PAGE_SIZE compile-time constant assumpti= on > >>>>>>>>> stackdepot: Remove PAGE_SIZE compile-time 
constant assumption > >>>>>>>>> perf: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> kvm: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> trace: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> crash: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> crypto: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> sunrpc: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> sound: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> net: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> net: fec: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> net: marvell: Remove PAGE_SIZE compile-time constant assumpti= on > >>>>>>>>> net: hns3: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> net: e1000: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> net: igbvf: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> net: igb: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> drivers/base: Remove PAGE_SIZE compile-time constant assumpti= on > >>>>>>>>> edac: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> optee: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> random: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> sata_sil24: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> virtio: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> xen: Remove PAGE_SIZE compile-time constant assumption > >>>>>>>>> arm64: Fix macros to work in C code in addition to the linker= script > >>>>>>>>> arm64: Track early pgtable allocation limit > >>>>>>>>> arm64: Introduce macros required for boot-time page selection > >>>>>>>>> arm64: Refactor early pgtable size calculation macros > >>>>>>>>> arm64: Pass desired page size on command line > >>>>>>>>> arm64: Divorce early init from PAGE_SIZE > >>>>>>>>> arm64: Clean up simple cases of CONFIG_ARM64_*K_PAGES > >>>>>>>>> arm64: Align sections to 
PAGE_SIZE_MAX > >>>>>>>>> arm64: Rework trampoline rodata mapping > >>>>>>>>> arm64: Generalize fixmap for boot-time page size > >>>>>>>>> arm64: Statically allocate and align for worst-case page size > >>>>>>>>> arm64: Convert switch to if for non-const comparison values > >>>>>>>>> arm64: Convert BUILD_BUG_ON to VM_BUG_ON > >>>>>>>>> arm64: Remove PAGE_SZ asm-offset > >>>>>>>>> arm64: Introduce cpu features for page sizes > >>>>>>>>> arm64: Remove PAGE_SIZE from assembly code > >>>>>>>>> arm64: Runtime-fold pmd level > >>>>>>>>> arm64: Support runtime folding in idmap_kpti_install_ng_mappi= ngs > >>>>>>>>> arm64: TRAMP_VALIAS is no longer compile-time constant > >>>>>>>>> arm64: Determine THREAD_SIZE at boot-time > >>>>>>>>> arm64: Enable boot-time page size selection > >>>>>>>>> > >>>>>>>>> arch/alpha/include/asm/page.h | 1 + > >>>>>>>>> arch/arc/include/asm/page.h | 1 + > >>>>>>>>> arch/arm/include/asm/page.h | 1 + > >>>>>>>>> arch/arm64/Kconfig | 26 ++- > >>>>>>>>> arch/arm64/include/asm/assembler.h | 78 ++++++- > >>>>>>>>> arch/arm64/include/asm/cpufeature.h | 44 +++- > >>>>>>>>> arch/arm64/include/asm/efi.h | 2 +- > >>>>>>>>> arch/arm64/include/asm/fixmap.h | 28 ++- > >>>>>>>>> arch/arm64/include/asm/kernel-pgtable.h | 150 +++++++++-= --- > >>>>>>>>> arch/arm64/include/asm/kvm_arm.h | 21 +- > >>>>>>>>> arch/arm64/include/asm/kvm_hyp.h | 11 + > >>>>>>>>> arch/arm64/include/asm/kvm_pgtable.h | 6 +- > >>>>>>>>> arch/arm64/include/asm/memory.h | 62 ++++-- > >>>>>>>>> arch/arm64/include/asm/page-def.h | 3 +- > >>>>>>>>> arch/arm64/include/asm/pgalloc.h | 16 +- > >>>>>>>>> arch/arm64/include/asm/pgtable-geometry.h | 46 ++++ > >>>>>>>>> arch/arm64/include/asm/pgtable-hwdef.h | 28 ++- > >>>>>>>>> arch/arm64/include/asm/pgtable-prot.h | 2 +- > >>>>>>>>> arch/arm64/include/asm/pgtable.h | 133 +++++++++-= -- > >>>>>>>>> arch/arm64/include/asm/processor.h | 10 +- > >>>>>>>>> arch/arm64/include/asm/sections.h | 1 + > >>>>>>>>> arch/arm64/include/asm/smp.h | 1 + > 
>>>>>>>>> arch/arm64/include/asm/sparsemem.h | 15 +- > >>>>>>>>> arch/arm64/include/asm/sysreg.h | 54 +++-- > >>>>>>>>> arch/arm64/include/asm/tlb.h | 3 + > >>>>>>>>> arch/arm64/kernel/asm-offsets.c | 4 +- > >>>>>>>>> arch/arm64/kernel/cpufeature.c | 93 ++++++-- > >>>>>>>>> arch/arm64/kernel/efi.c | 2 +- > >>>>>>>>> arch/arm64/kernel/entry.S | 60 +++++- > >>>>>>>>> arch/arm64/kernel/head.S | 46 +++- > >>>>>>>>> arch/arm64/kernel/hibernate-asm.S | 6 +- > >>>>>>>>> arch/arm64/kernel/image-vars.h | 14 ++ > >>>>>>>>> arch/arm64/kernel/image.h | 4 + > >>>>>>>>> arch/arm64/kernel/pi/idreg-override.c | 68 +++++- > >>>>>>>>> arch/arm64/kernel/pi/map_kernel.c | 165 ++++++++++= ---- > >>>>>>>>> arch/arm64/kernel/pi/map_range.c | 201 ++++++++++= ++++++-- > >>>>>>>>> arch/arm64/kernel/pi/pi.h | 63 +++++- > >>>>>>>>> arch/arm64/kernel/relocate_kernel.S | 10 +- > >>>>>>>>> arch/arm64/kernel/vdso-wrap.S | 4 +- > >>>>>>>>> arch/arm64/kernel/vdso.c | 7 +- > >>>>>>>>> arch/arm64/kernel/vdso/vdso.lds.S | 4 +- > >>>>>>>>> arch/arm64/kernel/vdso32-wrap.S | 4 +- > >>>>>>>>> arch/arm64/kernel/vdso32/vdso.lds.S | 4 +- > >>>>>>>>> arch/arm64/kernel/vmlinux.lds.S | 48 +++-- > >>>>>>>>> arch/arm64/kvm/arm.c | 10 + > >>>>>>>>> arch/arm64/kvm/hyp/nvhe/Makefile | 1 + > >>>>>>>>> arch/arm64/kvm/hyp/nvhe/host.S | 10 +- > >>>>>>>>> arch/arm64/kvm/hyp/nvhe/hyp.lds.S | 4 +- > >>>>>>>>> arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c | 16 ++ > >>>>>>>>> arch/arm64/kvm/mmu.c | 39 ++-- > >>>>>>>>> arch/arm64/lib/clear_page.S | 7 +- > >>>>>>>>> arch/arm64/lib/copy_page.S | 33 ++- > >>>>>>>>> arch/arm64/lib/mte.S | 27 ++- > >>>>>>>>> arch/arm64/mm/Makefile | 1 + > >>>>>>>>> arch/arm64/mm/fixmap.c | 38 ++-- > >>>>>>>>> arch/arm64/mm/hugetlbpage.c | 40 +--- > >>>>>>>>> arch/arm64/mm/init.c | 26 +-- > >>>>>>>>> arch/arm64/mm/kasan_init.c | 8 +- > >>>>>>>>> arch/arm64/mm/mmu.c | 53 +++-- > >>>>>>>>> arch/arm64/mm/pgd.c | 12 +- > >>>>>>>>> arch/arm64/mm/pgtable-geometry.c | 24 +++ > >>>>>>>>> arch/arm64/mm/proc.S 
| 128 ++++++++---
> >>>>>>>>> arch/arm64/mm/ptdump.c                        |   3 +-
> >>>>>>>>> arch/arm64/tools/cpucaps                      |   3 +
> >>>>>>>>> arch/csky/include/asm/page.h                  |   3 +
> >>>>>>>>> arch/hexagon/include/asm/page.h               |   2 +
> >>>>>>>>> arch/loongarch/include/asm/page.h             |   2 +
> >>>>>>>>> arch/m68k/include/asm/page.h                  |   1 +
> >>>>>>>>> arch/microblaze/include/asm/page.h            |   1 +
> >>>>>>>>> arch/mips/include/asm/page.h                  |   1 +
> >>>>>>>>> arch/nios2/include/asm/page.h                 |   2 +
> >>>>>>>>> arch/openrisc/include/asm/page.h              |   1 +
> >>>>>>>>> arch/parisc/include/asm/page.h                |   1 +
> >>>>>>>>> arch/powerpc/include/asm/page.h               |   2 +
> >>>>>>>>> arch/riscv/include/asm/page.h                 |   1 +
> >>>>>>>>> arch/s390/include/asm/page.h                  |   1 +
> >>>>>>>>> arch/sh/include/asm/page.h                    |   1 +
> >>>>>>>>> arch/sparc/include/asm/page.h                 |   3 +
> >>>>>>>>> arch/um/include/asm/page.h                    |   2 +
> >>>>>>>>> arch/x86/include/asm/page_types.h             |   2 +
> >>>>>>>>> arch/xtensa/include/asm/page.h                |   1 +
> >>>>>>>>> crypto/lskcipher.c                            |   4 +-
> >>>>>>>>> drivers/ata/sata_sil24.c                      |  46 ++--
> >>>>>>>>> drivers/base/node.c                           |   6 +-
> >>>>>>>>> drivers/base/topology.c                       |  32 +--
> >>>>>>>>> drivers/block/virtio_blk.c                    |   2 +-
> >>>>>>>>> drivers/char/random.c                         |   4 +-
> >>>>>>>>> drivers/edac/edac_mc.h                        |  13 +-
> >>>>>>>>> drivers/firmware/efi/libstub/arm64.c          |   3 +-
> >>>>>>>>> drivers/irqchip/irq-gic-v3-its.c              |   2 +-
> >>>>>>>>> drivers/mtd/mtdswap.c                         |   4 +-
> >>>>>>>>> drivers/net/ethernet/freescale/fec.h          |   3 +-
> >>>>>>>>> drivers/net/ethernet/freescale/fec_main.c     |   5 +-
> >>>>>>>>> .../net/ethernet/hisilicon/hns3/hns3_enet.h   |   4 +-
> >>>>>>>>> drivers/net/ethernet/intel/e1000/e1000_main.c |   6 +-
> >>>>>>>>> drivers/net/ethernet/intel/igb/igb.h          |  25 +--
> >>>>>>>>> drivers/net/ethernet/intel/igb/igb_main.c     | 149 +++++++------
> >>>>>>>>> drivers/net/ethernet/intel/igbvf/netdev.c     |   6 +-
> >>>>>>>>> drivers/net/ethernet/marvell/mvneta.c         |   9 +-
> >>>>>>>>> drivers/net/ethernet/marvell/sky2.h           |   2 +-
> >>>>>>>>> drivers/tee/optee/call.c                      |   7 +-
> >>>>>>>>> drivers/tee/optee/smc_abi.c                   |   2 +-
> >>>>>>>>> drivers/virtio/virtio_balloon.c               |  10 +-
> >>>>>>>>> drivers/xen/balloon.c                         |  11 +-
> >>>>>>>>> drivers/xen/biomerge.c                        |  12 +-
> >>>>>>>>> drivers/xen/privcmd.c                         |   2 +-
> >>>>>>>>> drivers/xen/xenbus/xenbus_client.c            |   5 +-
> >>>>>>>>> drivers/xen/xlate_mmu.c                       |   6 +-
> >>>>>>>>> fs/binfmt_elf.c                               |  11 +-
> >>>>>>>>> fs/buffer.c                                   |   2 +-
> >>>>>>>>> fs/coredump.c                                 |   8 +-
> >>>>>>>>> fs/ext4/ext4.h                                |  36 ++--
> >>>>>>>>> fs/ext4/move_extent.c                         |   2 +-
> >>>>>>>>> fs/ext4/readpage.c                            |   2 +-
> >>>>>>>>> fs/fat/dir.c                                  |   4 +-
> >>>>>>>>> fs/fat/fatent.c                               |   4 +-
> >>>>>>>>> fs/nfs/nfs42proc.c                            |   2 +-
> >>>>>>>>> fs/nfs/nfs42xattr.c                           |   2 +-
> >>>>>>>>> fs/nfs/nfs4proc.c                             |   2 +-
> >>>>>>>>> include/asm-generic/pgtable-geometry.h        |  71 +++++++
> >>>>>>>>> include/asm-generic/vmlinux.lds.h             |  38 ++--
> >>>>>>>>> include/linux/buffer_head.h                   |   1 +
> >>>>>>>>> include/linux/cpumask.h                       |   5 +
> >>>>>>>>> include/linux/linkage.h                       |   4 +-
> >>>>>>>>> include/linux/mm.h                            |  17 +-
> >>>>>>>>> include/linux/mm_types.h                      |  15 +-
> >>>>>>>>> include/linux/mm_types_task.h                 |   2 +-
> >>>>>>>>> include/linux/mmzone.h                        |   3 +-
> >>>>>>>>> include/linux/netlink.h                       |   6 +-
> >>>>>>>>> include/linux/percpu-defs.h                   |   4 +-
> >>>>>>>>> include/linux/perf_event.h                    |   2 +-
> >>>>>>>>> include/linux/sched.h                         |   4 +-
> >>>>>>>>> include/linux/slab.h                          |   7 +-
> >>>>>>>>> include/linux/stackdepot.h                    |   6 +-
> >>>>>>>>> include/linux/sunrpc/svc.h                    |   8 +-
> >>>>>>>>> include/linux/sunrpc/svc_rdma.h               |   4 +-
> >>>>>>>>> include/linux/sunrpc/svcsock.h                |   2 +-
> >>>>>>>>> include/linux/swap.h                          |  17 +-
> >>>>>>>>> include/linux/swapops.h                       |   6 +-
> >>>>>>>>> include/linux/thread_info.h                   |  10 +-
> >>>>>>>>> include/xen/page.h                            |   2 +
> >>>>>>>>> init/main.c                                   |   7 +-
> >>>>>>>>> kernel/bpf/core.c                             |   9 +-
> >>>>>>>>> kernel/bpf/ringbuf.c                          |  54 ++---
> >>>>>>>>> kernel/cgroup/cgroup.c                        |   8 +-
> >>>>>>>>> kernel/crash_core.c                           |   2 +-
> >>>>>>>>> kernel/events/core.c                          |   2 +-
> >>>>>>>>> kernel/fork.c                                 |  71 +++----
> >>>>>>>>> kernel/power/power.h                          |   2 +-
> >>>>>>>>> kernel/power/snapshot.c                       |   2 +-
> >>>>>>>>> kernel/power/swap.c                           | 129 +++++++++--
> >>>>>>>>> kernel/trace/fgraph.c                         |   2 +-
> >>>>>>>>> kernel/trace/trace.c                          |   2 +-
> >>>>>>>>> lib/stackdepot.c                              |   6 +-
> >>>>>>>>> mm/kasan/report.c                             |   3 +-
> >>>>>>>>> mm/memcontrol.c                               |  11 +-
> >>>>>>>>> mm/memory.c                                   |   4 +-
> >>>>>>>>> mm/mmap.c                                     |   2 +-
> >>>>>>>>> mm/page-writeback.c                           |   2 +-
> >>>>>>>>> mm/page_alloc.c                               |  31 +--
> >>>>>>>>> mm/slub.c                                     |   2 +-
> >>>>>>>>> mm/sparse.c                                   |   2 +-
> >>>>>>>>> mm/swapfile.c                                 |   2 +-
> >>>>>>>>> mm/vmalloc.c                                  |   7 +-
> >>>>>>>>> net/9p/trans_virtio.c                         |   4 +-
> >>>>>>>>> net/core/hotdata.c                            |   4 +-
> >>>>>>>>> net/core/skbuff.c                             |   4 +-
> >>>>>>>>> net/core/sysctl_net_core.c                    |   2 +-
> >>>>>>>>> net/sunrpc/cache.c                            |   3 +-
> >>>>>>>>> net/unix/af_unix.c                            |   2 +-
> >>>>>>>>> sound/soc/soc-utils.c                         |   4 +-
> >>>>>>>>> virt/kvm/kvm_main.c                           |   2 +-
> >>>>>>>>> 172 files changed, 2185 insertions(+), 951 deletions(-)
> >>>>>>>>> create mode 100644 arch/arm64/include/asm/pgtable-geometry.h
> >>>>>>>>> create mode 100644 arch/arm64/kvm/hyp/nvhe/pgtable-geometry.c
> >>>>>>>>> create mode 100644 arch/arm64/mm/pgtable-geometry.c
> >>>>>>>>> create mode 100644 include/asm-generic/pgtable-geometry.h
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> 2.43.0
> >>>>>>>>
> >>>>>>>> This is a generally very exciting patch set! I'm looking forward to seeing it
> >>>>>>>> land so I can take advantage of it for Fedora ARM and Fedora Asahi Remix.
> >>>>>>>>
> >>>>>>>> That said, I have a couple of questions:
> >>>>>>>>
> >>>>>>>> * Going forward, how would we handle drivers/modules that require a particular
> >>>>>>>> page size? For example, the Apple Silicon IOMMU driver code requires the
> >>>>>>>> kernel to operate in 16k page size mode, and it would need to be disabled in
> >>>>>>>> other page sizes.
> >>>>>>>
> >>>>>>> I think these drivers would want to check PAGE_SIZE at probe time and fail if an
> >>>>>>> unsupported page size is in use.
> >>>>>>> Do you see any issue with that?
> >>>>>>>
> >>>>>>>>
> >>>>>>>> * How would we handle an invalid selection at boot?
> >>>>>>>
> >>>>>>> What do you mean by invalid here? The current policy validates that the
> >>>>>>> requested page size is supported by the HW by checking mmfr0. If no page size is
> >>>>>>> passed on the command line, or the passed value is not supported by the HW, then
> >>>>>>> we default to the largest page size supported by the HW (so for Apple
> >>>>>>> Silicon that would be 16k since the HW doesn't support 64k). Although I think it
> >>>>>>> may be better to change that policy to use the smallest page size in this case;
> >>>>>>> 4k is the safer bet for compat and will waste much less memory than 64k.
> >>>>>>>
> >>>>>>>> Can we program in a
> >>>>>>>> fallback when the "wrong" mode is selected for a chip or something similar?
> >>>>>>>
> >>>>>>> Do you mean effectively add a mechanism to force 16k if the detected HW is Apple
> >>>>>>> Silicon? The trouble is that we need to select the page size very early in
> >>>>>>> boot, before start_kernel() is called, so we really only have generic arch code
> >>>>>>> and the command line with which to make the decision.
> >>>>>>
> >>>>>> Yes... I think a build-time CONFIG for default page size, which can be
> >>>>>> overridden by a karg, makes sense... Even on platforms like Apple
> >>>>>> Silicon you may want to test very specific things in 4k by overriding
> >>>>>> with a karg.
> >>>>>
> >>>>> Ahh, yes, that would certainly work. I'll work it into the next version.
> >>>>>
> >>>>
> >>>> Could we maybe extend this to include a table of SoC IDs for which
> >>>> certain modes are disabled (e.g. 64k on Apple Silicon)
> >>>
> >>> 64k is already disabled on Apple Silicon because mmfr0 reports that 64k is not
> >>> supported.
> >>>
> >>>> and preferred modes when no arg is set (16k for Apple Silicon)? That
> >>>
> >>> And it's not obvious that we should hard-code a page size preference to a SoC
> >>> ID. If the CPU can support multiple page sizes, it should be up to the SW stack
> >>> to decide, not the SoC.
> >>>
> >>> I'm guessing your desire is to have a single kernel build that will boot 16k by
> >>> default on Apple Silicon and 4k by default on other systems, all without needing
> >>> to modify the command line? Personally I think it's cleaner to just require
> >>> setting the page size on the command line in these cases.
> >>>
> >>>> way it'd work something like this:
> >>>>
> >>>> 1. Table identification of 4/16/64 depending on identified SoC
> >>> So I'd prefer not to have this
> >>>
> >>>> 2. Unidentified ones follow build-time default
> >>>> 3. karg forces a mode regardless
> >>> But keep these 2.
> >>>
> >>
> > Since we are talking about Apple Silicon and page size, I would like to
> > add that on the Apple Silicon SoCs I am working on, the situation is like
> > this:
> >
> > Apple A7 (s5l8960x), A8 (T7000), A8X (T7001): CPU MMU supports 4K and 64K
> > page sizes.
> >
> > Apple A9 (s8000/s8003), A9X (s8001), A10 (t8010), A10X (t8011), A11 (t8015):
> > CPU MMU supports 16K and 64K page sizes.
> >
> > However, all of them have 4K page DART IOMMUs.
> >
> >> I think it makes sense to have it, because it's not just Apple Silicon
> >> where such a preference/requirement may be necessary. Apple Silicon
> >> technically works at 4k, but is completely broken at 4k because Linux
> >> cannot do 16k IOMMU with 4k everything else, so being able to at least
> >> prefer 16k out of the box is important. And SoCs like the NVIDIA Grace
> >> Hopper platform prefer 64k over other options (though I am unaware of
> >> a gross incompatibility that effectively requires it like Apple
> >> Silicon has).
> >>
> >> When we're trying to get to "single generic image that works
> >> everywhere", stuff like this matters and I would really like you to
> >> consider it from the lens of "we want things to work as automagic as
> >> they do on x86".
> > For me, in order to get to this level of automagic, there does need to be
> > a table of which SoC should use which page size.
>
> OK, but it's not clear to me that this table needs to be in the kernel. Could it
> not be something in user space (e.g. during installation) that configures the
> kernel command line?
>

This is not compatible with using things like ISOs with UEFI+ACPI
enabled desktop/server systems. We need to be able to safely,
automatically, and correctly boot up and support hardware. The only
place to do that early enough is in the kernel. But this can wait
until the core stuff is in.

> Regardless, the hard work here is getting the boot-time page size selection
> mechanism in place. Once that's there, follow up patches can add the desired
> policy. I'd rather leave it out for now to avoid anything slowing down the core
> work.
>

Sure, this can be done afterward.



-- 
真実はいつも一つ！/ Always, there's only one truth!