Date: Wed, 29 Apr 2026 16:26:36 +0100
From: Kiryl Shutsemau
To: Matthew Wilcox
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
    x86@kernel.org, linux-kernel@vger.kernel.org, Andrew Morton,
    David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
    Dave Hansen, Lorenzo Stoakes, "Liam R. Howlett", Mike Rapoport,
    Johannes Weiner, Usama Arif
Subject: Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86

On Wed, Apr 29, 2026 at 03:39:18PM +0100, Matthew Wilcox wrote:
> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what
> > page size means.
> > 
> > The kernel uses the same value - PAGE_SIZE - for two things:
> > 
> > - the order-0 buddy allocation size;
> > 
> > - the granularity of virtual address space mapping;
> > 
> > I think we can benefit from separating these two meanings and
> > allowing order-0 allocations to be larger than the virtual address
> > space covered by a PTE entry.
> 
> I actually want to go in the other direction. I once came up with a
> name -- POTAM -- which stands for Power Of Two Allocator with Metadata.

... of the House Targaryen, the First of Her Name!

> The use case was something like XFS's buffer cache, where we want a
> filesystem block size of data (so 0.5KiB to 64KiB) with some metadata
> attached (xfs_buf is 664 bytes with debugging enabled!)
> 
> I set this aside to work on folios, but folios offer a back door to
> unifying this with the buddy allocator. It's a long road, but here's
> a sketch:
> 
> First, we separate memdescs from pages. I believe this lets us shrink
> struct page down to 8 bytes (previously presented at various LSFMMs).
> 
> Second, we get rid of 'page' in things like sglist and bvec. This is
> already in progress for various other good reasons.
> 
> Third (this bit is new), we replace memmap with something like a maple
> tree. That lets us look up memdescs by physical address (typically a
> memdesc will contain either the physical or virtual address of the
> memory it controls).
> 
> Fourth, we change the unit of the lookup in the maple tree from being
> a PFN to being address / 512 (or whatever size we want to use as our
> minimum).
> 
> Now we can have memdescs for an arbitrary power of two, which means we
> can ditch all the awful code from ppc/s390 page table handling where
> they try to share one memdesc between several different page tables.

I had a similar, but less ambitious, idea.

Can we get this functionality from slab? Maybe a kind of kmem_cache
that allows attaching metadata to each allocated object. It would be
backed by two slabs: one for the actual objects and one for the
metadata, plus some glue that translates object->metadata (not sure if
the reverse is required).

If both the object and metadata sizes are power-of-2, it should be
doable: a pointer to the metadata in the slab page plus some math. But
I have not thought much about the idea yet.
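To make the object->metadata math concrete, here is a toy userspace
model of what I mean. Every name in it is invented for illustration
(struct paired_slab, obj_to_meta(), and so on; none of this is real
slab internals), and it only demonstrates the power-of-2 pointer
arithmetic:

/*
 * Toy model of the paired-slab idea: one power-of-2-aligned block of
 * objects, one parallel array of metadata, and pure pointer math to
 * translate object -> metadata.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SLAB_SIZE	4096u		/* size and alignment of the object slab */
#define OBJ_SHIFT	6		/* 64-byte objects (power of 2) */
#define OBJ_SIZE	(1u << OBJ_SHIFT)
#define NR_OBJS		(SLAB_SIZE / OBJ_SIZE)

struct meta {				/* per-object metadata */
	unsigned long cookie;
};

struct paired_slab {
	void *objs;			/* SLAB_SIZE-aligned object memory */
	struct meta *meta;		/* NR_OBJS metadata entries */
};

/*
 * object -> metadata: because the slab is SLAB_SIZE-aligned and the
 * object size is a power of 2, masking gives the offset within the
 * slab and a shift gives the object index into the metadata array.
 */
static struct meta *obj_to_meta(struct paired_slab *s, void *obj)
{
	uintptr_t off = (uintptr_t)obj & (SLAB_SIZE - 1);

	return &s->meta[off >> OBJ_SHIFT];
}

int main(void)
{
	struct paired_slab s;

	s.objs = aligned_alloc(SLAB_SIZE, SLAB_SIZE);
	s.meta = calloc(NR_OBJS, sizeof(*s.meta));
	if (!s.objs || !s.meta)
		return 1;

	/* pretend we allocated object #3 from the slab */
	void *obj = (char *)s.objs + 3 * OBJ_SIZE;

	obj_to_meta(&s, obj)->cookie = 42;

	printf("object %p -> meta index %td, cookie %lu\n",
	       obj, obj_to_meta(&s, obj) - s.meta,
	       obj_to_meta(&s, obj)->cookie);

	free(s.objs);
	free(s.meta);
	return 0;
}

In a real implementation the metadata base pointer would presumably
live in the slab's own bookkeeping rather than being passed around by
hand; the toy just makes the math visible.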
Your idea is much bigger, and I don't understand the implications yet.
It seems to redefine the basis of memory allocation in the kernel. Do
we still have a page allocator? Where does the page allocator end and
slab begin?

But it sounds fun to discuss next week!

> It's going to be "fun" avoiding allocation deadlocks where we want to
> rebalance the maple tree containing the memdescs ... that's a five
> year away problem.

-- 
Kiryl Shutsemau / Kirill A. Shutemov