From: Kiryl Shutsemau <kas@kernel.org>
To: Matthew Wilcox <willy@infradead.org>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	x86@kernel.org,  linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	 David Hildenbrand <david@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	 Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	 "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Usama Arif <usama.arif@linux.dev>
Subject: Re: [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86
Date: Wed, 29 Apr 2026 16:26:36 +0100
Message-ID: <afIdgkYJGCi_cC5P@thinkstation>
In-Reply-To: <afIYFtL6KrBs38rT@casper.infradead.org>

On Wed, Apr 29, 2026 at 03:39:18PM +0100, Matthew Wilcox wrote:
> On Thu, Feb 19, 2026 at 03:08:51PM +0000, Kiryl Shutsemau wrote:
> > No, there's no new hardware (that I know of). I want to explore what page size
> > means.
> > 
> > The kernel uses the same value - PAGE_SIZE - for two things:
> > 
> >   - the order-0 buddy allocation size;
> > 
> >   - the granularity of virtual address space mapping;
> > 
> > I think we can benefit from separating these two meanings and allowing
> > order-0 allocations to be larger than the virtual address space covered by a
> > PTE entry.
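
To make the split concrete, this is roughly the shape I have in mind
(PAGE_ALLOC_SHIFT, PAGE_ALLOC_SIZE and PTES_PER_ALLOC are invented names
for illustration only, nothing like them exists today):

/*
 * Hypothetical sketch: decouple the two meanings of PAGE_SIZE.
 * PAGE_SHIFT/PAGE_SIZE keep their existing meaning: the granularity of
 * virtual address space mapping (one PTE, 4k on x86).
 */

/* New, made-up: the size of an order-0 buddy allocation. */
#define PAGE_ALLOC_SHIFT	16	/* e.g. 64k */
#define PAGE_ALLOC_SIZE		(1UL << PAGE_ALLOC_SHIFT)

/* PTEs needed to map a single order-0 allocation. */
#define PTES_PER_ALLOC		(PAGE_ALLOC_SIZE >> PAGE_SHIFT)
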
> 
> I actually want to go in the other direction.  I once came up with a
> name -- POTAM -- which stands for Power Of Two Allocator with Metadata.

... of the House Targaryen, the First of Her Name!

> The use case was something like XFS's buffer cache where we want a
> filesystem block size of data (so 0.5KiB to 64KiB) with some metadata
> attached (xfs_buf is 664 bytes with debugging enabled!)
> 
> I set this aside to work on folios, but folios offer a back door to
> unifying this with the buddy allocator.  It's a long road, but here's
> a sketch:
> 
> First, we separate memdescs from pages.  I believe this lets us shrink
> struct page down to 8 bytes (previously presented at various LSFMMs).
> 
> Second, we get rid of 'page' in things like sglist and bvec.  This is
> already in progress for various other good reasons.
> 
> Third (this bit is new), we replace memmap with something like a maple
> tree.  That lets us look up memdescs by physical address (typically
> a memdesc will contain either the physical or virtual address of the
> memory it controls).
> 
> Fourth, we change the unit of the lookup in the maple tree from being
> a PFN to being address / 512 (or whatever size we want to use as our
> minimum).
> 
> Now we can have memdescs for an arbitrary power of two which means we
> can ditch all the awful code from ppc/s390 page table handling where
> they try to share one memdesc between several different page tables.
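
If I follow steps three and four, the lookup would be something like
this rough sketch (struct memdesc's layout, memdesc_lookup() and
MEMDESC_UNIT are all made up; only DEFINE_MTREE() and mtree_load() are
the existing maple tree API):

#include <linux/types.h>
#include <linux/maple_tree.h>

#define MEMDESC_UNIT	512	/* minimum trackable unit, per step four */

struct memdesc {
	phys_addr_t	phys;	/* physical address of the memory it controls */
	unsigned int	order;	/* log2 of size, in MEMDESC_UNIT units */
	/* type-specific state (folio, slab, page table, ...) follows */
};

/* One tree replacing memmap, indexed by physical address / MEMDESC_UNIT. */
static DEFINE_MTREE(memdesc_tree);

static struct memdesc *memdesc_lookup(phys_addr_t phys)
{
	return mtree_load(&memdesc_tree, phys / MEMDESC_UNIT);
}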

I had a similar, but less ambitious, idea. Can we get this functionality
from slab?

Maybe a kind of kmem_cache that allows metadata to be attached to each
allocated object. It would be backed by two slabs: one for the actual
objects and one for the metadata, plus some glue that translates
object->metadata (not sure if the reverse is required). If both the
object and the metadata are power-of-2 sized it should be doable: a
pointer to the metadata in the slab page plus some math.
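
Roughly along these lines, just to show the shape of the glue
(kmem_cache_with_meta, object_to_meta() and the slab->meta field are all
invented; virt_to_slab() and slab_address() are the existing mm/slab.h
helpers):

/*
 * Sketch only -- nothing like this exists in the slab allocator today.
 * A cache backed by two kmem_caches, one for the objects and one for
 * their metadata, with object->metadata translation done by index math
 * (both sizes assumed power-of-two).
 */
struct kmem_cache_with_meta {
	struct kmem_cache *obj_cache;	/* the objects themselves */
	struct kmem_cache *meta_cache;	/* one metadata entry per object */
	unsigned int	   obj_shift;	/* log2(object size) */
	unsigned int	   meta_shift;	/* log2(metadata size) */
};

static void *object_to_meta(struct kmem_cache_with_meta *c, void *obj)
{
	struct slab *slab = virt_to_slab(obj);
	unsigned long idx;

	/* Index of the object within its slab, by power-of-2 math. */
	idx = ((unsigned long)obj - (unsigned long)slab_address(slab)) >>
		c->obj_shift;

	/* slab->meta is the invented per-slab pointer into the metadata slab. */
	return slab->meta + (idx << c->meta_shift);
}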

But I have not thought much about the idea yet.

Your idea is much bigger and I don't understand the implications yet. It
seems to redefine the basis of memory allocation in the kernel. Do we
still have a page allocator? Where does the page allocator end and slab
begin?

But it sounds fun to discuss next week!

> It's going to be "fun" avoiding allocation deadlocks where we want to
> rebalance the maple tree containing the memdescs ... that's a five year
> away problem.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Thread overview: 50+ messages
2026-02-19 15:08 [LSF/MM/BPF TOPIC] 64k (or 16k) base page size on x86 Kiryl Shutsemau
2026-02-19 15:17 ` Peter Zijlstra
2026-02-19 15:20   ` Peter Zijlstra
2026-02-19 15:27     ` Kiryl Shutsemau
2026-02-19 15:33 ` Pedro Falcato
2026-02-19 15:50   ` Kiryl Shutsemau
2026-02-19 15:53     ` David Hildenbrand (Arm)
2026-02-19 19:31       ` Pedro Falcato
2026-02-19 15:39 ` David Hildenbrand (Arm)
2026-02-19 15:54   ` Kiryl Shutsemau
2026-02-19 16:09     ` David Hildenbrand (Arm)
2026-02-20  2:55       ` Zi Yan
2026-02-19 17:09   ` Kiryl Shutsemau
2026-02-20 10:24     ` David Hildenbrand (Arm)
2026-02-20 12:07       ` Kiryl Shutsemau
2026-02-20 16:30         ` David Hildenbrand (Arm)
2026-02-20 19:33           ` Kalesh Singh
2026-02-23 11:04             ` David Hildenbrand (Arm)
2026-02-23 11:13               ` Kiryl Shutsemau
2026-02-23 11:27                 ` David Hildenbrand (Arm)
2026-02-23 12:16                   ` Kiryl Shutsemau
2026-02-23 15:14                   ` Dave Hansen
2026-02-23 15:31                     ` David Hildenbrand (Arm)
2026-02-23 15:45                       ` Kiryl Shutsemau
2026-02-23 15:49                         ` David Hildenbrand (Arm)
2026-02-23 16:22                       ` Lorenzo Stoakes
2026-02-23 16:34                     ` David Laight
2026-02-19 23:24   ` Kalesh Singh
2026-02-20 12:10     ` Kiryl Shutsemau
2026-02-20 19:21       ` Kalesh Singh
2026-02-19 17:08 ` Dave Hansen
2026-02-19 22:05   ` Kiryl Shutsemau
2026-02-20  3:28     ` Liam R. Howlett
2026-02-20 12:33       ` Kiryl Shutsemau
2026-02-20 15:17         ` Liam R. Howlett
2026-02-20 15:50           ` Kiryl Shutsemau
2026-02-19 17:30 ` Dave Hansen
2026-02-19 22:14   ` Kiryl Shutsemau
2026-02-19 22:21     ` Dave Hansen
2026-02-19 17:47 ` Matthew Wilcox
2026-02-19 22:26   ` Kiryl Shutsemau
2026-02-20  9:04 ` David Laight
2026-02-20 12:12   ` Kiryl Shutsemau
2026-04-29 14:39 ` Matthew Wilcox
2026-04-29 15:26   ` Kiryl Shutsemau [this message]
2026-05-01 18:05   ` David Hildenbrand (Arm)
2026-05-01 18:00 ` Kiryl Shutsemau
2026-05-01 18:02   ` David Hildenbrand (Arm)
2026-05-01 18:12     ` Kiryl Shutsemau
2026-05-01 18:31       ` David Hildenbrand (Arm)
