From: Uladzislau Rezki <urezki@gmail.com>
To: Brendan Jackman <jackmanb@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	peterz@infradead.org, bp@alien8.de, dave.hansen@linux.intel.com,
	mingo@redhat.com, tglx@linutronix.de, akpm@linux-foundation.org,
	david@redhat.com, derkling@google.com, junaids@google.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	reijiw@google.com, rientjes@google.com, rppt@kernel.org,
	vbabka@suse.cz, x86@kernel.org, yosry.ahmed@linux.dev,
	Matthew Wilcox <willy@infradead.org>,
	Liam Howlett <liam.howlett@oracle.com>,
	"Kirill A. Shutemov" <kas@kernel.org>,
	Harry Yoo <harry.yoo@oracle.com>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>,
	Andy Lutomirski <luto@kernel.org>,
	Josh Poimboeuf <jpoimboe@kernel.org>, Kees Cook <kees@kernel.org>
Subject: Re: [Discuss] First steps for ASI (ASI is fast again)
Date: Fri, 22 Aug 2025 18:56:34 +0200
Message-ID: <aKihQv8fWzZIgnAW@pc636>
In-Reply-To: <DC83J9RSZZ0E.3VKGEVIDMSA2R@google.com>

On Thu, Aug 21, 2025 at 12:15:04PM +0000, Brendan Jackman wrote:
> On Thu Aug 21, 2025 at 8:55 AM UTC, Lorenzo Stoakes wrote:
> > +cc Matthew for page cache side
> > +cc Other memory mapping folks for mapping side
> > +cc various x86 folks for x86 side
> > +cc Kees for security side of things
> >
> > On Tue, Aug 12, 2025 at 05:31:09PM +0000, Brendan Jackman wrote:
> >> .:: Intro
> >>
> >> Following up to the plan I posted at [0], I've now prepared an up-to-date ASI
> >> branch that demonstrates a technique for solving the page cache performance
> >> devastation I described in [1]. The branch is at [5].
> >
> > Have looked through your branch at [5], note that the exit_mmap() code is
> > changing very soon see [ljs0]. Also with regard to PGD syncing, Harry introduced
> > a hotfix series recently to address issues around this generalising this PGD
> > sync code which may be usefully relevant to your series.
> >
> > [ljs0]:https://lore.kernel.org/linux-mm/20250815191031.3769540-1-Liam.Howlett@oracle.com/
> > [ljs1]:https://lore.kernel.org/linux-mm/20250818020206.4517-1-harry.yoo@oracle.com/
> 
> Thanks, this is useful info.
> 
> >>
> >> The goal of this prototype is to increase confidence that ASI is viable as a
> >> broad solution for CPU vulnerabilities. (If the community still has to develop
> >> and maintain new mitigations for every individual vuln, because ASI only works
> >> for certain use-cases, then ASI isn't super attractive given its complexity
> >> burden).
> >>
> >> The biggest gap for establishing that confidence was that Google's deployment
> >> still only uses ASI for KVM workloads, not bare-metal processes. And indeed the
> >> page cache turned out to be a massive issue that Google just hasn't run up
> >> against yet internally.
> >>
> >> .:: The "ephmap"
> >>
> >> I won't re-hash the details of the problem here (see [1]) but in short: file
> >> pages aren't mapped into the physmap as seen from ASI's restricted address space.
> >> This causes a major overhead when e.g. read()ing files. The solution we've
> >> always envisaged (and which I very hastily tried to describe at LSF/MM/BPF this
> >> year) was to simply stop read() etc from touching the physmap.
> >>
> >> This is achieved in this prototype by a mechanism that I've called the "ephmap".
> >> The ephmap is a special region of the kernel address space that is local to the
> >> mm (much like the "proclocal" idea from 2019 [2]). Users of the ephmap API can
> >> allocate a subregion of this, and provide pages that get mapped into their
> >> subregion. These subregions are CPU-local. This means that it's cheap to tear
> >> these mappings down, so they can be removed immediately after use (eph =
> >> "ephemeral"), eliminating the need for complex/costly tracking data structures.
> >
> > OK I had a bunch of questions here but looked at the code :)
> >
> > So the idea is we have a per-CPU buffer that is equal to the size of the largest
> > possible folio, for each process.
> >
> > I wonder by the way if we can cache page tables rather than alloc on bring
> > up/tear down? Or just zap? That could help things.
> 
> Yeah if I'm catching your gist correctly, we have done a bit of this in
> the Google-internal version. In cases where it's fine to fail to map
> stuff (as is the case for ephmap users in this branch) you can just have
> a little pool of pre-allocated pagetables that you can allocate from in
> arbitrary contexts. Maybe the ALLOC_TRYLOCK stuff could also be useful
> here, I haven't explored that.
> 
> >>
> >> (You might notice the ephmap is extremely similar to kmap_local_page() - see the
> >> commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).
> >
> > I do wonder if we need to have a separate kmap thing or whether we can just
> > adjust what already exists?
> 
> Yeah, I also wondered this. I think we could potentially just change the
> semantics of kmap_local_page() to suit ASI's needs, but I'm not really
> clear if that's consistent with the design or if there are perf
> concerns regarding its existing usecase. I am hoping once we start to
> get the more basic ASI stuff in, this will be a topic that will interest
> the right people, and I'll be able to get some useful input...
> 
> > Presumably we will restrict ASI support to 64-bit kernels only (starting with
> > and perhaps only for x86-64), so we can avoid the highmem bs.
> 
> Yep.
> 
> >>
> >> The ephmap can then be used for accessing file pages. It's also a generic
> >> mechanism for accessing sensitive data, for example it could be used for
> >> zeroing sensitive pages, or if necessary for copy-on-write of user pages.
> >>
> >> .:: State of the branch
> >>
> >> The branch contains:
> >>
> >> - A rebased version of my "ASI integration for the page allocator" RFC [3]. (Up
> >>   to "mm/page_alloc: Add support for ASI-unmapping pages")
> >> - The rest of ASI's basic functionality (up to "mm: asi: Stop ignoring asi=on
> >>   cmdline flag")
> >> - Some test and observability conveniences (up to "mm: asi: Add a tracepoint for
> >>   ASI page faults")
> >> - A prototype of the new performance improvements (the remainder of the
> >>   branch).
> >>
> >> There's a gradient of quality where the earlier patches are closer to "complete"
> >> and the later ones are increasingly messy and hacky. Comments and commit message
> >> describe lots of the hacky elements but the most important things are:
> >>
> >> 1. The logic to take advantage of the ephmap is stuck directly into mm/shmem.c.
> >>    This is just a shortcut to make its behaviour obvious. Since tmpfs is the
> >>    most extreme case of the read/write slowdown this should give us some idea of
> >>    the performance improvements but it obviously hides a lot of important
> >>    complexity wrt how this would be integrated "for real".
> >
> > Right, at what level do you plan to put the 'real' stuff?
> >
> > generic_file_read_iter() + equivalent or something like this? But then you'd
> > miss some fs obv., so I guess filemap_read()?
> 
> Yeah, just putting it into this generic stuff seemed like the most
> obvious way, but I was also hoping there could be some more general way
> to integrate it into the page cache or even something like the iov
> system. I did not see anything like this yet, but I don't think I've
> done the full quota of code-gazing that I'd need to come up with the
> best idea here. (Also maybe the solution becomes obvious if I can find
> the right pair of eyes).
> 
> Anyway, my hope is that the number of filesystems that are both a) very
> special implementation-wise and b) dear to the hearts of
> performance-sensitive users is quite small, so maybe just injecting into
> the right pre-existing filemap.c helpers, plus one or two
> filesystem-specific additions, already gets us almost all the way there.
> 
> >>
> >> 2. The ephmap implementation is extremely stupid. It only works for the simple
> >>    shmem usecase. I don't think this is really important though, whatever we end
> >>    up with needs to be very simple, and it's not even clear that we actually
> >>    want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
> >>    kmap_local_page() itself).
> >
> > Right, just testing stuff out, fair enough. Obviously not an upstreamable thing
> > but sort of a test case, right?
> 
> Yeah exactly. 
> 
> Maybe worth adding here that I explored just using vmalloc's allocator
> for this. My experience was that despite looking quite nicely optimised
> re avoiding synchronisation, just the simple fact of traversing its data
> structures is too slow for this usecase (at least, it did poorly on my
> super-sensitive FIO benchmark setup).
> 
Could you please elaborate here? Which test case is it, and what is the
problem with it?

You can fragment the main KVA space, where we use an rb-tree to manage
free blocks. But the question is how important this use case and
workload are for you.

Thank you!

--
Uladzislau Rezki

