[Discuss] First steps for ASI (ASI is fast again)

* [Discuss] First steps for ASI (ASI is fast again)
@ 2025-08-12 17:31 Brendan Jackman
  2025-08-19 18:03 ` Brendan Jackman
  2025-08-21  8:55 ` Lorenzo Stoakes
  0 siblings, 2 replies; 9+ messages in thread
From: Brendan Jackman @ 2025-08-12 17:31 UTC
  To: jackmanb, peterz, bp, dave.hansen, mingo, tglx
  Cc: akpm, david, derkling, junaids, linux-kernel, linux-mm, reijiw,
	rientjes, rppt, vbabka, x86, yosry.ahmed

.:: Intro

Following up to the plan I posted at [0], I've now prepared an up-to-date ASI
branch that demonstrates a technique for solving the page cache performance
devastation I described in [1]. The branch is at [5].

The goal of this prototype is to increase confidence that ASI is viable as a
broad solution for CPU vulnerabilities. (If the community still has to develop
and maintain new mitigations for every individual vuln, because ASI only works
for certain use-cases, then ASI isn't super attractive given its complexity
burden).

The biggest gap for establishing that confidence was that Google's deployment
still only uses ASI for KVM workloads, not bare-metal processes. And indeed the
page cache turned out to be a massive issue that Google just hasn't run up
against yet internally.

.:: The "ephmap"

I won't re-hash the details of the problem here (see [1]) but in short: file
pages aren't mapped into the physmap as seen from ASI's restricted address space.
This causes a major overhead when e.g. read()ing files. The solution we've
always envisaged (and which I very hastily tried to describe at LSF/MM/BPF this
year) was to simply stop read() etc from touching the physmap.

This is achieved in this prototype by a mechanism that I've called the "ephmap".
The ephmap is a special region of the kernel address space that is local to the
mm (much like the "proclocal" idea from 2019 [2]). Users of the ephmap API can
allocate a subregion of this, and provide pages that get mapped into their
subregion. These subregions are CPU-local. This means that it's cheap to tear
these mappings down, so they can be removed immediately after use (eph =
"ephemeral"), eliminating the need for complex/costly tracking data structures.

(You might notice the ephmap is extremely similar to kmap_local_page() - see the
commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).

The ephmap can then be used for accessing file pages. It's also a generic
mechanism for accessing sensitive data, for example it could be used for
zeroing sensitive pages, or if necessary for copy-on-write of user pages.
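
As a rough illustration of the lifecycle described in this section (map into a
CPU-local subregion, use, unmap immediately), here is a toy userspace model in
Python. It is not the code from the branch: the EphMap class, its method names,
and read_file_page() are hypothetical, and page tables and TLB behaviour are
not modelled at all.

```python
class EphMap:
    """Toy model of an mm-local region with one subregion per CPU.

    Hypothetical names throughout; this models the lifecycle only,
    not page tables or TLB behaviour.
    """

    def __init__(self, nr_cpus):
        # One slot per CPU: None when unused, else the tuple of mapped pages.
        self._slots = [None] * nr_cpus

    def map(self, cpu, pages):
        # Only the owning CPU touches its slot, so this model needs no lock
        # and no shared tracking structure.
        if self._slots[cpu] is not None:
            raise RuntimeError("subregion for this CPU is already in use")
        self._slots[cpu] = tuple(pages)

    def read(self, cpu, index):
        pages = self._slots[cpu]
        if pages is None:
            raise RuntimeError("nothing mapped on this CPU")
        return pages[index]

    def unmap(self, cpu):
        # Teardown is a purely local operation: clear the slot and forget it.
        self._slots[cpu] = None


def read_file_page(ephmap, cpu, page):
    """Map a page, copy its contents out, and unmap straight away."""
    ephmap.map(cpu, [page])
    try:
        return bytes(ephmap.read(cpu, 0))
    finally:
        ephmap.unmap(cpu)
```

The point of the sketch is that nothing outlives the call: once
read_file_page() returns, the slot is empty again, so there is no mapping left
to track or reclaim later.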

.:: State of the branch

The branch contains:

- A rebased version of my "ASI integration for the page allocator" RFC [3]. (Up
  to "mm/page_alloc: Add support for ASI-unmapping pages")
- The rest of ASI's basic functionality (up to "mm: asi: Stop ignoring asi=on
  cmdline flag")
- Some test and observability conveniences (up to "mm: asi: Add a tracepoint for
  ASI page faults")
- A prototype of the new performance improvements (the remainder of the
  branch).

There's a gradient of quality where the earlier patches are closer to "complete"
and the later ones are increasingly messy and hacky. Comments and commit messages
describe lots of the hacky elements but the most important things are:

1. The logic to take advantage of the ephmap is stuck directly into mm/shmem.c.
   This is just a shortcut to make its behaviour obvious. Since tmpfs is the
   most extreme case of the read/write slowdown this should give us some idea of
   the performance improvements but it obviously hides a lot of important
   complexity wrt how this would be integrated "for real".

2. The ephmap implementation is extremely stupid. It only works for the simple
   shmem usecase. I don't think this is really important though, whatever we end
   up with needs to be very simple, and it's not even clear that we actually
   want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
   kmap_local_page() itself).

3. For software correctness, the ephmap only needs to be TLB-flushed on the
   local CPU. But for CPU vulnerability mitigation, flushes are needed on other
   CPUs too. I believe these flushes should only be needed very infrequently.
   "Add ephmap TLB flushes for mitigating CPU vulns" is an illustrative idea of
   how these flushes could be implemented, but it's a bit of a simplistic
   implementation. The commit message has some more details.

.:: Performance

This data was gathered using the scripts at [4]. This is running on a Sapphire
Rapids machine, but with setcpuid=retbleed. This introduces an IBPB in
asi_exit(), which dramatically amplifies the performance impact of ASI. We don't
know of any vulns that would necessitate this IBPB, so this is basically a weird
selectively-paranoid configuration of ASI. It doesn't really make sense from a
security perspective. A few years from now (once the security researchers have
had their fun) we'll know what's _really_ needed on this CPU, it's very unlikely
that it turns out to be exactly an IBPB like this, but it's reasonably likely to
be something with a vaguely similar performance overhead.

Native FIO randread IOPS on tmpfs (this is where the 70% perf degradation was):
+---------+---------+-----------+---------+-----------+---------------+
| variant | samples |      mean |     min |       max | delta mean    |
+---------+---------+-----------+---------+-----------+---------------+
| asi-off |      10 | 1,003,102 | 981,813 | 1,036,142 |               |
| asi-on  |      10 |   871,928 | 848,362 |   885,622 | -13.1%        |
+---------+---------+-----------+---------+-----------+---------------+

Native kernel compilation time:
+---------+---------+--------+--------+--------+-------------+
| variant | samples |   mean |    min |    max | delta mean  |
+---------+---------+--------+--------+--------+-------------+
| asi-off |       3 | 34.84s | 34.42s | 35.31s |             |
| asi-on  |       3 | 37.50s | 37.39s | 37.58s | 7.6%        |
+---------+---------+--------+--------+--------+-------------+

Kernel compilation in a guest VM:
+---------+---------+--------+--------+--------+-------------+
| variant | samples |   mean |    min |    max | delta mean  |
+---------+---------+--------+--------+--------+-------------+
| asi-off |       3 | 52.73s | 52.41s | 53.15s |             |
| asi-on  |       3 | 55.80s | 55.51s | 56.06s | 5.8%        |
+---------+---------+--------+--------+--------+-------------+
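
For reference, the "delta mean" column in the three tables above can be
recomputed from the reported means as below. This is only a sanity check on the
rounded figures, not the script from [4], and the function name is made up for
the example. Note the sign convention: IOPS is higher-is-better and time is
lower-is-better, so a negative delta for FIO and a positive delta for the
builds are both slowdowns.

```python
def delta_mean_percent(baseline_mean, variant_mean):
    """Relative change of the variant mean versus the baseline, in percent."""
    return (variant_mean - baseline_mean) / baseline_mean * 100.0


fio = delta_mean_percent(1_003_102, 871_928)   # IOPS: higher is better
native = delta_mean_percent(34.84, 37.50)      # seconds: lower is better
guest = delta_mean_percent(52.73, 55.80)       # seconds: lower is better

print(f"FIO randread: {fio:+.1f}%")
print(f"native build: {native:+.1f}%")
print(f"guest build:  {guest:+.1f}%")
```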

Despite my title these numbers are kinda disappointing to be honest, it's not
where I wanted to be by now, but it's still an order-of-magnitude better than
where we were for native FIO a few months ago. I believe almost all of this
remaining slowdown is due to unnecessary ASI exits, the key areas being:

- On every context_switch(). Google's internal implementation has fixed this (we
  only really need it when switching mms).

- Whenever zeroing sensitive pages from the allocator. This could potentially be
  solved with the ephmap but requires a bit of care to avoid opening CPU attack
  windows.

- In copy-on-write for user pages. The ephmap could also help here but the
  current implementation doesn't support it (it only allows one allocation at a
  time per context).
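
To give a feel for the first item, here is a toy count of ASI exits on one CPU
under two policies: exiting on every context switch versus exiting only when
the incoming task has a different mm. The function and the schedule format are
invented for this sketch, and it ignores kernel threads and everything else
that makes the real scheduler interesting.

```python
def count_asi_exits(schedule, exit_on_every_switch):
    """Count ASI exits for a sequence of (task, mm) pairs run on one CPU.

    Toy model: under the first policy an exit is charged for every context
    switch; under the second, only when the incoming task uses a different mm.
    """
    exits = 0
    for (prev_task, prev_mm), (next_task, next_mm) in zip(schedule, schedule[1:]):
        if prev_task == next_task:
            continue  # same task keeps running: no context switch
        if exit_on_every_switch or prev_mm != next_mm:
            exits += 1
    return exits


# Threads t1 and t2 share mm "A"; t3 belongs to a different process.
schedule = [("t1", "A"), ("t2", "A"), ("t1", "A"), ("t3", "B"), ("t1", "A")]

every_switch = count_asi_exits(schedule, exit_on_every_switch=True)
mm_switch_only = count_asi_exits(schedule, exit_on_every_switch=False)
```

In this made-up schedule the thread-to-thread switches within mm "A" account
for half of the exits, which is the kind of saving the mm-only policy is after.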

.:: Next steps

Here's where I'd like to go next:

1. Discuss here and get feedback from x86 folks. Dave H said we need "line of
   sight" to a version of ASI that's viable for sandboxing native workloads. I
   don't consider a 13% slowdown "viable" as-is, but I do think this shows we're
   out of the "but what about the page cache" black hole. It seems provably
   solvable now.

2. Once we have some x86 maintainers saying "yep, it looks like this can work
   and it's something we want", I can start turning my page_alloc RFC [3] into a
   proper patchset (or maybe multiple if I can find a way to break things down
   further).

Note what I'm NOT proposing is to carry on working on this branch until ASI is
as fast as I am claiming it eventually will be. I would like to avoid doing that
since I believe the biggest unknowns on that path are now solved, and it would
be more useful to start getting down to nuts and bolts, i.e. reviewing real,
PATCH-quality code and merging precursor stuff. I think this will lead to more
useful discussions about the overall design, since so far all my postings have
been so long and rarefied that it's been hard to really get a good conversation
going.

.:: Conclusion

So, x86 folks: Does this feel like "line of sight" to you? If not, what would
that look like, what experiments should I run?

---

[0] https://lore.kernel.org/lkml/DAJ0LUX8F2IW.Q95PTFBNMFOI@google.com/
[1] https://lore.kernel.org/linux-mm/20250129144320.2675822-1-jackmanb@google.com/
[2] https://lore.kernel.org/linux-mm/20190612170834.14855-1-mhillenb@amazon.de/
[3] https://lore.kernel.org/lkml/20250313-asi-page-alloc-v1-0-04972e046cea@google.com/
[4] https://github.com/bjackman/nixos-flake/commit/be42ba326f8a0854deb1d37143b5c70bf301c9db
[5] https://github.com/bjackman/linux/tree/asi/6.16

* Re: [Discuss] First steps for ASI (ASI is fast again)
  2025-08-12 17:31 [Discuss] First steps for ASI (ASI is fast again) Brendan Jackman
@ 2025-08-19 18:03 ` Brendan Jackman
  2025-08-21  8:55 ` Lorenzo Stoakes
  1 sibling, 0 replies; 9+ messages in thread
From: Brendan Jackman @ 2025-08-19 18:03 UTC
  To: Brendan Jackman, peterz, bp, dave.hansen, mingo, tglx
  Cc: akpm, david, derkling, junaids, linux-kernel, linux-mm, reijiw,
	rientjes, rppt, vbabka, x86, yosry.ahmed

On Tue Aug 12, 2025 at 5:31 PM UTC, Brendan Jackman wrote:
> .:: Performance

> Native FIO randread IOPS on tmpfs (this is where the 70% perf degradation was):
> +---------+---------+-----------+---------+-----------+---------------+
> | variant | samples |      mean |     min |       max | delta mean    |
> +---------+---------+-----------+---------+-----------+---------------+
> | asi-off |      10 | 1,003,102 | 981,813 | 1,036,142 |               |
> | asi-on  |      10 |   871,928 | 848,362 |   885,622 | -13.1%        |
> +---------+---------+-----------+---------+-----------+---------------+
>
> Native kernel compilation time:
> +---------+---------+--------+--------+--------+-------------+
> | variant | samples |   mean |    min |    max | delta mean  |
> +---------+---------+--------+--------+--------+-------------+
> | asi-off |       3 | 34.84s | 34.42s | 35.31s |             |
> | asi-on  |       3 | 37.50s | 37.39s | 37.58s | 7.6%        |
> +---------+---------+--------+--------+--------+-------------+
>
> Kernel compilation in a guest VM:
> +---------+---------+--------+--------+--------+-------------+
> | variant | samples |   mean |    min |    max | delta mean  |
> +---------+---------+--------+--------+--------+-------------+
> | asi-off |       3 | 52.73s | 52.41s | 53.15s |             |
> | asi-on  |       3 | 55.80s | 55.51s | 56.06s | 5.8%        |
> +---------+---------+--------+--------+--------+-------------+
>
> Despite my title these numbers are kinda disappointing to be honest, it's not
> where I wanted to be by now, but it's still an order-of-magnitude better than
> where we were for native FIO a few months ago. 

Some people have pointed out that I'm treating ASI pretty harshly: I'm
comparing mitigations=off vs ASI, while the "real" alternative to ASI is
whatever the kernel would do by default if we knew about the vulns on
this CPU.

We don't know about that so I can't do the exact comparison, but I can
at least repeat my compilation experiment on Skylake, without ASI,
comparing mitigations=off vs the default:

+-----------------+---------+--------+--------+--------+------------+
| variant         | samples |   mean |    min |    max | delta mean |
+-----------------+---------+--------+--------+--------+------------+
| baseline        |       6 | 54.15s | 53.94s | 54.36s |            |
| mitigations-off |       6 | 46.53s | 46.37s | 46.71s | -14.2%     |
+-----------------+---------+--------+--------+--------+------------+

So that's pretty comparable to my ASI results above.
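
To put the two sets of numbers on the same footing, the sketch below recomputes
the deltas from the rounded means in the tables, expressing each as a cost
relative to the "off" variant. The helper name is made up for the example.
Recomputing from the rounded means gives about -14.1% rather than the -14.2%
shown in the table, presumably a rounding difference. The machines differ
(Skylake vs Sapphire Rapids), so treat this as a rough sanity check only.

```python
def percent_change(reference, value):
    """Change of value relative to reference, in percent."""
    return (value - reference) / reference * 100.0


# Mean kernel compilation times reported in the tables, in seconds.
skylake_default = 54.15
skylake_mitigations_off = 46.53
spr_asi_off = 34.84
spr_asi_on = 37.50

# Table convention: mitigations-off measured against the default baseline.
table_delta = percent_change(skylake_default, skylake_mitigations_off)

# The same gap expressed the way the ASI tables are: cost relative to "off".
default_cost = percent_change(skylake_mitigations_off, skylake_default)
asi_cost = percent_change(spr_asi_off, spr_asi_on)

print(f"Skylake, mitigations-off vs default: {table_delta:+.1f}%")
print(f"Skylake, default vs mitigations-off: {default_cost:+.1f}%")
print(f"Sapphire Rapids, asi-on vs asi-off:  {asi_cost:+.1f}%")
```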

(I'd love to just run ASI on Skylake and show you those numbers and go
"look, it's faster than the default", but the implementation I've posted
doesn't actually secure a Skylake box, we'll need to add more flushes
and stuff. So that would be unfair in the other direction).

Anyway, I'm gonna crack on with preparing a [PATCH] series now...


* Re: [Discuss] First steps for ASI (ASI is fast again)
  2025-08-12 17:31 [Discuss] First steps for ASI (ASI is fast again) Brendan Jackman
  2025-08-19 18:03 ` Brendan Jackman
@ 2025-08-21  8:55 ` Lorenzo Stoakes
  2025-08-21 12:15   ` Brendan Jackman
  1 sibling, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21  8:55 UTC
  To: Brendan Jackman
  Cc: peterz, bp, dave.hansen, mingo, tglx, akpm, david, derkling,
	junaids, linux-kernel, linux-mm, reijiw, rientjes, rppt, vbabka,
	x86, yosry.ahmed, Matthew Wilcox, Liam Howlett,
	Kirill A. Shutemov, Harry Yoo, Jann Horn, Pedro Falcato,
	Andy Lutomirski, Josh Poimboeuf, Kees Cook

+cc Matthew for page cache side
+cc Other memory mapping folks for mapping side
+cc various x86 folks for x86 side
+cc Kees for security side of things

On Tue, Aug 12, 2025 at 05:31:09PM +0000, Brendan Jackman wrote:
> .:: Intro
>
> Following up to the plan I posted at [0], I've now prepared an up-to-date ASI
> branch that demonstrates a technique for solving the page cache performance
> devastation I described in [1]. The branch is at [5].

Have looked through your branch at [5]. Note that the exit_mmap() code is
changing very soon, see [ljs0]. Also, with regard to PGD syncing, Harry recently
introduced a hotfix series [ljs1] to address issues around this, generalising
the PGD sync code, which may be relevant to your series.

[ljs0]:https://lore.kernel.org/linux-mm/20250815191031.3769540-1-Liam.Howlett@oracle.com/
[ljs1]:https://lore.kernel.org/linux-mm/20250818020206.4517-1-harry.yoo@oracle.com/

>
> The goal of this prototype is to increase confidence that ASI is viable as a
> broad solution for CPU vulnerabilities. (If the community still has to develop
> and maintain new mitigations for every individual vuln, because ASI only works
> for certain use-cases, then ASI isn't super attractive given its complexity
> burden).
>
> The biggest gap for establishing that confidence was that Google's deployment
> still only uses ASI for KVM workloads, not bare-metal processes. And indeed the
> page cache turned out to be a massive issue that Google just hasn't run up
> against yet internally.
>
> .:: The "ephmap"
>
> I won't re-hash the details of the problem here (see [1]) but in short: file
> pages aren't mapped into the physmap as seen from ASI's restricted address space.
> This causes a major overhead when e.g. read()ing files. The solution we've
> always envisaged (and which I very hastily tried to describe at LSF/MM/BPF this
> year) was to simply stop read() etc from touching the physmap.
>
> This is achieved in this prototype by a mechanism that I've called the "ephmap".
> The ephmap is a special region of the kernel address space that is local to the
> mm (much like the "proclocal" idea from 2019 [2]). Users of the ephmap API can
> allocate a subregion of this, and provide pages that get mapped into their
> subregion. These subregions are CPU-local. This means that it's cheap to tear
> these mappings down, so they can be removed immediately after use (eph =
> "ephemeral"), eliminating the need for complex/costly tracking data structures.

OK I had a bunch of questions here but looked at the code :)

So the idea is we have a per-CPU buffer that is equal to the size of the largest
possible folio, for each process.

I wonder by the way if we can cache page tables rather than alloc on bring
up/tear down? Or just zap? That could help things.

>
> (You might notice the ephmap is extremely similar to kmap_local_page() - see the
> commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).

I do wonder if we need to have a separate kmap thing or whether we can just
adjust what already exists?

Presumably we will restrict ASI support to 64-bit kernels only (starting with
and perhaps only for x86-64), so we can avoid the highmem bs.

>
> The ephmap can then be used for accessing file pages. It's also a generic
> mechanism for accessing sensitive data, for example it could be used for
> zeroing sensitive pages, or if necessary for copy-on-write of user pages.
>
> .:: State of the branch
>
> The branch contains:
>
> - A rebased version of my "ASI integration for the page allocator" RFC [3]. (Up
>   to "mm/page_alloc: Add support for ASI-unmapping pages")
> - The rest of ASI's basic functionality (up to "mm: asi: Stop ignoring asi=on
>   cmdline flag")
> - Some test and observability conveniences (up to "mm: asi: Add a tracepoint for
>   ASI page faults")
> - A prototype of the new performance improvements (the remainder of the
>   branch).
>
> There's a gradient of quality where the earlier patches are closer to "complete"
> and the later ones are increasingly messy and hacky. Comments and commit message
> describe lots of the hacky elements but the most important things are:
>
> 1. The logic to take advantage of the ephmap is stuck directly into mm/shmem.c.
>    This is just a shortcut to make its behaviour obvious. Since tmpfs is the
>    most extreme case of the read/write slowdown this should give us some idea of
>    the performance improvements but it obviously hides a lot of important
>    complexity wrt how this would be integrated "for real".

Right, at what level do you plan to put the 'real' stuff?

generic_file_read_iter() + equivalent or something like this? But then you'd
miss some fs obv., so I guess filemap_read()?

>
> 2. The ephmap implementation is extremely stupid. It only works for the simple
>    shmem usecase. I don't think this is really important though, whatever we end
>    up with needs to be very simple, and it's not even clear that we actually
>    want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
>    kmap_local_page() itself).

Right, just testing stuff out, fair enough. Obviously not an upstreamable thing
but sort of a test case, right?

>
> 3. For software correctness, the ephmap only needs to be TLB-flushed on the
>    local CPU. But for CPU vulnerability mitigation, flushes are needed on other
>    CPUs too. I believe these flushes should only be needed very infrequently.
>    "Add ephmap TLB flushes for mitigating CPU vulns" is an illustrative idea of
>    how these flushes could be implemented, but it's a bit of a simplistic
>    implementation. The commit message has some more details.

Yeah, I am no security/x86 expert so you'll need insight from those with a
better understanding of both, but I think it's worth taking the time to have
this do the minimum possible that we can prove is necessary in any real-world
scenario.

It's good to start super conservative though.

>
> .:: Performance
>
> This data was gathered using the scripts at [4]. This is running on a Sapphire
> Rapids machine, but with setcpuid=retbleed. This introduces an IBPB in
> asi_exit(), which dramatically amplifies the performance impact of ASI. We don't
> know of any vulns that would necessitate this IBPB, so this is basically a weird
> selectively-paranoid configuration of ASI. It doesn't really make sense from a
> security perspective. A few years from now (once the security researchers have
> had their fun) we'll know what's _really_ needed on this CPU, it's very unlikely
> that it turns out to be exactly an IBPB like this, but it's reasonably likely to
> be something with a vaguely similar performance overhead.

I mean, this all sounds like you should drop this :)

What do the numbers look like without it?

>
> Native FIO randread IOPS on tmpfs (this is where the 70% perf degradation was):
> +---------+---------+-----------+---------+-----------+---------------+
> | variant | samples |      mean |     min |       max | delta mean    |
> +---------+---------+-----------+---------+-----------+---------------+
> | asi-off |      10 | 1,003,102 | 981,813 | 1,036,142 |               |
> | asi-on  |      10 |   871,928 | 848,362 |   885,622 | -13.1%        |
> +---------+---------+-----------+---------+-----------+---------------+
>
> Native kernel compilation time:
> +---------+---------+--------+--------+--------+-------------+
> | variant | samples |   mean |    min |    max | delta mean  |
> +---------+---------+--------+--------+--------+-------------+
> | asi-off |       3 | 34.84s | 34.42s | 35.31s |             |
> | asi-on  |       3 | 37.50s | 37.39s | 37.58s | 7.6%        |
> +---------+---------+--------+--------+--------+-------------+
>
> Kernel compilation in a guest VM:
> +---------+---------+--------+--------+--------+-------------+
> | variant | samples |   mean |    min |    max | delta mean  |
> +---------+---------+--------+--------+--------+-------------+
> | asi-off |       3 | 52.73s | 52.41s | 53.15s |             |
> | asi-on  |       3 | 55.80s | 55.51s | 56.06s | 5.8%        |
> +---------+---------+--------+--------+--------+-------------+

(tiny nit but I think the bottom two are meant to be negative or the first
positive :P)

>
> Despite my title these numbers are kinda disappointing to be honest, it's not
> where I wanted to be by now, but it's still an order-of-magnitude better than
> where we were for native FIO a few months ago. I believe almost all of this
> remaining slowdown is due to unnecessary ASI exits, the key areas being:

Nice, this broad approach does seem simple.

Obviously we really do need to see these numbers come down significantly for
this to be reasonably workable, as this kind of perf impact could really add up
at scale.

But from all you say it seems very plausible that we can in fact significantly
reduce this.

Am guessing the below are general issues that are holding back ASI as a whole
perf-wise?

>
> - On every context_switch(). Google's internal implementation has fixed this (we
>   only really need it when switching mms).

How did you guys fix this?

>
> - Whenever zeroing sensitive pages from the allocator. This could potentially be
>   solved with the ephmap but requires a bit of care to avoid opening CPU attack
>   windows.

Right, seems that having a per-CPU mapping is a generally useful thing. I wonder
if we can actually generalise this past ASI...

By the way a random thought, but we really do need some generic page table code,
there's mm/pagewalk.c which has install_pte(), but David and I have spoken quite
a few times about generalising past this (watch this space).

I do intend to add install_pmd() and install_pud() also for the purposes of one
of my currently many pending series :P

>
> - In copy-on-write for user pages. The ephmap could also help here but the
>   current implementation doesn't support it (it only allows one allocation at a
>   time per context).

Hmm, CoW generally a pain. Could you go into more detail as to what's the issue
here?

>
> .:: Next steps
>
> Here's where I'd like to go next:
>
> 1. Discuss here and get feedback from x86 folks. Dave H said we need "line of
>    sight" to a version of ASI that's viable for sandboxing native workloads. I
>    don't consider a 13% slowdown "viable" as-is, but I do think this shows we're
>    out of the "but what about the page cache" black hole. It seems provably
>    solvable now.

Yes I agree.

Obviously it'd be great to get some insight from x86 guys, but strikes me we're
still broadly in mm territory here.

I do think the next step is to take the original ASI series, make it fully
upstreamable, and simply introduce the CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
flag, default to N of course, without the ephmap work yet in place, rather a
minimal implementation.

And in the config/docs/commit msgs etc. you can indicate its limitations and
perf overhead.

I think with numerous RFCs and talks we're good for you to just send that as a
normal series and get some proper review going and ideally some bots running
with ASI switched on also (all* + random configs should do that for free) + some
syzbot action.

That way we have the roots in place and can build further upon that, but nobody
is impacted unless they decide to consciously opt in despite the documented
overhead + limitations.

>
> 2. Once we have some x86 maintainers saying "yep, it looks like this can work
>    and it's something we want", I can start turning my page_alloc RFC [3] into a
>    proper patchset (or maybe multiple if I can find a way to break things down
>    further).
>
> Note what I'm NOT proposing is to carry on working on this branch until ASI is
> as fast as I am claiming it eventually will be. I would like to avoid doing that
> since I believe the biggest unknowns on that path are now solved, and it would
> be more useful to start getting down to nuts and bolts, i.e. reviewing real,
> PATCH-quality code and merging precursor stuff. I think this will lead to more
> useful discussions about the overall design, since so far all my postings have
> been so long and rarefied that it's been hard to really get a good conversation
> going.

Yes absolutely agreed.

Send the ASI core series as normal series and let's get the base stuff in tree
and some serious review going.

>
> .:: Conclusion
>
> So, x86 folks: Does this feel like "line of sight" to you? If not, what would
> that look like, what experiments should I run?

From an mm point of view, I think obviously the ephmap stuff you have now is
hacky (as you point out clearly in [5] yourself :) but the general approach
seems sensible.

>
> ---
>
> [0] https://lore.kernel.org/lkml/DAJ0LUX8F2IW.Q95PTFBNMFOI@google.com/
> [1] https://lore.kernel.org/linux-mm/20250129144320.2675822-1-jackmanb@google.com/
> [2] https://lore.kernel.org/linux-mm/20190612170834.14855-1-mhillenb@amazon.de/
> [3] https://lore.kernel.org/lkml/20250313-asi-page-alloc-v1-0-04972e046cea@google.com/
> [4] https://github.com/bjackman/nixos-flake/commit/be42ba326f8a0854deb1d37143b5c70bf301c9db
> [5] https://github.com/bjackman/linux/tree/asi/6.16
>

Cheers, Lorenzo


* Re: [Discuss] First steps for ASI (ASI is fast again)
  2025-08-21  8:55 ` Lorenzo Stoakes
@ 2025-08-21 12:15   ` Brendan Jackman
  2025-08-22 14:22     ` Lorenzo Stoakes
  2025-08-22 16:56     ` Uladzislau Rezki
  0 siblings, 2 replies; 9+ messages in thread
From: Brendan Jackman @ 2025-08-21 12:15 UTC
  To: Lorenzo Stoakes
  Cc: peterz, bp, dave.hansen, mingo, tglx, akpm, david, derkling,
	junaids, linux-kernel, linux-mm, reijiw, rientjes, rppt, vbabka,
	x86, yosry.ahmed, Matthew Wilcox, Liam Howlett,
	Kirill A. Shutemov, Harry Yoo, Jann Horn, Pedro Falcato,
	Andy Lutomirski, Josh Poimboeuf, Kees Cook

On Thu Aug 21, 2025 at 8:55 AM UTC, Lorenzo Stoakes wrote:
> +cc Matthew for page cache side
> +cc Other memory mapping folks for mapping side
> +cc various x86 folks for x86 side
> +cc Kees for security side of things
>
> On Tue, Aug 12, 2025 at 05:31:09PM +0000, Brendan Jackman wrote:
>> .:: Intro
>>
>> Following up to the plan I posted at [0], I've now prepared an up-to-date ASI
>> branch that demonstrates a technique for solving the page cache performance
>> devastation I described in [1]. The branch is at [5].
>
> Have looked through your branch at [5], note that the exit_mmap() code is
> changing very soon see [ljs0]. Also with regard to PGD syncing, Harry introduced
> a hotfix series recently to address issues around this generalising this PGD
> sync code which may be usefully relevant to your series.
>
> [ljs0]:https://lore.kernel.org/linux-mm/20250815191031.3769540-1-Liam.Howlett@oracle.com/
> [ljs1]:https://lore.kernel.org/linux-mm/20250818020206.4517-1-harry.yoo@oracle.com/

Thanks, this is useful info.

>>
>> The goal of this prototype is to increase confidence that ASI is viable as a
>> broad solution for CPU vulnerabilities. (If the community still has to develop
>> and maintain new mitigations for every individual vuln, because ASI only works
>> for certain use-cases, then ASI isn't super attractive given its complexity
>> burden).
>>
>> The biggest gap for establishing that confidence was that Google's deployment
>> still only uses ASI for KVM workloads, not bare-metal processes. And indeed the
>> page cache turned out to be a massive issue that Google just hasn't run up
>> against yet internally.
>>
>> .:: The "ephmap"
>>
>> I won't re-hash the details of the problem here (see [1]) but in short: file
>> pages aren't mapped into the physmap as seen from ASI's restricted address space.
>> This causes a major overhead when e.g. read()ing files. The solution we've
>> always envisaged (and which I very hastily tried to describe at LSF/MM/BPF this
>> year) was to simply stop read() etc from touching the physmap.
>>
>> This is achieved in this prototype by a mechanism that I've called the "ephmap".
>> The ephmap is a special region of the kernel address space that is local to the
>> mm (much like the "proclocal" idea from 2019 [2]). Users of the ephmap API can
>> allocate a subregion of this, and provide pages that get mapped into their
>> subregion. These subregions are CPU-local. This means that it's cheap to tear
>> these mappings down, so they can be removed immediately after use (eph =
>> "ephemeral"), eliminating the need for complex/costly tracking data structures.
>
> OK I had a bunch of questions here but looked at the code :)
>
> So the idea is we have a per-CPU buffer that is equal to the size of the largest
> possible folio, for each process.
>
> I wonder by the way if we can cache page tables rather than alloc on bring
> up/tear down? Or just zap? That could help things.

Yeah if I'm catching your gist correctly, we have done a bit of this in
the Google-internal version. In cases where it's fine to fail to map
stuff (as is the case for ephmap users in this branch) you can just have
a little pool of pre-allocated pagetables that you can allocate from in
arbitrary contexts. Maybe the ALLOC_TRYLOCK stuff could also be useful
here, I haven't explored that.
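
A minimal sketch of that "pool that is allowed to fail" idea, as a toy Python
model rather than anything resembling the Google-internal code: the class name,
try_alloc(), and the "fallback" path are all hypothetical. The only property
being illustrated is that allocation never blocks and the caller copes with
failure.

```python
class PagetablePool:
    """Fixed pool of pre-allocated tables; taking one never allocates.

    Hypothetical model: try_alloc() returns None when the pool is empty,
    and the caller is expected to fall back to a slower path.
    """

    def __init__(self, size):
        self._free = [bytearray(4096) for _ in range(size)]

    def try_alloc(self):
        return self._free.pop() if self._free else None

    def free(self, table):
        # Zero the table before returning it to the pool for reuse.
        table[:] = bytes(len(table))
        self._free.append(table)


def map_or_fallback(pool):
    """Use a pooled table if one is available, else take the slow path."""
    table = pool.try_alloc()
    if table is None:
        return "fallback"  # hypothetical slow path, taken when the pool is empty
    try:
        return "mapped"
    finally:
        pool.free(table)
```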

>>
>> (You might notice the ephmap is extremely similar to kmap_local_page() - see the
>> commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).
>
> I do wonder if we need to have a separate kmap thing or whether we can just
> adjust what already exists?

Yeah, I also wondered this. I think we could potentially just change the
semantics of kmap_local_page() to suit ASI's needs, but I'm not really
clear if that's consistent with the design or if there are perf
concerns regarding its existing usecase. I am hoping once we start to
get the more basic ASI stuff in, this will be a topic that will interest
the right people, and I'll be able to get some useful input...

> Presumably we will restrict ASI support to 64-bit kernels only (starting with
> and perhaps only for x86-64), so we can avoid the highmem bs.

Yep.

>>
>> The ephmap can then be used for accessing file pages. It's also a generic
>> mechanism for accessing sensitive data, for example it could be used for
>> zeroing sensitive pages, or if necessary for copy-on-write of user pages.
>>
>> .:: State of the branch
>>
>> The branch contains:
>>
>> - A rebased version of my "ASI integration for the page allocator" RFC [3]. (Up
>>   to "mm/page_alloc: Add support for ASI-unmapping pages")
>> - The rest of ASI's basic functionality (up to "mm: asi: Stop ignoring asi=on
>>   cmdline flag")
>> - Some test and observability conveniences (up to "mm: asi: Add a tracepoint for
>>   ASI page faults")
>> - A prototype of the new performance improvements (the remainder of the
>>   branch).
>>
>> There's a gradient of quality where the earlier patches are closer to "complete"
>> and the later ones are increasingly messy and hacky. Comments and commit message
>> describe lots of the hacky elements but the most important things are:
>>
>> 1. The logic to take advantage of the ephmap is stuck directly into mm/shmem.c.
>>    This is just a shortcut to make its behaviour obvious. Since tmpfs is the
>>    most extreme case of the read/write slowdown this should give us some idea of
>>    the performance improvements but it obviously hides a lot of important
>>    complexity wrt how this would be integrated "for real".
>
> Right, at what level do you plan to put the 'real' stuff?
>
> generic_file_read_iter() + equivalent or something like this? But then you'd
> miss some fs obv., so I guess filemap_read()?

Yeah, just putting it into this generic stuff seemed like the most
obvious way, but I was also hoping there could be some more general way
to integrate it into the page cache or even something like the iov
system. I did not see anything like this yet, but I don't think I've
done the full quota of code-gazing that I'd need to come up with the
best idea here. (Also maybe the solution becomes obvious if I can find
the right pair of eyes).

Anyway, my hope is that the number of filesystems that are both a) very
special implementation-wise and b) dear to the hearts of
performance-sensitive users is quite small, so maybe just injecting into
the right pre-existing filemap.c helpers, plus one or two
filesystem-specific additions, already gets us almost all the way there.

>>
>> 2. The ephmap implementation is extremely stupid. It only works for the simple
>>    shmem usecase. I don't think this is really important though, whatever we end
>>    up with needs to be very simple, and it's not even clear that we actually
>>    want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
>>    kmap_local_page() itself).
>
> Right just testing stuff out, fair enough. Obviously not an upstreamable thing
> but sort of test case right?

Yeah exactly.

Maybe worth adding here that I explored just using vmalloc's allocator
for this. My experience was that despite looking quite nicely optimised
re avoiding synchronisation, just the simple fact of traversing its data
structures is too slow for this usecase (at least, it did poorly on my
super-sensitive FIO benchmark setup).

>> 3. For software correctness, the ephmap only needs to be TLB-flushed on the
>>    local CPU. But for CPU vulnerability mitigation, flushes are needed on other
>>    CPUs too. I believe these flushes should only be needed very infrequently.
>>    "Add ephmap TLB flushes for mitigating CPU vulns" is an illustrative idea of
>>    how these flushes could be implemented, but it's a bit of a simplistic
>>    implementation. The commit message has some more details.
>
> Yeah, I am no security/x86 expert so you'll need insight from those with a
> better understanding of both, but I think it's worth taking the time to have
> this do the minimum possible that we can prove is necessary in any real-world
> scenario.

I can also add a bit of colour here in case it piques any interest.

What I think we can do is an mm-global flush whenever there's a
possibility that the process is losing logical access to a physical
page. So basically I think that's whenever we evict from the page cache,
or the user closes a file.

("Logical access" = we would let them do a read() that gives them the
contents of the page).

The key insight is that a) those events are reeelatively rare and b)
already often involve big TLB flushes. So doing global flushes there is
not that bad, and this allows us to forget about all the particular
details of which pages might have TLB entries on which CPUs and just say
"_some_ CPU in this MM might have _some_ stale TLB entry", which is
simple and efficient to track.
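
One way to read the "_some_ CPU might have _some_ stale entry" tracking is as a single flag per mm. The following is only a sketch under that assumption: struct mm_sketch and both function names are hypothetical, and the counter stands in for a real mm-wide TLB flush.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-mm state; not a real kernel structure. */
struct mm_sketch {
	/* "Some CPU in this mm might hold some stale ephmap TLB entry." */
	atomic_bool ephmap_maybe_stale;
	/* Stand-in for the expensive part: counts mm-wide flushes issued. */
	unsigned long nr_global_flushes;
};

/*
 * Called after an ephmap mapping is torn down with only a local TLB
 * flush. Other CPUs may still hold an entry, so just record that.
 */
static void ephmap_note_unmap(struct mm_sketch *mm)
{
	atomic_store(&mm->ephmap_maybe_stale, true);
}

/*
 * Called where the process may lose logical access to a page (page
 * cache eviction, closing a file). Flushes only if something may be
 * stale. Returns true if a global flush was issued.
 */
static bool ephmap_access_revoked(struct mm_sketch *mm)
{
	if (!atomic_exchange(&mm->ephmap_maybe_stale, false))
		return false;
	mm->nr_global_flushes++;	/* a real version would flush all CPUs of the mm */
	return true;
}
```

Any number of unmaps collapse into one flag, so the cost is paid at most once per revocation event rather than once per mapping.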

So yeah actually this doesn't really require too much security
understanding, it's mostly just a job of making sure we don't forget a
place where the flush would be needed, and then tying it nicely with the
existing TLB infrastructure so that we can aggregate the flushes and
avoid redundant IPIs. It's fiddly, but in a fun way. So I think this is
"the easy bit".

> It's good to start super conservative though.
>
>>
>> .:: Performance
>>
>> This data was gathered using the scripts at [4]. This is running on a Sapphire
>> Rapids machine, but with setcpuid=retbleed. This introduces an IBPB in
>> asi_exit(), which dramatically amplifies the performance impact of ASI. We don't
>> know of any vulns that would necessitate this IBPB, so this is basically a weird
>> selectively-paranoid configuration of ASI. It doesn't really make sense from a
>> security perspective. A few years from now (once the security researchers have
>> had their fun) we'll know what's _really_ needed on this CPU, it's very unlikely
>> that it turns out to be exactly an IBPB like this, but it's reasonably likely to
>> be something with a vaguely similar performance overhead.
>
> I mean, this all sounds like you should drop this :)
>
> What do the numbers look like without it?

Sure, let's see...

(Minor note: I said above that setcpuid=retbleed triggered this IBPB but
I just noticed that's wrong, in the code I've posted the IBPB is
hard-coded. So to disable it I'm setting clearcpuid=ibpb).

metric: compile-kernel_elapsed (ns)   |  test: compile-kernel_host
+---------+---------+--------+--------+--------+------+
| variant | samples |   mean |    min |    max | Δμ   |
+---------+---------+--------+--------+--------+------+
| asi-off |       0 | 35.10s | 35.00s | 35.16s |      |
| asi-on  |       0 | 36.85s | 36.77s | 37.00s | 5.0% |
+---------+---------+--------+--------+--------+------+

My first guess at the main source of that 5% would be the address space
switches themselves. At the moment you'll see that __asi_enter() and
asi_exit() always clear the noflush bit in CR3 meaning they trash the
TLB. This is not particularly difficult to address, it just means
extending all the existing stuff in tlb.c etc to deal with an additional
address space (this is done in Google's internal version).

(But getting rid of the asi_exits() completely is the higher-priority
optimisation. On most CPUs that TLB trashing is gonna be less
significant than the actual security flushes, which can't be avoided if
we do transition. This is why I introduced the IBPB, since otherwise
Sapphire Rapids makes things look a bit too easy. See the bullet points
below for what I think is needed to eliminate most of the transitions).

>> Native FIO randread IOPS on tmpfs (this is where the 70% perf degradation was):
>> +---------+---------+-----------+---------+-----------+---------------+
>> | variant | samples |      mean |     min |       max | delta mean    |
>> +---------+---------+-----------+---------+-----------+---------------+
>> | asi-off |      10 | 1,003,102 | 981,813 | 1,036,142 |               |
>> | asi-on  |      10 |   871,928 | 848,362 |   885,622 | -13.1%        |
>> +---------+---------+-----------+---------+-----------+---------------+
>>
>> Native kernel compilation time:
>> +---------+---------+--------+--------+--------+-------------+
>> | variant | samples |   mean |    min |    max | delta mean  |
>> +---------+---------+--------+--------+--------+-------------+
>> | asi-off |       3 | 34.84s | 34.42s | 35.31s |             |
>> | asi-on  |       3 | 37.50s | 37.39s | 37.58s | 7.6%        |
>> +---------+---------+--------+--------+--------+-------------+
>>
>> Kernel compilation in a guest VM:
>> +---------+---------+--------+--------+--------+-------------+
>> | variant | samples |   mean |    min |    max | delta mean  |
>> +---------+---------+--------+--------+--------+-------------+
>> | asi-off |       3 | 52.73s | 52.41s | 53.15s |             |
>> | asi-on  |       3 | 55.80s | 55.51s | 56.06s | 5.8%        |
>> +---------+---------+--------+--------+--------+-------------+
>
> (tiny nit but I think the bottom two are meant to be negative or the first
> positive :P)

The polarities are correct - more FIO IOPS is better, more kernel
compilation duration is worse. (Maybe I should make my scripts aware of
which direction is better for each metric!)
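
For illustration, a polarity-aware delta could look roughly like this. It is not taken from the scripts at [4]; the enum and function names are made up for the example.

```c
#include <stdbool.h>

/* Which direction counts as an improvement for a given metric. */
enum better_dir {
	HIGHER_IS_BETTER,	/* e.g. FIO IOPS */
	LOWER_IS_BETTER,	/* e.g. kernel compilation time */
};

/* Relative change of the variant's mean versus the baseline, in percent. */
static double delta_mean_pct(double baseline, double variant)
{
	return (variant - baseline) / baseline * 100.0;
}

/* True if the change is a regression for a metric with this polarity. */
static bool is_regression(enum better_dir dir, double baseline, double variant)
{
	double d = delta_mean_pct(baseline, variant);

	return dir == HIGHER_IS_BETTER ? d < 0.0 : d > 0.0;
}
```

With the numbers quoted above, delta_mean_pct(1003102, 871928) is negative and delta_mean_pct(34.84, 37.50) is positive, yet is_regression() reports both as regressions once each metric is tagged with its direction.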

>> Despite my title these numbers are kinda disappointing to be honest, it's not
>> where I wanted to be by now, but it's still an order-of-magnitude better than
>> where we were for native FIO a few months ago. I believe almost all of this
>> remaining slowdown is due to unnecessary ASI exits, the key areas being:
>
> Nice, this broad approach does seem simple.
>
> Obviously we really do need to see these numbers come down significantly for
> this to be reasonably workable, as this kind of perf impact could really add up
> at scale.
>
> But from all you say it seems very plausible that we can in fact significantly
> reduce this.
>
> Am guessing the below are general issues that are holding back ASI as a whole
> perf-wise?
>
>>
>> - On every context_switch(). Google's internal implementation has fixed this (we
>>   only really need it when switching mms).
>
> How did you guys fix this?

The only issue here is that it makes CR3 unstable in places where it was
formerly stable: if you're in the restricted address space, an interrupt
might show up and cause an asi_exit() at any time. (CR3 is already
unstable when preemption is on because the PCID can get recycled). So we
just had to update the CR3 accessor API and then hunt for places that
access CR3 directly.

Other than that, we had to fiddle around with the lifetime of struct asi
a bit (this doesn't really add complexity TBH, we just made it live as
long as the mm_struct). Then we can stay in the restricted address space
across context_switch() within the same mm, including to a kthread and
back.

>> - Whenever zeroing sensitive pages from the allocator. This could potentially be
>>   solved with the ephmap but requires a bit of care to avoid opening CPU attack
>>   windows.
>
> Right, seems that having a per-CPU mapping is a generally useful thing. I wonder
> if we can actually generalise this past ASI...
>
> By the way a random thought, but we really do need some generic page table code,
> there's mm/pagewalk.c which has install_pte(), but David and I have spoken quite
> a few times about generalising past this (watch this space).

OK good to know, Yosry and I both did some fiddling around trying to
come up with cute ways to write this kinda code but in the end I think
the best way is quite dependent on maintainer preference.

> I do intend to add install_pmd() and install_pud() also for the purposes of one
> of my currently many pending series :P
>
>>
>> - In copy-on-write for user pages. The ephmap could also help here but the
>>   current implementation doesn't support it (it only allows one allocation at a
>>   time per context).
>
> Hmm, CoW generally a pain. Could you go into more detail as to what's the issue
> here?

It's just that you have two user pages that you wanna touch at once
(src, dst). This crappy ephmap implementation doesn't support two
mappings at once in the same context, so the second allocation fails, so
you always get an asi_exit().
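
As a toy model of the limitation (again, not the branch's code): treat each context as having a fixed number of mapping slots. The prototype effectively has one; the sketch below assumes a hypothetical EPHMAP_SLOTS of 2, which is what CoW would need. All names here are invented for the example, and page pointers are assumed to be non-NULL.

```c
#include <stddef.h>

#define EPHMAP_SLOTS 2	/* hypothetical; the prototype effectively has 1 */

/* Per-context ephmap state: which page, if any, occupies each slot. */
struct ephmap_ctx {
	const void *slot[EPHMAP_SLOTS];	/* NULL means the slot is free */
};

/*
 * Claim a free slot for @page. Returns the slot index, or -1 if every
 * slot is busy, in which case the caller has to fall back to touching
 * the page via the physmap (i.e. take an asi_exit()).
 */
static int ephmap_get_slot(struct ephmap_ctx *ctx, const void *page)
{
	for (int i = 0; i < EPHMAP_SLOTS; i++) {
		if (ctx->slot[i] == NULL) {
			ctx->slot[i] = page;
			return i;
		}
	}
	return -1;
}

/* Release a slot previously returned by ephmap_get_slot(). */
static void ephmap_put_slot(struct ephmap_ctx *ctx, int idx)
{
	if (idx >= 0 && idx < EPHMAP_SLOTS)
		ctx->slot[idx] = NULL;
}
```

With EPHMAP_SLOTS set to 1, the second ephmap_get_slot() call for the CoW destination returns -1, which models the failure described above; with 2, both src and dst fit.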

>>
>> .:: Next steps
>>
>> Here's where I'd like to go next:
>>
>> 1. Discuss here and get feedback from x86 folks. Dave H said we need "line of
>>    sight" to a version of ASI that's viable for sandboxing native workloads. I
>>    don't consider a 13% slowdown "viable" as-is, but I do think this shows we're
>>    out of the "but what about the page cache" black hole. It seems provably
>>    solvable now.
>
> Yes I agree.
>
> Obviously it'd be great to get some insight from x86 guys, but strikes me we're
> still broadly in mm territory here.

Implementation-wise, certainly. It's just that I'd prefer not to take
up loads of everyone's time hashing out implementation details if
there's a risk that the x86 guys NAK it when we get to their part.

> I do think the next step is to take the original ASI series, make it fully
> upstreamable, and simply introduce the CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
> flag, default to N of course, without the ephmap work yet in place, rather a
> minimal implementation.

I think even this would actually be too big, reviewing all that at once
would be quite unpleasant even in the absolutely minimal case. But yes I
think we can get a series-of-series that does this :)

> And in the config/docs/commit msgs etc. you can indicate its limitations and
> perf overhead.
>
> I think with numerous RFC's and talks we're good for you to just send that as a
> normal series and get some proper review going and ideally some bots running
> with ASI switched on also (all* + random configs should do that for free) + some
> syzbot action.
>
> That way we have the roots in place and can build further upon that, but nobody
> is impacted unless they decide to consciously opt in despite the documented
> overhead + limitations.
>
>>
>> 2. Once we have some x86 maintainers saying "yep, it looks like this can work
>>    and it's something we want", I can start turning my page_alloc RFC [3] into a
>>    proper patchset (or maybe multiple if I can find a way to break things down
>>    further).
>>
>> Note what I'm NOT proposing is to carry on working on this branch until ASI is
>> as fast as I am claiming it eventually will be. I would like to avoid doing that
>> since I believe the biggest unknowns on that path are now solved, and it would
>> be more useful to start getting down to nuts and bolts, i.e. reviewing real,
>> PATCH-quality code and merging precursor stuff. I think this will lead to more
>> useful discussions about the overall design, since so far all my postings have
>> been so long and rarefied that it's been hard to really get a good conversation
>> going.
>
> Yes absolutely agreed.
>
> Send the ASI core series as normal series and let's get the base stuff in tree
> and some serious review going.
>
>>
>> .:: Conclusion
>>
>> So, x86 folks: Does this feel like "line of sight" to you? If not, what would
>> that look like, what experiments should I run?
>
> From an mm point of view, I think obviously the ephmap stuff you have now is
> hacky (as you point out clearly in [5] yourself :) but the general approach
> seems sensible.

Great, thanks so much for taking a look!



* Re: [Discuss] First steps for ASI (ASI is fast again)
  2025-08-21 12:15   ` Brendan Jackman
@ 2025-08-22 14:22     ` Lorenzo Stoakes
  2025-08-22 17:18       ` Matthew Wilcox
  2025-08-22 16:56     ` Uladzislau Rezki
  1 sibling, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-08-22 14:22 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: peterz, bp, dave.hansen, mingo, tglx, akpm, david, derkling,
	junaids, linux-kernel, linux-mm, reijiw, rientjes, rppt, vbabka,
	x86, yosry.ahmed, Matthew Wilcox, Liam Howlett,
	Kirill A. Shutemov, Harry Yoo, Jann Horn, Pedro Falcato,
	Andy Lutomirski, Josh Poimboeuf, Kees Cook

On Thu, Aug 21, 2025 at 12:15:04PM +0000, Brendan Jackman wrote:
> > OK I had a bunch of questions here but looked at the code :)
> >
> > So the idea is we have a per-CPU buffer that is equal to the size of the largest
> > possible folio, for each process.
> >
> > I wonder by the way if we can cache page tables rather than alloc on bring
> > up/tear down? Or just zap? That could help things.
>
> Yeah if I'm catching your gist correctly, we have done a bit of this in
> the Google-internal version. In cases where it's fine to fail to map
> stuff (as is the case for ephmap users in this branch) you can just have
> a little pool of pre-allocated pagetables that you can allocate from in
> arbitrary contexts. Maybe the ALLOC_TRYLOCK stuff could also be useful
> here, I haven't explored that.

Yeah nice, seems an easy win!

>
> >>
> >> (You might notice the ephmap is extremely similar to kmap_local_page() - see the
> >> commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).
> >
> > I do wonder if we need to have a separate kmap thing or whether we can just
> > adjust what already exists?
>
> Yeah, I also wondered this. I think we could potentially just change the
> semantics of kmap_local_page() to suit ASI's needs, but I'm not really
> clear if that's consistent with the design or if there are perf
> concerns regarding its existing usecase. I am hoping once we start to
> get the more basic ASI stuff in, this will be a topic that will interest
> the right people, and I'll be able to get some useful input...

I think Matthew again might have some thoughts here.

>
> > Presumably we will restrict ASI support to 64-bit kernels only (starting with
> > and perhaps only for x86-64), so we can avoid the highmem bs.
>
> Yep.

Cool. If only we could move the rest of the kernel to this :)

>
> >>
> >> The ephmap can then be used for accessing file pages. It's also a generic
> >> mechanism for accessing sensitive data, for example it could be used for
> >> zeroing sensitive pages, or if necessary for copy-on-write of user pages.
> >>
> >> .:: State of the branch
> >>
> >> The branch contains:
> >>
> >> - A rebased version of my "ASI integration for the page allocator" RFC [3]. (Up
> >>   to "mm/page_alloc: Add support for ASI-unmapping pages")
> >> - The rest of ASI's basic functionality (up to "mm: asi: Stop ignoring asi=on
> >>   cmdline flag")
> >> - Some test and observability conveniences (up to "mm: asi: Add a tracepoint for
> >>   ASI page faults")
> >> - A prototype of the new performance improvements (the remainder of the
> >>   branch).
> >>
> >> There's a gradient of quality where the earlier patches are closer to "complete"
> >> and the later ones are increasingly messy and hacky. Comments and commit message
> >> describe lots of the hacky elements but the most important things are:
> >>
> >> 1. The logic to take advantage of the ephmap is stuck directly into mm/shmem.c.
> >>    This is just a shortcut to make its behaviour obvious. Since tmpfs is the
> >>    most extreme case of the read/write slowdown this should give us some idea of
> >>    the performance improvements but it obviously hides a lot of important
> >>    complexity wrt how this would be integrated "for real".
> >
> > Right, at what level do you plan to put the 'real' stuff?
> >
> > generic_file_read_iter() + equivalent or something like this? But then you'd
> > miss some fs obv., so I guess filemap_read()?
>
> Yeah, just putting it into this generic stuff seemed like the most
> obvious way, but I was also hoping there could be some more general way
> to integrate it into the page cache or even something like the iov
> system. I did not see anything like this yet, but I don't think I've
> done the full quota of code-gazing that I'd need to come up with the
> best idea here. (Also maybe the solution becomes obvious if I can find
> the right pair of eyes).

I think you'd need filemap_read() and possibly filemap_splice_read()? Not
sure iterator stuff is right level of abstraction at all as should be
explicitly about page cache, but then maybe we just want to use this
_generally_? Probably a combination of:

- Checking what every filesystem ultimately uses
- Empirically testing different approaches

Is the way to go.

>
> Anyway, my hope is that the number of filesystems that are both a) very
> special implementation-wise and b) dear to the hearts of
> performance-sensitive users is quite small, so maybe just injecting into
> the right pre-existing filemap.c helpers, plus one or two
> filesystem-specific additions, already gets us almost all the way there.

Yeah I think the bulk use some form of generic_*().

>
> >>
> >> 2. The ephmap implementation is extremely stupid. It only works for the simple
> >>    shmem usecase. I don't think this is really important though, whatever we end
> >>    up with needs to be very simple, and it's not even clear that we actually
> >>    want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
> >>    kmap_local_page() itself).
> >
> > Right just testing stuff out, fair enough. Obviously not an upstreamable thing
> > but sort of test case right?
>
> Yeah exactly.
>
> Maybe worth adding here that I explored just using vmalloc's allocator
> for this. My experience was that despite looking quite nicely optimised
> re avoiding synchronisation, just the simple fact of traversing its data
> structures is too slow for this usecase (at least, it did poorly on my
> super-sensitive FIO benchmark setup).

Yeah I think honestly vmalloc is fairly unoptimised in many ways, while
Ulad is doing fantastic work, there's a lot of legacy cruft and duplication
there.

>
> >> 3. For software correctness, the ephmap only needs to be TLB-flushed on the
> >>    local CPU. But for CPU vulnerability mitigation, flushes are needed on other
> >>    CPUs too. I believe these flushes should only be needed very infrequently.
> >>    "Add ephmap TLB flushes for mitigating CPU vulns" is an illustrative idea of
> >>    how these flushes could be implemented, but it's a bit of a simplistic
> >>    implementation. The commit message has some more details.
> >
> > Yeah, I am no security/x86 expert so you'll need insight from those with a
> > better understanding of both, but I think it's worth taking the time to have
> > this do the minimum possible that we can prove is necessary in any real-world
> > scenario.
>
> I can also add a bit of colour here in case it piques any interest.
>
> What I think we can do is an mm-global flush whenever there's a
> possibility that the process is losing logical access to a physical
> page. So basically I think that's whenever we evict from the page cache,
> or the user closes a file.
>
> ("Logical access" = we would let them do a read() that gives them the
> contents of the page).
>
> The key insight is that a) those events are reeelatively rare and b)
> already often involve big TLB flushes. So doing global flushes there is
> not that bad, and this allows us to forget about all the particular
> details of which pages might have TLB entries on which CPUs and just say
> "_some_ CPU in this MM might have _some_ stale TLB entry", which is
> simple and efficient to track.

I guess rare to get truncation mid-way through a read(), closing it mid-way
would be... a bug surely? :P

I may be missing context here however.

But yes we can probably not worry at all about perf of _that_

>
> So yeah actually this doesn't really require too much security
> understanding, it's mostly just a job of making sure we don't forget a
> place where the flush would be needed, and then tying it nicely with the
> existing TLB infrastructure so that we can aggregate the flushes and
> avoid redundant IPIs. It's fiddly, but in a fun way. So I think this is
> "the easy bit".
>

Cool.

I guess starting conservative is sensible for security though.

> > It's good to start super conservative though.
> >
> >>
> >> .:: Performance
> >>
> >> This data was gathered using the scripts at [4]. This is running on a Sapphire
> >> Rapids machine, but with setcpuid=retbleed. This introduces an IBPB in
> >> asi_exit(), which dramatically amplifies the performance impact of ASI. We don't
> >> know of any vulns that would necessitate this IBPB, so this is basically a weird
> >> selectively-paranoid configuration of ASI. It doesn't really make sense from a
> >> security perspective. A few years from now (once the security researchers have
> >> had their fun) we'll know what's _really_ needed on this CPU, it's very unlikely
> >> that it turns out to be exactly an IBPB like this, but it's reasonably likely to
> >> be something with a vaguely similar performance overhead.
> >
> > I mean, this all sounds like you should drop this :)
> >
> > What do the numbers look like without it?
>
> Sure, let's see...
>
> (Minor note: I said above that setcpuid=retbleed triggered this IBPB but
> I just noticed that's wrong, in the code I've posted the IBPB is
> hard-coded. So to disable it I'm setting clearcpuid=ibpb).
>
> metric: compile-kernel_elapsed (ns)   |  test: compile-kernel_host
> +---------+---------+--------+--------+--------+------+
> | variant | samples |   mean |    min |    max | Δμ   |
> +---------+---------+--------+--------+--------+------+
> | asi-off |       0 | 35.10s | 35.00s | 35.16s |      |
> | asi-on  |       0 | 36.85s | 36.77s | 37.00s | 5.0% |
> +---------+---------+--------+--------+--------+------+
>
> My first guess at the main source of that 5% would be the address space
> switches themselves. At the moment you'll see that __asi_enter() and
> asi_exit() always clear the noflush bit in CR3 meaning they trash the
> TLB. This is not particularly difficult to address, it just means
> extending all the existing stuff in tlb.c etc to deal with an additional
> address space (this is done in Google's internal version).

Cool, sounds like it would just be a bit fiddly then.

>
> (But getting rid of the asi_exits() completely is the higher-priority
> optimisation. On most CPUs that TLB trashing is gonna be less
> significant than the actual security flushes, which can't be avoided if
> we do transition. This is why I introduced the IBPB, since otherwise
> Sapphire Rapids makes things look a bit too easy. See the bullet points
> below for what I think is needed to eliminate most of the transitions).
>

Ack.

> >> Native FIO randread IOPS on tmpfs (this is where the 70% perf degradation was):
> >> +---------+---------+-----------+---------+-----------+---------------+
> >> | variant | samples |      mean |     min |       max | delta mean    |
> >> +---------+---------+-----------+---------+-----------+---------------+
> >> | asi-off |      10 | 1,003,102 | 981,813 | 1,036,142 |               |
> >> | asi-on  |      10 |   871,928 | 848,362 |   885,622 | -13.1%        |
> >> +---------+---------+-----------+---------+-----------+---------------+
> >>
> >> Native kernel compilation time:
> >> +---------+---------+--------+--------+--------+-------------+
> >> | variant | samples |   mean |    min |    max | delta mean  |
> >> +---------+---------+--------+--------+--------+-------------+
> >> | asi-off |       3 | 34.84s | 34.42s | 35.31s |             |
> >> | asi-on  |       3 | 37.50s | 37.39s | 37.58s | 7.6%        |
> >> +---------+---------+--------+--------+--------+-------------+
> >>
> >> Kernel compilation in a guest VM:
> >> +---------+---------+--------+--------+--------+-------------+
> >> | variant | samples |   mean |    min |    max | delta mean  |
> >> +---------+---------+--------+--------+--------+-------------+
> >> | asi-off |       3 | 52.73s | 52.41s | 53.15s |             |
> >> | asi-on  |       3 | 55.80s | 55.51s | 56.06s | 5.8%        |
> >> +---------+---------+--------+--------+--------+-------------+
> >
> > (tiny nit but I think the bottom two are meant to be negative or the first
> > positive :P)
>
> The polarities are correct - more FIO IOPS is better, more kernel
> compilation duration is worse. (Maybe I should make my scripts aware of
> which direction is better for each metric!)
>

Ahhh so, right. I just saw it as a raw directional delta so either you
decide +ve or -ve is good. But that makes sense!


> >> Despite my title these numbers are kinda disappointing to be honest, it's not
> >> where I wanted to be by now, but it's still an order-of-magnitude better than
> >> where we were for native FIO a few months ago. I believe almost all of this
> >> remaining slowdown is due to unnecessary ASI exits, the key areas being:
> >
> > Nice, this broad approach does seem simple.
> >
> > Obviously we really do need to see these numbers come down significantly for
> > this to be reasonably workable, as this kind of perf impact could really add up
> > at scale.
> >
> > But from all you say it seems very plausible that we can in fact significantly
> > reduce this.
> >
> > Am guessing the below are general issues that are holding back ASI as a whole
> > perf-wise?
> >
> >>
> >> - On every context_switch(). Google's internal implementation has fixed this (we
> >>   only really need it when switching mms).
> >
> > How did you guys fix this?
>
> The only issue here is that it makes CR3 unstable in places where it was
> formerly stable: if you're in the restricted address space, an interrupt
> might show up and cause an asi_exit() at any time. (CR3 is already
> unstable when preemption is on because the PCID can get recycled). So we
> just had to update the CR3 accessor API and then hunt for places that
> access CR3 directly.

Ack. Doesn't seem... too egregious?

>
> Other than that, we had to fiddle around with the lifetime of struct asi
> a bit (this doesn't really add complexity TBH, we just made it live as
> long as the mm_struct). Then we can stay in the restricted address space
> across context_switch() within the same mm, including to a kthread and
> back.
>
Ugh mm lifetime is already a bit horrendous with the various forking stuff
and exit_mmap() is a horrendous nightmare, ref. Liam's recent RFC on this.

So need to tread carefully :)


> >> - Whenever zeroing sensitive pages from the allocator. This could potentially be
> >>   solved with the ephmap but requires a bit of care to avoid opening CPU attack
> >>   windows.
> >
> > Right, seems that having a per-CPU mapping is a generally useful thing. I wonder
> > if we can actually generalise this past ASI...
> >
> > By the way a random thought, but we really do need some generic page table code,
> > there's mm/pagewalk.c which has install_pte(), but David and I have spoken quite
> > a few times about generalising past this (watch this space).
>
> OK good to know, Yosry and I both did some fiddling around trying to
> come up with cute ways to write this kinda code but in the end I think
> the best way is quite dependent on maintainer preference.
>

Yeah, the whole situation is a bit of a mess still tbh. Let's see on review.

> > I do intend to add install_pmd() and install_pud() also for the purposes of one
> > of my currently many pending series :P
> >
> >>
> >> - In copy-on-write for user pages. The ephmap could also help here but the
> >>   current implementation doesn't support it (it only allows one allocation at a
> >>   time per context).
> >
> > Hmm, CoW generally a pain. Could you go into more detail as to what's the issue
> > here?
>
> It's just that you have two user pages that you wanna touch at once
> (src, dst). This crappy ephmap implementation doesn't support two
> mappings at once in the same context, so the second allocation fails, so
> you always get an asi_exit().

Right... well like can we just have space for 2 then? ;) it's mappings not
actually allocating pages so... :)

>
> >>
> >> .:: Next steps
> >>
> >> Here's where I'd like to go next:
> >>
> >> 1. Discuss here and get feedback from x86 folks. Dave H said we need "line of
> >>    sight" to a version of ASI that's viable for sandboxing native workloads. I
> >>    don't consider a 13% slowdown "viable" as-is, but I do think this shows we're
> >>    out of the "but what about the page cache" black hole. It seems provably
> >>    solvable now.
> >
> > Yes I agree.
> >
> > Obviously it'd be great to get some insight from x86 guys, but strikes me we're
> > still broadly in mm territory here.
>
> Implementation-wise, certainly. It's just that I'd prefer not to take
> up loads of everyone's time hashing out implementation details if
> there's a risk that the x86 guys NAK it when we get to their part.

I think it's better to just go ahead with the series, everybody's super
busy, so you're less likely to get meaningful responses like this by doing
so.

>
> > I do think the next step is to take the original ASI series, make it fully
> > upstreamable, and simply introduce the CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
> > flag, default to N of course, without the ephmap work yet in place, rather a
> > minimal implementation.
>
> I think even this would actually be too big, reviewing all that at once
> would be quite unpleasant even in the absolutely minimal case. But yes I
> think we can get a series-of-series that does this :)

Well generally I mean we should just get going with some iterative series :)

>
> > And in the config/docs/commit msgs etc. you can indicate its limitations and
> > perf overhead.
> >
> > I think with numerous RFC's and talks we're good for you to just send that as a
> > normal series and get some proper review going and ideally some bots running
> > with ASI switched on also (all* + random configs should do that for free) + some
> > syzbot action.
> >
> > That way we have the roots in place and can build further upon that, but nobody
> > is impacted unless they decide to consciously opt in despite the documented
> > overhead + limitations.
> >
> >>
> >> 2. Once we have some x86 maintainers saying "yep, it looks like this can work
> >>    and it's something we want", I can start turning my page_alloc RFC [3] into a
> >>    proper patchset (or maybe multiple if I can find a way to break things down
> >>    further).
> >>
> >> Note what I'm NOT proposing is to carry on working on this branch until ASI is
> >> as fast as I am claiming it eventually will be. I would like to avoid doing that
> >> since I believe the biggest unknowns on that path are now solved, and it would
> >> be more useful to start getting down to nuts and bolts, i.e. reviewing real,
> >> PATCH-quality code and merging precursor stuff. I think this will lead to more
> >> useful discussions about the overall design, since so far all my postings have
> >> been so long and rarefied that it's been hard to really get a good conversation
> >> going.
> >
> > Yes absolutely agreed.
> >
> > Send the ASI core series as normal series and let's get the base stuff in tree
> > and some serious review going.
> >
> >>
> >> .:: Conclusion
> >>
> >> So, x86 folks: Does this feel like "line of sight" to you? If not, what would
> >> that look like, what experiments should I run?
> >
> > From an mm point of view, I think obviously the ephmap stuff you have now is
> > hacky (as you point out clearly in [5] yourself :) but the general approach
> > seems sensible.
>
> Great, thanks so much for taking a look!

No problem :)

Cheers, Lorenzo



* Re: [Discuss] First steps for ASI (ASI is fast again)
  2025-08-22 14:22     ` Lorenzo Stoakes
@ 2025-08-22 17:18       ` Matthew Wilcox
  0 siblings, 0 replies; 9+ messages in thread
From: Matthew Wilcox @ 2025-08-22 17:18 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Brendan Jackman, peterz, bp, dave.hansen, mingo, tglx, akpm,
	david, derkling, junaids, linux-kernel, linux-mm, reijiw,
	rientjes, rppt, vbabka, x86, yosry.ahmed, Liam Howlett,
	Kirill A. Shutemov, Harry Yoo, Jann Horn, Pedro Falcato,
	Andy Lutomirski, Josh Poimboeuf, Kees Cook

On Fri, Aug 22, 2025 at 03:22:04PM +0100, Lorenzo Stoakes wrote:
> > What I think we can do is an mm-global flush whenever there's a
> > possibility that the process is losing logical access to a physical
> > page. So basically I think that's whenever we evict from the page cache,
> > or the user closes a file.
> >
> > ("Logical access" = we would let them do a read() that gives them the
> > contents of the page).
> >
> > The key insight is that a) those events are reeelatively rare and b)
> > already often involve big TLB flushes. So doing global flushes there is
> > not that bad, and this allows us to forget about all the particular
> > details of which pages might have TLB entries on which CPUs and just say
> > "_some_ CPU in this MM might have _some_ stale TLB entry", which is
> > simple and efficient to track.
> 
> I guess rare to get truncation mid-way through a read(), closing it mid-way
> would be... a bug surely? :P

Truncation isn't a problem.  The contents of the file were visible to
the process before.  The folio can't get recycled while we have a
reference to it.  You might get stale data, but that's just the race
going one way instead of the other.
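
For reference, the flush tracking quoted above could be modelled very
roughly like this. It is a userspace sketch using C11 atomics, not kernel
code: struct mm_eph_state, eph_note_mapping() and eph_revoke_access() are
made-up names, the counter stands in for a real mm-wide TLB flush, and
skipping the flush when nothing was mapped since the last one is my own
assumption rather than something stated in the thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical per-mm state: one bit meaning "some CPU in this mm might
 * have some stale TLB entry". No per-page or per-CPU tracking at all.
 */
struct mm_eph_state {
	atomic_bool maybe_stale;
	atomic_ulong global_flushes;	/* counts simulated mm-wide flushes */
};

/* Called whenever any CPU maps a page through the ephmap. */
static void eph_note_mapping(struct mm_eph_state *st)
{
	atomic_store(&st->maybe_stale, true);
}

/*
 * Called when the process may be losing logical access to a page
 * (page cache eviction, file close). Returns true if a flush was issued.
 */
static bool eph_revoke_access(struct mm_eph_state *st)
{
	/* Assumption: nothing mapped since the last flush means no flush. */
	if (!atomic_exchange(&st->maybe_stale, false))
		return false;

	atomic_fetch_add(&st->global_flushes, 1);	/* stand-in for the flush */
	return true;
}
```

Any number of mappings between two revocation events collapses into a
single flush, which is the point of tracking only one bit per mm.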

> > > Hmm, CoW generally a pain. Could you go into more detail as to what's the issue
> > > here?
> >
> > It's just that you have two user pages that you wanna touch at once
> > (src, dst). This crappy ephmap implementation doesn't support two
> > mappings at once in the same context, so the second allocation fails, so
> > you always get an asi_exit().
> 
> Right... well like can we just have space for 2 then? ;) it's mappings not
> actually allocating pages so... :)

For reference, kmap_local/atomic supports up to 16 at once.  That may
be excessive, but it's cheap.  Of course, kmap only supports a single
page at a time, not an entire folio.  Now, the tradeoffs for kmap_local
are based on how much address space is available to a 32-bit process (ie
1GB, shared between lowmem, vmalloc space, ioremap space, kmap space,
and probably a bunch of things I'm forgetting).

There's MUCH more space available on 64-bit and I'm sure we can find
32MB to allow us to map 16 * 2MB folios.  We can even make it easy and
always map on 2MB boundaries.  We might get into A Bit Of Trouble if
we decide that we want to map x86 1GB pages or ARM 512MB (I think ARM
actually goes up to 4TB theoretically).
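
To make the arithmetic concrete, here is a rough sketch of what fixed 2MB
slots would look like. All names here (EPH_SLOT_SHIFT, EPH_NR_SLOTS,
eph_slot_addr(), eph_slot_index()) are made up for illustration and don't
exist in the branch or in the kernel, and it's plain userspace C rather
than kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants, chosen to match the numbers above. */
#define EPH_SLOT_SHIFT	21				/* 2MB per slot */
#define EPH_SLOT_SIZE	(UINT64_C(1) << EPH_SLOT_SHIFT)
#define EPH_NR_SLOTS	16				/* slots in the region */
#define EPH_REGION_SIZE	(EPH_NR_SLOTS * EPH_SLOT_SIZE)	/* 32MB in total */

/* Virtual address of slot idx, given a 2MB-aligned region base. */
static inline uint64_t eph_slot_addr(uint64_t base, unsigned int idx)
{
	assert((base & (EPH_SLOT_SIZE - 1)) == 0);
	assert(idx < EPH_NR_SLOTS);
	return base + ((uint64_t)idx << EPH_SLOT_SHIFT);
}

/* Inverse lookup: which slot does an address inside the region belong to? */
static inline unsigned int eph_slot_index(uint64_t base, uint64_t addr)
{
	assert(addr >= base && addr - base < EPH_REGION_SIZE);
	return (unsigned int)((addr - base) >> EPH_SLOT_SHIFT);
}
```

With 16 slots of 2MB each the region is 16 << 21 bytes = 32MB, and because
every slot starts on a 2MB boundary, the lookup in either direction is just
a shift.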

If we're going this way, we might want to rework
folio_test_partial_kmap() callers to instead ask "what is the mapped
boundary of this folio", which might actually clean them up a bit.


* Re: [Discuss] First steps for ASI (ASI is fast again)
  2025-08-21 12:15   ` Brendan Jackman
  2025-08-22 14:22     ` Lorenzo Stoakes
@ 2025-08-22 16:56     ` Uladzislau Rezki
  2025-08-22 17:20       ` Brendan Jackman
  1 sibling, 1 reply; 9+ messages in thread
From: Uladzislau Rezki @ 2025-08-22 16:56 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Lorenzo Stoakes, peterz, bp, dave.hansen, mingo, tglx, akpm,
	david, derkling, junaids, linux-kernel, linux-mm, reijiw,
	rientjes, rppt, vbabka, x86, yosry.ahmed, Matthew Wilcox,
	Liam Howlett, Kirill A. Shutemov, Harry Yoo, Jann Horn,
	Pedro Falcato, Andy Lutomirski, Josh Poimboeuf, Kees Cook

On Thu, Aug 21, 2025 at 12:15:04PM +0000, Brendan Jackman wrote:
> On Thu Aug 21, 2025 at 8:55 AM UTC, Lorenzo Stoakes wrote:
> > +cc Matthew for page cache side
> > +cc Other memory mapping folks for mapping side
> > +cc various x86 folks for x86 side
> > +cc Kees for security side of things
> >
> > On Tue, Aug 12, 2025 at 05:31:09PM +0000, Brendan Jackman wrote:
> >> .:: Intro
> >>
> >> Following up to the plan I posted at [0], I've now prepared an up-to-date ASI
> >> branch that demonstrates a technique for solving the page cache performance
> >> devastation I described in [1]. The branch is at [5].
> >
> > > Have looked through your branch at [5], note that the exit_mmap() code is
> > > changing very soon, see [ljs0]. Also with regard to PGD syncing, Harry introduced
> > > a hotfix series recently [ljs1] to address issues around this, generalising the PGD
> > > sync code, which may be usefully relevant to your series.
> >
> > [ljs0]:https://lore.kernel.org/linux-mm/20250815191031.3769540-1-Liam.Howlett@oracle.com/
> > [ljs1]:https://lore.kernel.org/linux-mm/20250818020206.4517-1-harry.yoo@oracle.com/
> 
> Thanks, this is useful info.
> 
> >>
> >> The goal of this prototype is to increase confidence that ASI is viable as a
> >> broad solution for CPU vulnerabilities. (If the community still has to develop
> >> and maintain new mitigations for every individual vuln, because ASI only works
> >> for certain use-cases, then ASI isn't super attractive given its complexity
> >> burden).
> >>
> >> The biggest gap for establishing that confidence was that Google's deployment
> >> still only uses ASI for KVM workloads, not bare-metal processes. And indeed the
> >> page cache turned out to be a massive issue that Google just hasn't run up
> >> against yet internally.
> >>
> >> .:: The "ephmap"
> >>
> >> I won't re-hash the details of the problem here (see [1]) but in short: file
> >> pages aren't mapped into the physmap as seen from ASI's restricted address space.
> >> This causes a major overhead when e.g. read()ing files. The solution we've
> >> always envisaged (and which I very hastily tried to describe at LSF/MM/BPF this
> >> year) was to simply stop read() etc from touching the physmap.
> >>
> >> This is achieved in this prototype by a mechanism that I've called the "ephmap".
> >> The ephmap is a special region of the kernel address space that is local to the
> >> mm (much like the "proclocal" idea from 2019 [2]). Users of the ephmap API can
> >> allocate a subregion of this, and provide pages that get mapped into their
> >> subregion. These subregions are CPU-local. This means that it's cheap to tear
> >> these mappings down, so they can be removed immediately after use (eph =
> >> "ephemeral"), eliminating the need for complex/costly tracking data structures.
> >
> > OK I had a bunch of questions here but looked at the code :)
> >
> > So the idea is we have a per-CPU buffer that is equal to the size of the largest
> > possible folio, for each process.
> >
> > I wonder by the way if we can cache page tables rather than alloc on bring
> > up/tear down? Or just zap? That could help things.
> 
> Yeah if I'm catching your gist correctly, we have done a bit of this in
> the Google-internal version. In cases where it's fine to fail to map
> stuff (as is the case for ephmap users in this branch) you can just have
> a little pool of pre-allocated pagetables that you can allocate from in
> arbitrary contexts. Maybe the ALLOC_TRYLOCK stuff could also be useful
> here, I haven't explored that.
> 
> >>
> >> (You might notice the ephmap is extremely similar to kmap_local_page() - see the
> >> commit that introduces it ("x86: mm: Introduce the ephmap") for discussion).
> >
> > I do wonder if we need to have a separate kmap thing or whether we can just
> > adjust what already exists?
> 
> Yeah, I also wondered this. I think we could potentially just change the
> semantics of kmap_local_page() to suit ASI's needs, but I'm not really
> clear if that's consistent with the design or if there are perf
> concerns regarding its existing usecase. I am hoping once we start to
> get the more basic ASI stuff in, this will be a topic that will interest
> the right people, and I'll be able to get some useful input...
> 
> > Presumably we will restrict ASI support to 64-bit kernels only (starting with
> > and perhaps only for x86-64), so we can avoid the highmem bs.
> 
> Yep.
> 
> >>
> >> The ephmap can then be used for accessing file pages. It's also a generic
> >> mechanism for accessing sensitive data, for example it could be used for
> >> zeroing sensitive pages, or if necessary for copy-on-write of user pages.
> >>
> >> .:: State of the branch
> >>
> >> The branch contains:
> >>
> >> - A rebased version of my "ASI integration for the page allocator" RFC [3]. (Up
> >>   to "mm/page_alloc: Add support for ASI-unmapping pages")
> >> - The rest of ASI's basic functionality (up to "mm: asi: Stop ignoring asi=on
> >>   cmdline flag")
> >> - Some test and observability conveniences (up to "mm: asi: Add a tracepoint for
> >>   ASI page faults")
> >> - A prototype of the new performance improvements (the remainder of the
> >>   branch).
> >>
> >> There's a gradient of quality where the earlier patches are closer to "complete"
> >> and the later ones are increasingly messy and hacky. Comments and commit message
> >> describe lots of the hacky elements but the most important things are:
> >>
> >> 1. The logic to take advantage of the ephmap is stuck directly into mm/shmem.c.
> >>    This is just a shortcut to make its behaviour obvious. Since tmpfs is the
> >>    most extreme case of the read/write slowdown this should give us some idea of
> >>    the performance improvements but it obviously hides a lot of important
> >>    complexity wrt how this would be integrated "for real".
> >
> > Right, at what level do you plan to put the 'real' stuff?
> >
> > generic_file_read_iter() + equivalent or something like this? But then you'd
> > miss some fs obv., so I guess filemap_read()?
> 
> > Yeah, just putting it into this generic stuff seemed like the most
> obvious way, but I was also hoping there could be some more general way
> to integrate it into the page cache or even something like the iov
> system. I did not see anything like this yet, but I don't think I've
> done the full quota of code-gazing that I'd need to come up with the
> best idea here. (Also maybe the solution becomes obvious if I can find
> the right pair of eyes).
> 
> Anyway, my hope is that the number of filesystems that are both a) very
> special implementation-wise and b) dear to the hearts of
> performance-sensitive users is quite small, so maybe just injecting into
> the right pre-existing filemap.c helpers, plus one or two
> filesystem-specific additions, already gets us almost all the way there.
> 
> >>
> >> 2. The ephmap implementation is extremely stupid. It only works for the simple
> >>    shmem usecase. I don't think this is really important though, whatever we end
> >>    up with needs to be very simple, and it's not even clear that we actually
> >>    want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
> >>    kmap_local_page() itself).
> >
> > > Right just testing stuff out, fair enough. Obviously not an upstreamable thing
> > but sort of test case right?
> 
> Yeah exactly. 
> 
> Maybe worth adding here that I explored just using vmalloc's allocator
> for this. My experience was that despite looking quite nicely optimised
> re avoiding synchronisation, just the simple fact of traversing its data
> structures is too slow for this usecase (at least, it did poorly on my
> super-sensitive FIO benchmark setup).
> 
Could you please elaborate here? Which test case is it, and what is the
problem there?

You can fragment the main KVA space where we use an rb-tree to manage
free blocks. But the question is: how important are this use case and
workload for you?

Thank you!

--
Uladzislau Rezki



* Re: [Discuss] First steps for ASI (ASI is fast again)
  2025-08-22 16:56     ` Uladzislau Rezki
@ 2025-08-22 17:20       ` Brendan Jackman
  2025-08-25  9:00         ` Uladzislau Rezki
  0 siblings, 1 reply; 9+ messages in thread
From: Brendan Jackman @ 2025-08-22 17:20 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Lorenzo Stoakes, peterz, bp, dave.hansen, mingo, tglx, akpm,
	david, derkling, junaids, linux-kernel, linux-mm, reijiw,
	rientjes, rppt, vbabka, x86, yosry.ahmed, Matthew Wilcox,
	Liam Howlett, Kirill A. Shutemov, Harry Yoo, Jann Horn,
	Pedro Falcato, Andy Lutomirski, Josh Poimboeuf, Kees Cook

On Fri Aug 22, 2025 at 4:56 PM UTC, Uladzislau Rezki wrote:
>> >> 2. The ephmap implementation is extremely stupid. It only works for the simple
>> >>    shmem usecase. I don't think this is really important though, whatever we end
>> >>    up with needs to be very simple, and it's not even clear that we actually
>> >>    want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
>> >>    kmap_local_page() itself).
>> >
>> > Right just testing stuff out, fair enough. Obviously not an upstreamable thing
>> > but sort of test case right?
>> 
>> Yeah exactly. 
>> 
>> Maybe worth adding here that I explored just using vmalloc's allocator
>> for this. My experience was that despite looking quite nicely optimised
>> re avoiding synchronisation, just the simple fact of traversing its data
>> structures is too slow for this usecase (at least, it did poorly on my
>> super-sensitive FIO benchmark setup).
>> 
> Could you please elaborate here? Which test case is it, and what is the
> problem there?

What I'm trying to do here is allocate some virtual space, map some
memory into it, read it through that mapping, then tear it down again.
The test case was an FIO benchmark reading 4k blocks from tmpfs, which I
think is a pretty tight loop. Maybe this is the kinda thing where the
syscall overhead is pretty significant, so that it's an unrealistic
workload, I'm not too sure. But it was a nice way to get a maximal
measure of the ASI perf hit on filesystem access.

I didn't make careful notes but I vaguely remember I was seeing
something like 10% hits to this workload that I attributed to the
vmalloc calls based on profiling with perf.

I didn't interpret this as "vmalloc is bad" but rather "this is an abuse
of vmalloc". Allocating anything at all for this usecase is quite
unfortunate really.

Anyway, the good news is I don't think we actually need a general
purpose allocator here. I think we can just have something very simple,
stack based and completely CPU-local. I just tried vmalloc() at the
beginning coz it was the hammer I happened to be holding at the time!
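
To give a feel for what I mean by "very simple, stack based and completely
CPU-local", here is a rough userspace model. None of these names
(struct eph_stack, eph_map(), eph_unmap()) are from the branch, and the
depth of 16 is just borrowed from the kmap_local number mentioned elsewhere
in the thread. A real version would install and clear page table entries
where this one only records a pointer.

```c
#include <assert.h>
#include <stddef.h>

#define EPH_NR_SLOTS 16	/* hypothetical depth */

/* One of these would exist per CPU; modelled here as a plain struct. */
struct eph_stack {
	const void *mapped[EPH_NR_SLOTS];	/* what each slot currently maps */
	int depth;				/* number of live mappings */
};

/*
 * Push a mapping. Returns the slot index, or -1 if the stack is full, in
 * which case the caller has to fall back (e.g. take an asi_exit()).
 */
static int eph_map(struct eph_stack *s, const void *page)
{
	if (s->depth == EPH_NR_SLOTS)
		return -1;
	s->mapped[s->depth] = page;
	return s->depth++;
}

/* Pop a mapping. Slots must be released in reverse order of allocation. */
static void eph_unmap(struct eph_stack *s, int slot)
{
	assert(slot == s->depth - 1);
	s->mapped[slot] = NULL;
	s->depth--;
}
```

Allocation and free are a bounds check plus an increment or decrement, with
no shared data structure to walk and nothing to synchronise, and the CoW
case just becomes two pushes (src, dst) followed by two pops.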

> You can fragment the main KVA space where we use an rb-tree to manage
> free blocks. But the question is: how important are this use case and
> workload for you?


* Re: [Discuss] First steps for ASI (ASI is fast again)
  2025-08-22 17:20       ` Brendan Jackman
@ 2025-08-25  9:00         ` Uladzislau Rezki
  0 siblings, 0 replies; 9+ messages in thread
From: Uladzislau Rezki @ 2025-08-25  9:00 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Uladzislau Rezki, Lorenzo Stoakes, peterz, bp, dave.hansen, mingo,
	tglx, akpm, david, derkling, junaids, linux-kernel, linux-mm,
	reijiw, rientjes, rppt, vbabka, x86, yosry.ahmed, Matthew Wilcox,
	Liam Howlett, Kirill A. Shutemov, Harry Yoo, Jann Horn,
	Pedro Falcato, Andy Lutomirski, Josh Poimboeuf, Kees Cook

On Fri, Aug 22, 2025 at 05:20:28PM +0000, Brendan Jackman wrote:
> On Fri Aug 22, 2025 at 4:56 PM UTC, Uladzislau Rezki wrote:
> >> >> 2. The ephmap implementation is extremely stupid. It only works for the simple
> >> >>    shmem usecase. I don't think this is really important though, whatever we end
> >> >>    up with needs to be very simple, and it's not even clear that we actually
> >> >>    want a whole new subsystem anyway. (e.g. maybe it's better to just adapt
> >> >>    kmap_local_page() itself).
> >> >
> >> > Right just testing stuff out, fair enough. Obviously not an upstreamable thing
> >> > but sort of test case right?
> >> 
> >> Yeah exactly. 
> >> 
> >> Maybe worth adding here that I explored just using vmalloc's allocator
> >> for this. My experience was that despite looking quite nicely optimised
> >> re avoiding synchronisation, just the simple fact of traversing its data
> >> structures is too slow for this usecase (at least, it did poorly on my
> >> super-sensitive FIO benchmark setup).
> >> 
> > > Could you please elaborate here? Which test case is it, and what is the
> > > problem there?
> 
> What I'm trying to do here is allocate some virtual space, map some
> memory into it, read it through that mapping, then tear it down again.
> The test case was an FIO benchmark reading 4k blocks from tmpfs, which I
> think is a pretty tight loop. Maybe this is the kinda thing where the
> syscall overhead is pretty significant, so that it's an unrealistic
> workload, I'm not too sure. But it was a nice way to get a maximal
> measure of the ASI perf hit on filesystem access.
> 
> I didn't make careful notes but I vaguely remember I was seeing
> something like 10% hits to this workload that I attributed to the
> vmalloc calls based on profiling with perf.
> 
If you could post perf profiling data for your workload, that would be
more helpful. At least I could figure out where the most cycles are
consumed in the vmalloc path.

Thanks!

--
Uladzislau Rezki



end of thread, other threads:[~2025-08-25  9:01 UTC | newest]

Thread overview: 9+ messages
2025-08-12 17:31 [Discuss] First steps for ASI (ASI is fast again) Brendan Jackman
2025-08-19 18:03 ` Brendan Jackman
2025-08-21  8:55 ` Lorenzo Stoakes
2025-08-21 12:15   ` Brendan Jackman
2025-08-22 14:22     ` Lorenzo Stoakes
2025-08-22 17:18       ` Matthew Wilcox
2025-08-22 16:56     ` Uladzislau Rezki
2025-08-22 17:20       ` Brendan Jackman
2025-08-25  9:00         ` Uladzislau Rezki
