public inbox for linux-coco@lists.linux.dev
From: "Kalra, Ashish" <ashish.kalra@amd.com>
To: Dave Hansen <dave.hansen@intel.com>,
	Sean Christopherson <seanjc@google.com>
Cc: tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	peterz@infradead.org, thomas.lendacky@amd.com,
	herbert@gondor.apana.org.au, davem@davemloft.net,
	ardb@kernel.org, pbonzini@redhat.com, aik@amd.com,
	Michael.Roth@amd.com, KPrateek.Nayak@amd.com,
	Tycho.Andersen@amd.com, Nathan.Fontenot@amd.com,
	jackyli@google.com, pgonda@google.com, rientjes@google.com,
	jacobhxu@google.com, xin@zytor.com,
	pawan.kumar.gupta@linux.intel.com, babu.moger@amd.com,
	dyoung@redhat.com, nikunj@amd.com, john.allen@amd.com,
	darwi@linutronix.de, linux-kernel@vger.kernel.org,
	linux-crypto@vger.kernel.org, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
Subject: Re: [PATCH v2 3/7] x86/sev: add support for RMPOPT instruction
Date: Mon, 16 Mar 2026 14:03:23 -0500	[thread overview]
Message-ID: <cdedb126-777a-4e40-a5a5-93aa5dbc38aa@amd.com> (raw)
In-Reply-To: <5102edd8-8eaa-4688-b3f7-3004c4cbc8f3@intel.com>

Hello Dave,

On 3/11/2026 5:20 PM, Dave Hansen wrote:
> On 3/11/26 14:24, Kalra, Ashish wrote:
> ...
>> There are two active SNP VMs here; one SNP VM is being terminated while the other is still running. Both VMs are configured with 100GB of guest RAM:
>>
>> When this loop is executed when the SNP guest terminates:
>>
>> [  232.789187] SEV-SNP: RMPOPT execution time 391609638 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~391 ms
>>
>> [  234.647462] SEV-SNP: RMPOPT execution time 457933019 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~457 ms
> 
> That's better, but it's not quite what I am looking for.
> 
> The most important case (IMNHO) is when RMPOPT falls flat on its face:
> it tries to optimize the full 2TB of memory and manages to optimize nothing.
> 
> I doubt that two 100GB VMs will get close to that case. It's
> theoretically possible, but unlikely.
> 
> You also didn't mention 4k vs. 2M vs. 1G mappings.
> 
>> Now, there are a couple of additional RMPOPT optimizations which can be applied to this loop : 
>>
>> 1). RMPOPT can skip the bulk of its work if another CPU has already optimized that region.
>> The optimal thing may be to optimize all memory on one CPU first, and then let all the others
>> run RMPOPT in parallel.
> 
> Ahh, so the RMP table itself caches the result of the RMPOPT in its 1G
> metadata, then the CPUs can just copy it into their core-local
> optimization table at RMPOPT time?
> 
> That's handy.
> 
> *But*, for the purposes of finding pathological behavior, it's actually
> contrary to what I think I was asking for which was having all 1G pages
> filled with some private memory. If the system was in the state I want
> to see tested, that optimization won't function.

True, in this case RMPOPT will not do any optimizations and system performance will be the worst, but note that for the loop we are considering,
this case will actually have the smallest runtime.
More on this below.

> 
>> [  363.926595] SEV-SNP: RMPOPT execution time 317016656 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~317 ms
>>
>> [  365.415243] SEV-SNP: RMPOPT execution time 369659769 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~369 ms.
>>
>> So, with these two optimizations applied, there is a ~16-20% performance improvement (when the SNP guest terminates) in the execution of this loop,
>> which is executing RMPOPT on up to 2TB of RAM on all CPUs.
>>
>> Any thoughts or feedback on the performance numbers?
> 
> 16-20% isn't horrible, but it isn't really a fundamental change.
> 
> It would also be nice to see elapsed time for each CPU. Having one
> pegged CPU for 400ms and 99 mostly idle ones is way different than
> having 100 pegged CPUs for 400ms.
> 
> That's why I was interested in "how long it takes per-cpu".
> 
> But you could get some pretty good info with your new optimized loop:
> 
>                 start = ktime_get();
> 
>                 for (pa = pa_start; pa < pa_end; pa += PUD_SIZE)
>                         rmpopt() // current CPU
> 
>                 middle = ktime_get();
> 
>                 for (pa = pa_start; pa < pa_end; pa += PUD_SIZE)
>                         on_each_cpu_mask(...) // remote CPUs
> 
>                 end = ktime_get();
> 
> If you do that ^ with a system:
> 
> 	1. full of private memory

Again, in this case RMPOPT fails to do any optimizations, but for the loop we are considering, this case will have the smallest runtime.


> 	2. empty of private memory
> 	3. empty again

In both these cases, RMPOPT does the best optimizations for system performance, but for the loop we are considering, these cases will have
the longest runtime: RMPOPT has to check *all* the RMP entries in each 1GB region (for every 1G region it is executed on), so each RMPOPT
instruction, and the loop as a whole, takes the maximum time.

Here are the actual numbers: 

These measurements are done with the *new* optimized loop: 

		...
		/* Only one thread per core needs to issue the RMPOPT instruction. */
		for_each_online_cpu(cpu) {
			if (!topology_is_primary_thread(cpu))
				continue;

			cpumask_set_cpu(cpu, cpus);
		}

		...
		start = ktime_get();

		/*
		 * RMPOPT is optimized to skip the bulk of its work if another CPU has
		 * already optimized that region. Optimize all memory on one CPU first,
		 * and then let all the others run RMPOPT in parallel.
		 */
		cpumask_clear_cpu(smp_processor_id(), cpus);

		/* Current CPU first. */
		for (pa = pa_start; pa < pa_end; pa += PUD_SIZE)
			rmpopt((void *)(pa | RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS));

		/* Then the remaining CPUs in parallel. */
		for (pa = pa_start; pa < pa_end; pa += PUD_SIZE) {
			/* Bit zero passes the function to the RMPOPT instruction. */
			on_each_cpu_mask(cpus, rmpopt,
					 (void *)(pa | RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS),
					 true);
		}
		end = ktime_get();

		elapsed_ns = ktime_to_ns(ktime_sub(end, start));
		pr_info("RMPOPT execution time %llu ns for physical address range 0x%016llx - 0x%016llx on all cpus\n",
			elapsed_ns, pa_start, pa_end);
		...

Cases 2 and 3:

When the above loop is executed after SNP is enabled at snp_rmptable_init(), the RMP table does not have any assigned pages, which is
essentially case 2.

So the loop has the worst runtime, as can be seen below:

[   12.961935] SEV-SNP: RMP optimizations enabled on physical address range @1GB alignment [0x0000000000000000 - 0x0000020000000000]
[   13.286659] SEV-SNP: RMPOPT execution time 311135734 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~311 ms.

At this point, I simulated the case you are looking for, where the RAM is full of private memory/assigned pages, essentially case 1.

In other words, I simulated a case where the first 4K page at every 1GB boundary is an assigned page.
This means that RMPOPT will exit immediately, as it finds an assigned page on the first page it checks in every 1GB range, as below:

	...
	for (pfn = 0; pfn < max_pfn; pfn += (1 << (PUD_SHIFT - PAGE_SHIFT)))
		rmp_make_private(pfn, 0, PG_LEVEL_4K, 0, true);
	...
              
And so the RMPOPT instruction itself, and this loop executed after programming the RMP table as above, has the smallest runtime:

[   13.430801] SEV-SNP: RMP optimizations enabled on physical address range @1GB alignment [0x0000000000000000 - 0x0000020000000000]
[   13.539667] SEV-SNP: RMPOPT execution time 95275588 ns for physical address range 0x0000000000000000 - 0x0000020000000000 on all cpus -> ~95 ms.

To summarize, these are the best and worst performance numbers for the loop we are considering.

Best runtime for the loop:
When RMPOPT exits early because it finds an assigned page on the first RMP entry it checks in each 1GB range -> ~95ms.

Worst runtime for the loop:
When RMPOPT does not find any assigned page in the full 1GB range it is checking -> ~311ms.

So looking at this range [95ms - 311ms], we need to decide if we still want to use the kthread approach?

> 
> You'll hopefully see:
> 
> 	1. RMPOPT fall on its face. Worst case scenario (what I want to
> 	   see most)
> 	2. RMPOPT sees great success, but has to scan the RMP at least
> 	   once. Remote CPUs get a free ride on the first CPU's scan.
> 	   Largest (middle-start) vs. (end-middle)/nr_cpus delta.
> 	3. RMPOPT best case. Everything is already optimized.
> 
>> Ideally we should be issuing RMPOPTs to only optimize the 1G regions that contained memory associated with that guest and that should be 
>> significantly less than the whole 2TB RAM range. 
>>
>> But that is something we planned for once 1GB hugetlb guest_memfd support gets merged, and which I believe has a dependency on:
>> 1). in-place conversion for guest_memfd,
>> 2). 2M hugepage support for guest_memfd, and finally
>> 3). 1GB hugeTLB support for guest_memfd.
> 
> It's a no-brainer to do RMPOPT when you have 1GB pages around. You'll
> see zero argument from me.
> 

Yes.

> Doing things per-guest and for smaller pages gets a little bit harder to
> reason about. In the end, this is all about trying to optimize against
> the RMP table which is a global resource. It's going to get wonky if
> RMPOPT is driven purely by guest-local data. There are lots of potential
> pitfalls.
> 
> For now, let's just do it as simply as possible. Get maximum bang for
> our buck with minimal data structures and see how that works out. It
> might end up being a:
> 
> 	queue_delayed_work()
> 
> to do some cleanup a few seconds out after each SNP guest terminates. If
> a bunch of guests terminate all at once it'll at least only do a single
> set of IPIs.

Again, looking at the numbers above, what are your suggestions among:

1). using the kthread approach, OR
2). scheduling the re-optimization for later execution after SNP guest termination via a workqueue, OR
3). using an additional data structure, such as a bitmap tracking 1G pages in guest_memfd, to drive the RMP re-optimizations.

Thanks,
Ashish

Thread overview: 41+ messages
2026-03-02 21:35 [PATCH v2 0/7] Add RMPOPT support Ashish Kalra
2026-03-02 21:35 ` [PATCH v2 1/7] x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag Ashish Kalra
2026-03-02 23:00   ` Dave Hansen
2026-03-05 12:36   ` Borislav Petkov
2026-03-02 21:35 ` [PATCH v2 2/7] x86/sev: add support for enabling RMPOPT Ashish Kalra
2026-03-02 22:32   ` Dave Hansen
2026-03-02 22:55     ` Kalra, Ashish
2026-03-02 23:00       ` Dave Hansen
2026-03-02 23:11         ` Kalra, Ashish
2026-03-02 22:33   ` Dave Hansen
2026-03-06 15:18   ` Borislav Petkov
2026-03-06 15:33     ` Tom Lendacky
2026-03-02 21:36 ` [PATCH v2 3/7] x86/sev: add support for RMPOPT instruction Ashish Kalra
2026-03-02 22:57   ` Dave Hansen
2026-03-02 23:09     ` Kalra, Ashish
2026-03-02 23:15       ` Dave Hansen
2026-03-04 15:56     ` Andrew Cooper
2026-03-04 16:03       ` Dave Hansen
2026-03-25 21:53       ` Kalra, Ashish
2026-03-26  0:40         ` Andrew Cooper
2026-03-26  2:02           ` Kalra, Ashish
2026-03-26  2:14             ` Kalra, Ashish
2026-03-04 15:01   ` Sean Christopherson
2026-03-04 15:25     ` Dave Hansen
2026-03-04 15:32       ` Dave Hansen
2026-03-05  1:40       ` Kalra, Ashish
2026-03-05 19:22         ` Kalra, Ashish
2026-03-05 19:40           ` Dave Hansen
2026-03-11 21:24             ` Kalra, Ashish
2026-03-11 22:20               ` Dave Hansen
2026-03-16 19:03                 ` Kalra, Ashish [this message]
2026-03-18 14:00                   ` Dave Hansen
2026-03-02 21:36 ` [PATCH v2 4/7] x86/sev: Add interface to re-enable RMP optimizations Ashish Kalra
2026-03-02 21:36 ` [PATCH v2 5/7] KVM: guest_memfd: Add cleanup interface for guest teardown Ashish Kalra
2026-03-09  9:01   ` Ackerley Tng
2026-03-10 22:18     ` Kalra, Ashish
2026-03-11  6:00       ` Ackerley Tng
2026-03-11 21:49         ` Kalra, Ashish
2026-03-27 17:16           ` Ackerley Tng
2026-03-02 21:37 ` [PATCH v2 6/7] KVM: SEV: Implement SEV-SNP specific guest cleanup Ashish Kalra
2026-03-02 21:37 ` [PATCH v2 7/7] x86/sev: Add debugfs support for RMPOPT Ashish Kalra
