* x86: strange behavior of invlpg
@ 2016-05-14 9:35 Nadav Amit
2016-05-16 9:28 ` Paolo Bonzini
0 siblings, 1 reply; 7+ messages in thread
From: Nadav Amit @ 2016-05-14 9:35 UTC (permalink / raw)
To: kvm
I encountered a strange phenomenon and I would appreciate your sanity check
and opinion. It looks as if an 'invlpg' executed in a VM causes a very broad
flush.
I created a small kvm-unit-test (below) to demonstrate it. The test
touches 50 pages, and then either: (1) runs a full flush, (2) runs invlpg on
an arbitrary (other) address, or (3) runs a memory barrier.
It appears that the execution time of the test is indeed determined by TLB
misses, since the runtime of the memory barrier flavor is considerably lower.
What I find strange is that if I compute the net access time for tests 1 & 2,
by subtracting the time of the flushes, the times are almost identical. I am
aware that invlpg flushes the page-walk caches, but I would still expect the
invlpg flavor to run considerably faster than the full-flush flavor.
Am I missing something?
On my Haswell EP I get the following results:
with invlpg: 948965249
with full flush: 1047927009
invlpg only 127682028
full flushes only 224055273
access net 107691277 --> considerably lower than w/flushes
w/full flush net 823871736
w/invlpg net 821283221 --> almost identical to full-flush net
---
#include "libcflat.h"
#include "fwcfg.h"
#include "vm.h"
#include "smp.h"
#define N_PAGES (50)
#define ITERATIONS (500000)
volatile char buf[N_PAGES * PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));
int main(void)
{
	void *another_addr = (void *)0x50f9000;
	int i, j;
	unsigned long t_start, t_single, t_full, t_single_only, t_full_only,
		      t_access;
	unsigned long cr3;
	char v = 0;

	setup_vm();

	cr3 = read_cr3();

	/* Test 1: invlpg on an unrelated address, then touch all pages */
	t_start = rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		invlpg(another_addr);
		for (j = 0; j < N_PAGES; j++)
			v = buf[PAGE_SIZE * j];
	}
	t_single = rdtsc() - t_start;
	printf("with invlpg: %lu\n", t_single);

	/* Test 2: full TLB flush (CR3 reload), then touch all pages */
	t_start = rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		write_cr3(cr3);
		for (j = 0; j < N_PAGES; j++)
			v = buf[PAGE_SIZE * j];
	}
	t_full = rdtsc() - t_start;
	printf("with full flush: %lu\n", t_full);

	/* Baseline: cost of the invlpg alone */
	t_start = rdtsc();
	for (i = 0; i < ITERATIONS; i++)
		invlpg(another_addr);
	t_single_only = rdtsc() - t_start;
	printf("invlpg only %lu\n", t_single_only);

	/* Baseline: cost of the CR3 reload alone */
	t_start = rdtsc();
	for (i = 0; i < ITERATIONS; i++)
		write_cr3(cr3);
	t_full_only = rdtsc() - t_start;
	printf("full flushes only %lu\n", t_full_only);

	/* Test 3: no flush at all, just a memory barrier */
	t_start = rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		for (j = 0; j < N_PAGES; j++)
			v = buf[PAGE_SIZE * j];
		mb();
	}
	t_access = rdtsc() - t_start;
	printf("access net %lu\n", t_access);

	printf("w/full flush net %lu\n", t_full - t_full_only);
	printf("w/invlpg net %lu\n", t_single - t_single_only);

	(void)v;
	return 0;
}
* Re: x86: strange behavior of invlpg
From: Paolo Bonzini @ 2016-05-16 9:28 UTC (permalink / raw)
To: Nadav Amit, kvm
On 14/05/2016 11:35, Nadav Amit wrote:
> I encountered a strange phenomenon and I would appreciate your sanity check
> and opinion. It looks as if 'invlpg' that runs in a VM causes a very broad
> flush.
>
> I created a small kvm-unit-test (below) to show what I talk about. The test
> touches 50 pages, and then either: (1) runs full flush, (2) runs invlpg to
> an arbitrary (other) address, or (3) runs memory barrier.
>
> It appears that the execution time of the test is indeed determined by TLB
> misses, since the runtime of the memory barrier flavor is considerably lower.
Did you check the performance counters? Another explanation is that
there are no TLB misses, but CR3 writes are optimized in such a way
that they do not incur TLB misses either. (Disclaimer: I didn't check
the performance counters to prove the alternative theory ;)).
> What I find strange is that if I compute the net access time for tests 1 & 2,
> by deducing the time of the flushes, the time is almost identical. I am aware
> that invlpg flushes the page-walk caches, but I would still expect the invlpg
> flavor to run considerably faster than the full-flush flavor.
That's interesting. I guess you're using EPT, because I get very
similar numbers on an Ivy Bridge laptop:
with invlpg: 902,224,568
with full flush: 880,103,513
invlpg only 113,186,461
full flushes only 100,236,620
access net 104,454,125
w/full flush net 779,866,893
w/invlpg net 789,038,107
(commas added for readability).
Out of curiosity I tried making all pages global (patch after my
signature). Both invlpg and the write to CR3 become much faster, but
invlpg is now faster than the full flush, even though in theory it
should be the opposite...
with invlpg: 223,079,661
with full flush: 294,280,788
invlpg only 126,236,334
full flushes only 107,614,525
access net 90,830,503
w/full flush net 186,666,263
w/invlpg net 96,843,327
Thanks for the interesting test!
Paolo
diff --git a/lib/x86/vm.c b/lib/x86/vm.c
index 7ce7bbc..3b9b81a 100644
--- a/lib/x86/vm.c
+++ b/lib/x86/vm.c
@@ -2,6 +2,7 @@
#include "vm.h"
#include "libcflat.h"
+#define PTE_GLOBAL 256
#define PAGE_SIZE 4096ul
#ifdef __x86_64__
#define LARGE_PAGE_SIZE (512 * PAGE_SIZE)
@@ -106,14 +107,14 @@ unsigned long *install_large_page(unsigned long *cr3,
void *virt)
{
return install_pte(cr3, 2, virt,
- phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
+ phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE | PTE_GLOBAL, 0);
}
unsigned long *install_page(unsigned long *cr3,
unsigned long phys,
void *virt)
{
- return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
+ return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_GLOBAL, 0);
}
> Am I missing something?
>
>
> On my Haswell EP I get the following results:
>
> with invlpg: 948965249
> with full flush: 1047927009
> invlpg only 127682028
> full flushes only 224055273
> access net 107691277 --> considerably lower than w/flushes
> w/full flush net 823871736
> w/invlpg net 821283221 --> almost identical to full-flush net
>
> ---
>
>
> #include "libcflat.h"
> #include "fwcfg.h"
> #include "vm.h"
> #include "smp.h"
>
> #define N_PAGES (50)
> #define ITERATIONS (500000)
> volatile char buf[N_PAGES * PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));
>
> int main(void)
> {
> void *another_addr = (void*)0x50f9000;
> int i, j;
> unsigned long t_start, t_single, t_full, t_single_only, t_full_only,
> t_access;
> unsigned long cr3;
> char v = 0;
>
> setup_vm();
>
> cr3 = read_cr3();
>
> t_start = rdtsc();
> for (i = 0; i < ITERATIONS; i++) {
> invlpg(another_addr);
> for (j = 0; j < N_PAGES; j++)
> v = buf[PAGE_SIZE * j];
> }
> t_single = rdtsc() - t_start;
> printf("with invlpg: %lu\n", t_single);
>
> t_start = rdtsc();
> for (i = 0; i < ITERATIONS; i++) {
> write_cr3(cr3);
> for (j = 0; j < N_PAGES; j++)
> v = buf[PAGE_SIZE * j];
> }
> t_full = rdtsc() - t_start;
> printf("with full flush: %lu\n", t_full);
>
> t_start = rdtsc();
> for (i = 0; i < ITERATIONS; i++)
> invlpg(another_addr);
> t_single_only = rdtsc() - t_start;
> printf("invlpg only %lu\n", t_single_only);
>
> t_start = rdtsc();
> for (i = 0; i < ITERATIONS; i++)
> write_cr3(cr3);
> t_full_only = rdtsc() - t_start;
> printf("full flushes only %lu\n", t_full_only);
>
> t_start = rdtsc();
> for (i = 0; i < ITERATIONS; i++) {
> for (j = 0; j < N_PAGES; j++)
> v = buf[PAGE_SIZE * j];
> mb();
> }
> t_access = rdtsc()-t_start;
> printf("access net %lu\n", t_access);
> printf("w/full flush net %lu\n", t_full - t_full_only);
> printf("w/invlpg net %lu\n", t_single - t_single_only);
>
> (void)v;
> return 0;
> }
>
* Re: x86: strange behavior of invlpg
From: Nadav Amit @ 2016-05-16 16:51 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm
Thanks! I appreciate it.
I think your experiment with global paging just corroborates that the
latency is caused by TLB misses. I measured TLB misses (and especially STLB
misses) in other experiments, but not in this one. I will run some more
experiments, specifically to test how AMD behaves.
I should note this is a byproduct of a study I did; it is not as if I was
looking for strange behaviors (no more validation papers for me!).
The strangest thing is that on bare metal I don't see this phenomenon - I
doubt it is a CPU "feature". Once we understand it, at the very least it may
affect the recommended value of tlb_single_page_flush_ceiling, which controls
when the kernel performs a full TLB flush vs. selective flushes.
Nadav
Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 14/05/2016 11:35, Nadav Amit wrote:
>> I encountered a strange phenomenon and I would appreciate your sanity check
>> and opinion. It looks as if 'invlpg' that runs in a VM causes a very broad
>> flush.
>>
>> I created a small kvm-unit-test (below) to show what I talk about. The test
>> touches 50 pages, and then either: (1) runs full flush, (2) runs invlpg to
>> an arbitrary (other) address, or (3) runs memory barrier.
>>
>> It appears that the execution time of the test is indeed determined by TLB
>> misses, since the runtime of the memory barrier flavor is considerably lower.
>
> Did you check the performance counters? Another explanation is that
> there are no TLB misses, but CR3 writes are optimized in such a way
> that they do not incur TLB misses either. (Disclaimer: I didn't check
> the performance counters to prove the alternative theory ;)).
>
>> What I find strange is that if I compute the net access time for tests 1 & 2,
>> by deducing the time of the flushes, the time is almost identical. I am aware
>> that invlpg flushes the page-walk caches, but I would still expect the invlpg
>> flavor to run considerably faster than the full-flush flavor.
>
> That's interesting. I guess you're using EPT because I get very
> similar number on an Ivy Bridge laptop:
>
> with invlpg: 902,224,568
> with full flush: 880,103,513
> invlpg only 113,186,461
> full flushes only 100,236,620
> access net 104,454,125
> w/full flush net 779,866,893
> w/invlpg net 789,038,107
>
> (commas added for readability).
>
> Out of curiosity I tried making all pages global (patch after my
> signature). Both invlpg and write to CR3 become much faster, but
> invlpg now is faster than full flush, even though in theory it
> should be the opposite...
>
> with invlpg: 223,079,661
> with full flush: 294,280,788
> invlpg only 126,236,334
> full flushes only 107,614,525
> access net 90,830,503
> w/full flush net 186,666,263
> w/invlpg net 96,843,327
>
> Thanks for the interesting test!
>
> Paolo
>
> diff --git a/lib/x86/vm.c b/lib/x86/vm.c
> index 7ce7bbc..3b9b81a 100644
> --- a/lib/x86/vm.c
> +++ b/lib/x86/vm.c
> @@ -2,6 +2,7 @@
> #include "vm.h"
> #include "libcflat.h"
>
> +#define PTE_GLOBAL 256
> #define PAGE_SIZE 4096ul
> #ifdef __x86_64__
> #define LARGE_PAGE_SIZE (512 * PAGE_SIZE)
> @@ -106,14 +107,14 @@ unsigned long *install_large_page(unsigned long *cr3,
> void *virt)
> {
> return install_pte(cr3, 2, virt,
> - phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE, 0);
> + phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_PSE | PTE_GLOBAL, 0);
> }
>
> unsigned long *install_page(unsigned long *cr3,
> unsigned long phys,
> void *virt)
> {
> - return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER, 0);
> + return install_pte(cr3, 1, virt, phys | PTE_PRESENT | PTE_WRITE | PTE_USER | PTE_GLOBAL, 0);
> }
>
>
>
>> Am I missing something?
>>
>>
>> On my Haswell EP I get the following results:
>>
>> with invlpg: 948965249
>> with full flush: 1047927009
>> invlpg only 127682028
>> full flushes only 224055273
>> access net 107691277 --> considerably lower than w/flushes
>> w/full flush net 823871736
>> w/invlpg net 821283221 --> almost identical to full-flush net
>>
>> ---
>>
>>
>> #include "libcflat.h"
>> #include "fwcfg.h"
>> #include "vm.h"
>> #include "smp.h"
>>
>> #define N_PAGES (50)
>> #define ITERATIONS (500000)
>> volatile char buf[N_PAGES * PAGE_SIZE] __attribute__ ((aligned (PAGE_SIZE)));
>>
>> int main(void)
>> {
>> void *another_addr = (void*)0x50f9000;
>> int i, j;
>> unsigned long t_start, t_single, t_full, t_single_only, t_full_only,
>> t_access;
>> unsigned long cr3;
>> char v = 0;
>>
>> setup_vm();
>>
>> cr3 = read_cr3();
>>
>> t_start = rdtsc();
>> for (i = 0; i < ITERATIONS; i++) {
>> invlpg(another_addr);
>> for (j = 0; j < N_PAGES; j++)
>> v = buf[PAGE_SIZE * j];
>> }
>> t_single = rdtsc() - t_start;
>> printf("with invlpg: %lu\n", t_single);
>>
>> t_start = rdtsc();
>> for (i = 0; i < ITERATIONS; i++) {
>> write_cr3(cr3);
>> for (j = 0; j < N_PAGES; j++)
>> v = buf[PAGE_SIZE * j];
>> }
>> t_full = rdtsc() - t_start;
>> printf("with full flush: %lu\n", t_full);
>>
>> t_start = rdtsc();
>> for (i = 0; i < ITERATIONS; i++)
>> invlpg(another_addr);
>> t_single_only = rdtsc() - t_start;
>> printf("invlpg only %lu\n", t_single_only);
>>
>> t_start = rdtsc();
>> for (i = 0; i < ITERATIONS; i++)
>> write_cr3(cr3);
>> t_full_only = rdtsc() - t_start;
>> printf("full flushes only %lu\n", t_full_only);
>>
>> t_start = rdtsc();
>> for (i = 0; i < ITERATIONS; i++) {
>> for (j = 0; j < N_PAGES; j++)
>> v = buf[PAGE_SIZE * j];
>> mb();
>> }
>> t_access = rdtsc()-t_start;
>> printf("access net %lu\n", t_access);
>> printf("w/full flush net %lu\n", t_full - t_full_only);
>> printf("w/invlpg net %lu\n", t_single - t_single_only);
>>
>> (void)v;
>> return 0;
>> }
* Re: x86: strange behavior of invlpg
From: Paolo Bonzini @ 2016-05-16 16:56 UTC (permalink / raw)
To: Nadav Amit; +Cc: kvm
On 16/05/2016 18:51, Nadav Amit wrote:
> Thanks! I appreciate it.
>
> I think your experiment with global paging just corroborates that the
> latency is caused by TLB misses. I measured TLB misses (and especially STLB
> misses) in other experiments but not in this one. I will run some more
> experiments, specifically to test how AMD behaves.
I'm curious about AMD too now...
with invlpg: 285,639,427
with full flush: 584,419,299
invlpg only 70,681,128
full flushes only 265,238,766
access net 242,538,804
w/full flush net 319,180,533
w/invlpg net 214,958,299
Roughly the same with and without PTE.G, so AMD behaves as it should.
> I should note this is a byproduct of a study I did, and it is not as if I was
> looking for strange behaviors (no more validation papers for me!).
>
> The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
> it is a CPU “feature”. Once we understand it, the very least it may affect
> the recommended value of “tlb_single_page_flush_ceiling”, that controls when
> the kernel performs full TLB flush vs. selective flushes.
Do you have a kernel module to reproduce the test on bare metal? (/me is
lazy).
Paolo
* Re: x86: strange behavior of invlpg
From: Nadav Amit @ 2016-05-16 19:39 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm
Argh... I don’t get the same behavior in the guest with the module test.
I’ll need some more time to figure it out.
Just a small comment regarding your “global” test: you forgot to set
CR4.PGE.
Once I set it, I get reasonable numbers (excluding the invlpg flavor).
with invlpg: 964431529
with full flush: 268190767
invlpg only 126114041
full flushes only 185971818
access net 111229828
w/full flush net 82218949 --> similar to access net
w/invlpg net 838317488
I’ll be back when I have more understanding of the situation.
Thanks,
Nadav
Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 16/05/2016 18:51, Nadav Amit wrote:
>> Thanks! I appreciate it.
>>
>> I think your experiment with global paging just corroborates that the
>> latency is caused by TLB misses. I measured TLB misses (and especially STLB
>> misses) in other experiments but not in this one. I will run some more
>> experiments, specifically to test how AMD behaves.
>
> I'm curious about AMD too now...
>
> with invlpg: 285,639,427
> with full flush: 584,419,299
> invlpg only 70,681,128
> full flushes only 265,238,766
> access net 242,538,804
> w/full flush net 319,180,533
> w/invlpg net 214,958,299
>
> Roughly the same with and without pte.g. So AMD behaves as it should.
>
>> I should note this is a byproduct of a study I did, and it is not as if I was
>> looking for strange behaviors (no more validation papers for me!).
>>
>> The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
>> it is a CPU “feature”. Once we understand it, the very least it may affect
>> the recommended value of “tlb_single_page_flush_ceiling”, that controls when
>> the kernel performs full TLB flush vs. selective flushes.
>
> Do you have a kernel module to reproduce the test on bare metal? (/me is
> lazy).
>
> Paolo
* Re: x86: strange behavior of invlpg
From: Nadav Amit @ 2016-05-17 4:27 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm
Ok, it seems to be related to the guest/host page sizes: if you run 2MB
pages in a VM on top of 4KB pages in the host, any invlpg in the VM causes
all 2MB guest pages to be flushed.
I’ll try to find time to make sure there is nothing else to it.
Thanks for the assistance, and let me know if you need my hacky tests.
The measurements below are for a VM and for bare metal:

	Host	Guest	Full Flush	Selective Flush
	PGsize	PGsize	(dTLB misses)	(dTLB misses)
	-----------------------------------------------
VM	4KB	4KB	103,008,052	     93,172
	4KB	2MB	102,022,557	102,038,021
	2MB	4KB	103,005,083	      2,888
	2MB	2MB	  4,002,969	      2,556
HOST	4KB	-	 50,000,572	        789
	2MB	-	  1,000,454	        537
Nadav Amit <nadav.amit@gmail.com> wrote:
> Argh... I don’t get the same behavior in the guest with the module test.
> I’ll need some more time to figure it out.
>
> Just a small comment regarding your “global” test: you forgot to set
> CR4.PGE.
>
> Once I set it, I get reasonable numbers (excluding the invlpg flavor).
>
> with invlpg: 964431529
> with full flush: 268190767
> invlpg only 126114041
> full flushes only 185971818
> access net 111229828
> w/full flush net 82218949 —> similar to access net
> w/invlpg net 838317488
>
> I’ll be back when I have more understanding of the situation.
>
> Thanks,
> Nadav
>
>
> Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 16/05/2016 18:51, Nadav Amit wrote:
>>> Thanks! I appreciate it.
>>>
>>> I think your experiment with global paging just corroborates that the
>>> latency is caused by TLB misses. I measured TLB misses (and especially STLB
>>> misses) in other experiments but not in this one. I will run some more
>>> experiments, specifically to test how AMD behaves.
>>
>> I'm curious about AMD too now...
>>
>> with invlpg: 285,639,427
>> with full flush: 584,419,299
>> invlpg only 70,681,128
>> full flushes only 265,238,766
>> access net 242,538,804
>> w/full flush net 319,180,533
>> w/invlpg net 214,958,299
>>
>> Roughly the same with and without pte.g. So AMD behaves as it should.
>>
>>> I should note this is a byproduct of a study I did, and it is not as if I was
>>> looking for strange behaviors (no more validation papers for me!).
>>>
>>> The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
>>> it is a CPU “feature”. Once we understand it, the very least it may affect
>>> the recommended value of “tlb_single_page_flush_ceiling”, that controls when
>>> the kernel performs full TLB flush vs. selective flushes.
>>
>> Do you have a kernel module to reproduce the test on bare metal? (/me is
>> lazy).
>>
>> Paolo
* Re: x86: strange behavior of invlpg
From: Nadav Amit @ 2018-02-15 22:43 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm
Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 16/05/2016 18:51, Nadav Amit wrote:
>> Thanks! I appreciate it.
>>
>> I think your experiment with global paging just corroborates that the
>> latency is caused by TLB misses. I measured TLB misses (and especially STLB
>> misses) in other experiments but not in this one. I will run some more
>> experiments, specifically to test how AMD behaves.
>
> I'm curious about AMD too now...
>
> with invlpg: 285,639,427
> with full flush: 584,419,299
> invlpg only 70,681,128
> full flushes only 265,238,766
> access net 242,538,804
> w/full flush net 319,180,533
> w/invlpg net 214,958,299
>
> Roughly the same with and without pte.g. So AMD behaves as it should.
>
>> I should note this is a byproduct of a study I did, and it is not as if I was
>> looking for strange behaviors (no more validation papers for me!).
>>
>> The strangest thing is that on bare-metal I don’t see this phenomenon - I doubt
>> it is a CPU “feature”. Once we understand it, the very least it may affect
>> the recommended value of “tlb_single_page_flush_ceiling”, that controls when
>> the kernel performs full TLB flush vs. selective flushes.
>
> Do you have a kernel module to reproduce the test on bare metal? (/me is
> lazy).
It came to my mind that I never told you what eventually turned out to be
the issue. (Yes, I know it is a very old thread, but you may still be
interested.)
It turns out that Intel has something called "page fracturing": after the
TLB caches a translation that came from a 2MB guest page and a 4KB host
page, INVLPG ends up flushing the entire TLB.
I guess they need to do it to comply with SDM section 4.10.4.1 (regarding
pages larger than 4 KBytes): "The INVLPG instruction and page faults provide
the same assurances that they provide when a single TLB entry is used: they
invalidate all TLB entries corresponding to the translation specified by the
paging structures."
Thanks again for your help,
Nadav