* [PATCH 0/3] Calculate memory access latency stats
@ 2022-11-15 17:32 Colton Lewis
2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Colton Lewis @ 2022-11-15 17:32 UTC (permalink / raw)
To: kvm
Cc: pbonzini, maz, dmatlack, seanjc, bgardon, oupton, ricarkol,
Colton Lewis
Sample the latency of memory accesses in dirty_log_perf_test and
report summary stats to give a picture of the latency
distribution. Specifically, focus on the right tail with the 50th,
90th, and 99th percentile reported in ns.
This patch depends on my previous dirty_log_perf_test patch adding the
ability to randomize memory accesses. It needs the pRNG to do random
sampling.
https://lore.kernel.org/kvm/20221107182208.479157-1-coltonlewis@google.com/
Colton Lewis (3):
KVM: selftests: Allocate additional space for latency samples
KVM: selftests: Collect memory access latency samples
KVM: selftests: Print summary stats of memory latency distribution
.../selftests/kvm/dirty_log_perf_test.c | 2 +
.../selftests/kvm/include/perf_test_util.h | 2 +
.../selftests/kvm/lib/perf_test_util.c | 103 +++++++++++++++++-
3 files changed, 104 insertions(+), 3 deletions(-)
--
2.38.1.431.g37b22c650d-goog
^ permalink raw reply [flat|nested] 19+ messages in thread* [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples 2022-11-15 17:32 [PATCH 0/3] Calculate memory access latency stats Colton Lewis @ 2022-11-15 17:32 ` Colton Lewis 2023-01-17 20:32 ` Ricardo Koller 2023-01-18 16:49 ` Sean Christopherson 2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis 2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis 2 siblings, 2 replies; 19+ messages in thread From: Colton Lewis @ 2022-11-15 17:32 UTC (permalink / raw) To: kvm Cc: pbonzini, maz, dmatlack, seanjc, bgardon, oupton, ricarkol, Colton Lewis Allocate additional space for latency samples. This has been separated out to call attention to the additional VM memory allocation. The test runs out of physical pages without the additional allocation. The 100 multiple for pages was determined by trial and error. A more well-reasoned calculation would be preferable. Signed-off-by: Colton Lewis <coltonlewis@google.com> --- tools/testing/selftests/kvm/lib/perf_test_util.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c index 137be359b09e..a48904b64e19 100644 --- a/tools/testing/selftests/kvm/lib/perf_test_util.c +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c @@ -38,6 +38,12 @@ static bool all_vcpu_threads_running; static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; +#define SAMPLES_PER_VCPU 1000 +#define SAMPLE_CAPACITY (SAMPLES_PER_VCPU * KVM_MAX_VCPUS) + +/* Store all samples in a flat array so they can be easily sorted later. */ +uint64_t latency_samples[SAMPLE_CAPACITY]; + /* * Continuously write to the first 8 bytes of each page in the * specified region. @@ -122,7 +128,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus, { struct perf_test_args *pta = &perf_test_args; struct kvm_vm *vm; - uint64_t guest_num_pages, slot0_pages = 0; + uint64_t guest_num_pages, sample_pages, slot0_pages = 0; uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src); uint64_t region_end_gfn; int i; @@ -161,7 +167,9 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus, * The memory is also added to memslot 0, but that's a benign side * effect as KVM allows aliasing HVAs in meslots. */ - vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages, + sample_pages = 100 * sizeof(latency_samples) / pta->guest_page_size; + vm = __vm_create_with_vcpus(mode, nr_vcpus, + slot0_pages + guest_num_pages + sample_pages, perf_test_guest_code, vcpus); pta->vm = vm; -- 2.38.1.431.g37b22c650d-goog ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples 2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis @ 2023-01-17 20:32 ` Ricardo Koller 2023-01-18 16:49 ` Sean Christopherson 1 sibling, 0 replies; 19+ messages in thread From: Ricardo Koller @ 2023-01-17 20:32 UTC (permalink / raw) To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton Hi Colton, On Tue, Nov 15, 2022 at 05:32:56PM +0000, Colton Lewis wrote: > Allocate additional space for latency samples. This has been separated > out to call attention to the additional VM memory allocation. The test > runs out of physical pages without the additional allocation. The 100 > multiple for pages was determined by trial and error. A more > well-reasoned calculation would be preferable. > > Signed-off-by: Colton Lewis <coltonlewis@google.com> > --- > tools/testing/selftests/kvm/lib/perf_test_util.c | 12 ++++++++++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c > index 137be359b09e..a48904b64e19 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -38,6 +38,12 @@ static bool all_vcpu_threads_running; > > static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; > > +#define SAMPLES_PER_VCPU 1000 > +#define SAMPLE_CAPACITY (SAMPLES_PER_VCPU * KVM_MAX_VCPUS) > + > +/* Store all samples in a flat array so they can be easily sorted later. */ > +uint64_t latency_samples[SAMPLE_CAPACITY]; > + > /* > * Continuously write to the first 8 bytes of each page in the > * specified region. > @@ -122,7 +128,7 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus, > { > struct perf_test_args *pta = &perf_test_args; > struct kvm_vm *vm; > - uint64_t guest_num_pages, slot0_pages = 0; > + uint64_t guest_num_pages, sample_pages, slot0_pages = 0; > uint64_t backing_src_pagesz = get_backing_src_pagesz(backing_src); > uint64_t region_end_gfn; > int i; > @@ -161,7 +167,9 @@ struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus, > * The memory is also added to memslot 0, but that's a benign side > * effect as KVM allows aliasing HVAs in meslots. > */ > - vm = __vm_create_with_vcpus(mode, nr_vcpus, slot0_pages + guest_num_pages, > + sample_pages = 100 * sizeof(latency_samples) / pta->guest_page_size; I don't think there's any need to guess. The number of accesses is vcpu_args->pages (one access per guest page). So all memory could be allocated dynamically to hold "vcpu_args->pages * sample_sz". > + vm = __vm_create_with_vcpus(mode, nr_vcpus, > + slot0_pages + guest_num_pages + sample_pages, > perf_test_guest_code, vcpus); > > pta->vm = vm; > -- > 2.38.1.431.g37b22c650d-goog > Thanks, Ricardo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples 2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis 2023-01-17 20:32 ` Ricardo Koller @ 2023-01-18 16:49 ` Sean Christopherson 2023-01-26 18:00 ` Colton Lewis 1 sibling, 1 reply; 19+ messages in thread From: Sean Christopherson @ 2023-01-18 16:49 UTC (permalink / raw) To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, bgardon, oupton, ricarkol On Tue, Nov 15, 2022, Colton Lewis wrote: > Allocate additional space for latency samples. This has been separated > out to call attention to the additional VM memory allocation. A blurb in the changelog is sufficient, no need to split allocation and use into two patches. I would actually collapse all three into one. The changes aren't so big that errors will be difficult to bisect, and without the final printing, the other changes are useless for all intents and purposes, i.e. if for some reason we want to revert the sampling, it will be all or nothing. I do think it makes sense to separate the system counter stuff to a separate patch (and land it in generic code), e.g. add helpers to read the system counter from the guest and convert the result to nanoseconds in a separate patch. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples 2023-01-18 16:49 ` Sean Christopherson @ 2023-01-26 18:00 ` Colton Lewis 0 siblings, 0 replies; 19+ messages in thread From: Colton Lewis @ 2023-01-26 18:00 UTC (permalink / raw) To: Sean Christopherson Cc: kvm, pbonzini, maz, dmatlack, bgardon, oupton, ricarkol Sean Christopherson <seanjc@google.com> writes: > On Tue, Nov 15, 2022, Colton Lewis wrote: >> Allocate additional space for latency samples. This has been separated >> out to call attention to the additional VM memory allocation. > A blurb in the changelog is sufficient, no need to split allocation and > use into two > patches. I would actually collapse all three into one. The changes > aren't so big > that errors will be difficult to bisect, and without the final printing, > the other > changes are useless for all intents and purposes, i.e. if for some reason > we want > to revert the sampling, it will be all or nothing. Will do. It's easier to merge commits than split them up. > I do think it makes sense to separate the system counter stuff to a > separate > patch (and land it in generic code), e.g. add helpers to read the system > counter > from the guest and convert the result to nanoseconds in a separate > patch. Will do. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 2/3] KVM: selftests: Collect memory access latency samples 2022-11-15 17:32 [PATCH 0/3] Calculate memory access latency stats Colton Lewis 2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis @ 2022-11-15 17:32 ` Colton Lewis 2023-01-17 20:43 ` Ricardo Koller 2023-01-17 20:48 ` Ricardo Koller 2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis 2 siblings, 2 replies; 19+ messages in thread From: Colton Lewis @ 2022-11-15 17:32 UTC (permalink / raw) To: kvm Cc: pbonzini, maz, dmatlack, seanjc, bgardon, oupton, ricarkol, Colton Lewis Collect memory access latency measured in clock cycles. This introduces a dependency on the timers for ARM and x86. No other architectures are implemented and their samples will all be 0. Because keeping all samples is impractical due to the space required in some cases (pooled memory w/ 64 vcpus would be 64 GB/vcpu * 64 vcpus * 250,000 samples/GB * 8 bytes/sample ~ 8 Gb extra memory just for samples), resevior sampling is used to only keep a small number of samples per vcpu (1000 samples in this patch). Resevoir sampling means despite keeping only a small number of samples, each sample has an equal chance of making it to the resevoir. Simple proofs of this can be found online. This makes the resevoir a good representation of the distribution of samples and enables calculation of reasonably accurate percentiles. All samples are stored in a statically allocated flat array for ease of combining them later. Samples are stored at an offset in this array calculated by the vcpu index (so vcpu 5 sample 10 would be stored at address sample_times + 5 * vcpu_idx + 10). Signed-off-by: Colton Lewis <coltonlewis@google.com> --- .../selftests/kvm/lib/perf_test_util.c | 34 +++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c index a48904b64e19..0311da76bae0 100644 --- a/tools/testing/selftests/kvm/lib/perf_test_util.c +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c @@ -4,6 +4,9 @@ */ #include <inttypes.h> +#if defined(__aarch64__) +#include "aarch64/arch_timer.h" +#endif #include "kvm_util.h" #include "perf_test_util.h" #include "processor.h" @@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; /* Store all samples in a flat array so they can be easily sorted later. */ uint64_t latency_samples[SAMPLE_CAPACITY]; +static uint64_t perf_test_timer_read(void) +{ +#if defined(__aarch64__) + return timer_get_cntct(VIRTUAL); +#elif defined(__x86_64__) + return rdtsc(); +#else +#warn __func__ " is not implemented for this architecture, will return 0" + return 0; +#endif +} + /* * Continuously write to the first 8 bytes of each page in the * specified region. @@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx) int i; struct guest_random_state rand_state = new_guest_random_state(pta->random_seed + vcpu_idx); + uint64_t *latency_samples_offset = latency_samples + SAMPLES_PER_VCPU * vcpu_idx; + uint64_t count_before; + uint64_t count_after; + uint32_t maybe_sample; gva = vcpu_args->gva; pages = vcpu_args->pages; @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx) addr = gva + (page * pta->guest_page_size); - if (guest_random_u32(&rand_state) % 100 < pta->write_percent) + if (guest_random_u32(&rand_state) % 100 < pta->write_percent) { + count_before = perf_test_timer_read(); *(uint64_t *)addr = 0x0123456789ABCDEF; - else + count_after = perf_test_timer_read(); + } else { + count_before = perf_test_timer_read(); READ_ONCE(*(uint64_t *)addr); + count_after = perf_test_timer_read(); + } + + maybe_sample = guest_random_u32(&rand_state) % (i + 1); + if (i < SAMPLES_PER_VCPU) + latency_samples_offset[i] = count_after - count_before; + else if (maybe_sample < SAMPLES_PER_VCPU) + latency_samples_offset[maybe_sample] = count_after - count_before; } GUEST_SYNC(1); -- 2.38.1.431.g37b22c650d-goog ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples 2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis @ 2023-01-17 20:43 ` Ricardo Koller 2023-01-18 16:32 ` Sean Christopherson 2023-01-26 17:58 ` Colton Lewis 2023-01-17 20:48 ` Ricardo Koller 1 sibling, 2 replies; 19+ messages in thread From: Ricardo Koller @ 2023-01-17 20:43 UTC (permalink / raw) To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton On Tue, Nov 15, 2022 at 05:32:57PM +0000, Colton Lewis wrote: > Collect memory access latency measured in clock cycles. > > This introduces a dependency on the timers for ARM and x86. No other > architectures are implemented and their samples will all be 0. > > Because keeping all samples is impractical due to the space required > in some cases (pooled memory w/ 64 vcpus would be 64 GB/vcpu * 64 > vcpus * 250,000 samples/GB * 8 bytes/sample ~ 8 Gb extra memory just > for samples), resevior sampling is used to only keep a small number of nit: reservoir > samples per vcpu (1000 samples in this patch). Didn't see this before my previous comment. But, I guess it still applies: isn't it possible to know the number of events to store? to avoid the "100" obtained via trial and error. > > Resevoir sampling means despite keeping only a small number of > samples, each sample has an equal chance of making it to the > resevoir. Simple proofs of this can be found online. This makes the > resevoir a good representation of the distribution of samples and reservoir > enables calculation of reasonably accurate percentiles. > > All samples are stored in a statically allocated flat array for ease > of combining them later. Samples are stored at an offset in this array > calculated by the vcpu index (so vcpu 5 sample 10 would be stored at > address sample_times + 5 * vcpu_idx + 10). > > Signed-off-by: Colton Lewis <coltonlewis@google.com> > --- > .../selftests/kvm/lib/perf_test_util.c | 34 +++++++++++++++++-- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c > index a48904b64e19..0311da76bae0 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -4,6 +4,9 @@ > */ > #include <inttypes.h> > > +#if defined(__aarch64__) > +#include "aarch64/arch_timer.h" > +#endif > #include "kvm_util.h" > #include "perf_test_util.h" > #include "processor.h" > @@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; > /* Store all samples in a flat array so they can be easily sorted later. */ > uint64_t latency_samples[SAMPLE_CAPACITY]; > > +static uint64_t perf_test_timer_read(void) > +{ > +#if defined(__aarch64__) > + return timer_get_cntct(VIRTUAL); > +#elif defined(__x86_64__) > + return rdtsc(); > +#else > +#warn __func__ " is not implemented for this architecture, will return 0" > + return 0; > +#endif > +} > + > /* > * Continuously write to the first 8 bytes of each page in the > * specified region. > @@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx) > int i; > struct guest_random_state rand_state = > new_guest_random_state(pta->random_seed + vcpu_idx); > + uint64_t *latency_samples_offset = latency_samples + SAMPLES_PER_VCPU * vcpu_idx; > + uint64_t count_before; > + uint64_t count_after; > + uint32_t maybe_sample; > > gva = vcpu_args->gva; > pages = vcpu_args->pages; > @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx) > > addr = gva + (page * pta->guest_page_size); > > - if (guest_random_u32(&rand_state) % 100 < pta->write_percent) > + if (guest_random_u32(&rand_state) % 100 < pta->write_percent) { > + count_before = perf_test_timer_read(); > *(uint64_t *)addr = 0x0123456789ABCDEF; > - else > + count_after = perf_test_timer_read(); > + } else { > + count_before = perf_test_timer_read(); > READ_ONCE(*(uint64_t *)addr); > + count_after = perf_test_timer_read(); "count_before ... ACCESS count_after" could be moved to some macro, e.g.,: t = MEASURE(READ_ONCE(*(uint64_t *)addr)); > + } > + > + maybe_sample = guest_random_u32(&rand_state) % (i + 1); > + if (i < SAMPLES_PER_VCPU) > + latency_samples_offset[i] = count_after - count_before; > + else if (maybe_sample < SAMPLES_PER_VCPU) > + latency_samples_offset[maybe_sample] = count_after - count_before; I would prefer these reservoir sampling details to be in a helper, e.g.,: reservoir_sample_record(t, i); > } > > GUEST_SYNC(1); > -- > 2.38.1.431.g37b22c650d-goog > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples 2023-01-17 20:43 ` Ricardo Koller @ 2023-01-18 16:32 ` Sean Christopherson 2023-01-26 18:00 ` Colton Lewis 2023-01-26 17:58 ` Colton Lewis 1 sibling, 1 reply; 19+ messages in thread From: Sean Christopherson @ 2023-01-18 16:32 UTC (permalink / raw) To: Ricardo Koller Cc: Colton Lewis, kvm, pbonzini, maz, dmatlack, bgardon, oupton On Tue, Jan 17, 2023, Ricardo Koller wrote: > On Tue, Nov 15, 2022 at 05:32:57PM +0000, Colton Lewis wrote: > > @@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; > > /* Store all samples in a flat array so they can be easily sorted later. */ > > uint64_t latency_samples[SAMPLE_CAPACITY]; > > > > +static uint64_t perf_test_timer_read(void) > > +{ > > +#if defined(__aarch64__) > > + return timer_get_cntct(VIRTUAL); > > +#elif defined(__x86_64__) > > + return rdtsc(); > > +#else > > +#warn __func__ " is not implemented for this architecture, will return 0" > > + return 0; > > +#endif > > +} I would prefer to put the guest-side timer helpers into common code, e.g. as guest_read_system_counter(), replacing system_counter_offset_test.c's one-off version. > > /* > > * Continuously write to the first 8 bytes of each page in the > > * specified region. > > @@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx) > > int i; > > struct guest_random_state rand_state = > > new_guest_random_state(pta->random_seed + vcpu_idx); > > + uint64_t *latency_samples_offset = latency_samples + SAMPLES_PER_VCPU * vcpu_idx; "offset" is confusing because the system counter (TSC in x86) has an offset for the guest-perceived value. Maybe just "latencies"? > > + uint64_t count_before; > > + uint64_t count_after; Maybe s/count/time? Yeah, it's technically wrong to call it "time", but "count" is too generic. > > + uint32_t maybe_sample; > > > > gva = vcpu_args->gva; > > pages = vcpu_args->pages; > > @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx) > > > > addr = gva + (page * pta->guest_page_size); > > > > - if (guest_random_u32(&rand_state) % 100 < pta->write_percent) > > + if (guest_random_u32(&rand_state) % 100 < pta->write_percent) { > > + count_before = perf_test_timer_read(); > > *(uint64_t *)addr = 0x0123456789ABCDEF; > > - else > > + count_after = perf_test_timer_read(); > > + } else { > > + count_before = perf_test_timer_read(); > > READ_ONCE(*(uint64_t *)addr); > > + count_after = perf_test_timer_read(); > > "count_before ... ACCESS count_after" could be moved to some macro, > e.g.,: > t = MEASURE(READ_ONCE(*(uint64_t *)addr)); Even better, capture the read vs. write in a local variable to self-document the use of the RNG, then the motivation for reading the system counter inside the if/else-statements goes away. That way we don't need to come up with a name that documents what MEASURE() measures. write = guest_random_u32(&rand_state) % 100 < args->write_percent; time_before = guest_system_counter_read(); if (write) *(uint64_t *)addr = 0x0123456789ABCDEF; else READ_ONCE(*(uint64_t *)addr); time_after = guest_system_counter_read(); > > + } > > + > > + maybe_sample = guest_random_u32(&rand_state) % (i + 1); No need to generate a random number for iterations that always sample. And I think it will make the code easier to follow if there is a single write to the array. The derivation of the index is what's interesting and different, we should use code to highlight that. /* * Always sample early iterations to ensure at least the * number of requested samples is collected. Once the * array has been filled, <here is a comment from Colton * briefly explaining the math>. * if (i < SAMPLES_PER_VCPU) idx = i; else idx = guest_random_u32(&rand_state) % (i + 1); if (idx < SAMPLES_PER_VCPU) latencies[idx] = time_after - time_before; > > + if (i < SAMPLES_PER_VCPU) Would it make sense to let the user configure the number of samples? Seems easy enough and would let the user ultimately decide how much memory to burn on samples. > > + latency_samples_offset[i] = count_after - count_before; > > + else if (maybe_sample < SAMPLES_PER_VCPU) > > + latency_samples_offset[maybe_sample] = count_after - count_before; > > I would prefer these reservoir sampling details to be in a helper, > e.g.,: > reservoir_sample_record(t, i); Heh, I vote to open code the behavior. I dislike fancy names that hide relatively simple logic. IMO, readers won't care how the modulo math provides even distribution, just that it does and that the early iterations always samples to ensure ever bucket is filled. In this case, I can pretty much guarantee that I'd end up spending more time digging into what "reservoir" means than I would to understand the basic flow. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples 2023-01-18 16:32 ` Sean Christopherson @ 2023-01-26 18:00 ` Colton Lewis 2023-01-26 19:07 ` Sean Christopherson 0 siblings, 1 reply; 19+ messages in thread From: Colton Lewis @ 2023-01-26 18:00 UTC (permalink / raw) To: Sean Christopherson Cc: ricarkol, kvm, pbonzini, maz, dmatlack, bgardon, oupton Sean Christopherson <seanjc@google.com> writes: > On Tue, Jan 17, 2023, Ricardo Koller wrote: >> On Tue, Nov 15, 2022 at 05:32:57PM +0000, Colton Lewis wrote: >> > @@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; >> > /* Store all samples in a flat array so they can be easily sorted >> later. */ >> > uint64_t latency_samples[SAMPLE_CAPACITY]; >> > >> > +static uint64_t perf_test_timer_read(void) >> > +{ >> > +#if defined(__aarch64__) >> > + return timer_get_cntct(VIRTUAL); >> > +#elif defined(__x86_64__) >> > + return rdtsc(); >> > +#else >> > +#warn __func__ " is not implemented for this architecture, will >> return 0" >> > + return 0; >> > +#endif >> > +} > I would prefer to put the guest-side timer helpers into common code, e.g. > as > guest_read_system_counter(), replacing system_counter_offset_test.c's > one-off > version. Will do. >> > /* >> > * Continuously write to the first 8 bytes of each page in the >> > * specified region. >> > @@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx) >> > int i; >> > struct guest_random_state rand_state = >> > new_guest_random_state(pta->random_seed + vcpu_idx); >> > + uint64_t *latency_samples_offset = latency_samples + >> SAMPLES_PER_VCPU * vcpu_idx; > "offset" is confusing because the system counter (TSC in x86) has an > offset for > the guest-perceived value. Maybe just "latencies"? Will do. >> > + uint64_t count_before; >> > + uint64_t count_after; > Maybe s/count/time? Yeah, it's technically wrong to call it "time", > but "count" > is too generic. I could say "cycles". >> > + uint32_t maybe_sample; >> > >> > gva = vcpu_args->gva; >> > pages = vcpu_args->pages; >> > @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx) >> > >> > addr = gva + (page * pta->guest_page_size); >> > >> > - if (guest_random_u32(&rand_state) % 100 < pta->write_percent) >> > + if (guest_random_u32(&rand_state) % 100 < pta->write_percent) { >> > + count_before = perf_test_timer_read(); >> > *(uint64_t *)addr = 0x0123456789ABCDEF; >> > - else >> > + count_after = perf_test_timer_read(); >> > + } else { >> > + count_before = perf_test_timer_read(); >> > READ_ONCE(*(uint64_t *)addr); >> > + count_after = perf_test_timer_read(); >> "count_before ... ACCESS count_after" could be moved to some macro, >> e.g.,: >> t = MEASURE(READ_ONCE(*(uint64_t *)addr)); > Even better, capture the read vs. write in a local variable to > self-document the > use of the RNG, then the motivation for reading the system counter inside > the > if/else-statements goes away. That way we don't need to come up with a > name > that documents what MEASURE() measures. > write = guest_random_u32(&rand_state) % 100 < args->write_percent; > time_before = guest_system_counter_read(); > if (write) > *(uint64_t *)addr = 0x0123456789ABCDEF; > else > READ_ONCE(*(uint64_t *)addr); > time_after = guest_system_counter_read(); Couldn't timing before and after the if statement produce bad measurements? We might be including a branch mispredict in our memory access latency and this could happen a lot because it's random so no way for the CPU to predict. >> > + } >> > + >> > + maybe_sample = guest_random_u32(&rand_state) % (i + 1); > No need to generate a random number for iterations that always sample. > And I > think it will make the code easier to follow if there is a single write > to the > array. The derivation of the index is what's interesting and different, > we should > use code to highlight that. > /* > * Always sample early iterations to ensure at least the > * number of requested samples is collected. Once the > * array has been filled, <here is a comment from Colton > * briefly explaining the math>. > * > if (i < SAMPLES_PER_VCPU) > idx = i; > else > idx = guest_random_u32(&rand_state) % (i + 1); > if (idx < SAMPLES_PER_VCPU) > latencies[idx] = time_after - time_before; Will do. >> > + if (i < SAMPLES_PER_VCPU) > Would it make sense to let the user configure the number of samples? > Seems easy > enough and would let the user ultimately decide how much memory to burn > on samples. Theoretically users may wish to tweak the accuracy vs memory use tradeoff. Seemed like a shakey value proposition to me because of diminishing returns to increased accuracy, but I will include an option if you insist. >> > + latency_samples_offset[i] = count_after - count_before; >> > + else if (maybe_sample < SAMPLES_PER_VCPU) >> > + latency_samples_offset[maybe_sample] = count_after - count_before; >> I would prefer these reservoir sampling details to be in a helper, >> e.g.,: >> reservoir_sample_record(t, i); > Heh, I vote to open code the behavior. I dislike fancy names that hide > relatively > simple logic. IMO, readers won't care how the modulo math provides even > distribution, > just that it does and that the early iterations always samples to ensure > ever bucket > is filled. > In this case, I can pretty much guarantee that I'd end up spending more > time > digging into what "reservoir" means than I would to understand the basic > flow. I agree. Logic is simple enough more names will confuse as to what's really happening. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples 2023-01-26 18:00 ` Colton Lewis @ 2023-01-26 19:07 ` Sean Christopherson 0 siblings, 0 replies; 19+ messages in thread From: Sean Christopherson @ 2023-01-26 19:07 UTC (permalink / raw) To: Colton Lewis; +Cc: ricarkol, kvm, pbonzini, maz, dmatlack, bgardon, oupton On Thu, Jan 26, 2023, Colton Lewis wrote: > Sean Christopherson <seanjc@google.com> writes: > > Maybe s/count/time? Yeah, it's technically wrong to call it "time", but > > "count" is too generic. > > I could say "cycles". Works for me. > > > > + uint32_t maybe_sample; > > > > > > > > gva = vcpu_args->gva; > > > > pages = vcpu_args->pages; > > > > @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx) > > > > > > > > addr = gva + (page * pta->guest_page_size); > > > > > > > > - if (guest_random_u32(&rand_state) % 100 < pta->write_percent) > > > > + if (guest_random_u32(&rand_state) % 100 < pta->write_percent) { > > > > + count_before = perf_test_timer_read(); > > > > *(uint64_t *)addr = 0x0123456789ABCDEF; > > > > - else > > > > + count_after = perf_test_timer_read(); > > > > + } else { > > > > + count_before = perf_test_timer_read(); > > > > READ_ONCE(*(uint64_t *)addr); > > > > + count_after = perf_test_timer_read(); > > > > "count_before ... ACCESS count_after" could be moved to some macro, > > > e.g.,: > > > t = MEASURE(READ_ONCE(*(uint64_t *)addr)); > > > Even better, capture the read vs. write in a local variable to > > self-document the > > use of the RNG, then the motivation for reading the system counter > > inside the > > if/else-statements goes away. That way we don't need to come up with a > > name > > that documents what MEASURE() measures. > > > write = guest_random_u32(&rand_state) % 100 < args->write_percent; > > > time_before = guest_system_counter_read(); > > if (write) > > *(uint64_t *)addr = 0x0123456789ABCDEF; > > else > > READ_ONCE(*(uint64_t *)addr); > > time_after = guest_system_counter_read(); > > Couldn't timing before and after the if statement produce bad > measurements? We might be including a branch mispredict in our memory > access latency and this could happen a lot because it's random so no way > for the CPU to predict. Hmm, I was assuming the latency due to a (mispredicted) branch would be in the noise compared to the latency of a VM-Exit needed to handle the fault. On the other hand, adding a common macro would be trivial, it's only the naming that's hard. What if we keep with the "cycles" theme and do as Ricardo suggested? E.g. minus backslashes, this doesn't look awful. #define MEASURE_CYCLES(x) ({ uint64_t start; start = guest_system_counter_read(); x; guest_system_counter_read() - start; }) > > > > + if (i < SAMPLES_PER_VCPU) > > > Would it make sense to let the user configure the number of samples? > > Seems easy enough and would let the user ultimately decide how much memory > > to burn on samples. > > Theoretically users may wish to tweak the accuracy vs memory use > tradeoff. Seemed like a shakey value proposition to me because of > diminishing returns to increased accuracy, but I will include an option > if you insist. It's not so much that I expect users to want better accuracy, it's that I dislike I arbitrary magic numbers. E.g. why 1000 and not 500 or 2000? Especially since the number of accesses _is_ controllable by the user. Hmm, the other thing to consider is that, as proposed, the vast majority of the capacity will go unused, e.g. default is 4, max is 512 (and that should really be tied to KVM's max of 4096). What if we invert the calculation and define the capacity in bytes, and derive the number of samples based on capacity / nr_vcpus? And then push the capacity as high as possible, e.g. I assume there's a threshold where the test will fail because of selftests poor page table management. That way the sampling is as efficient as possible, and the arbitrary capacity comes from limitations within the framework, as opposed to a human defined magic number. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples 2023-01-17 20:43 ` Ricardo Koller 2023-01-18 16:32 ` Sean Christopherson @ 2023-01-26 17:58 ` Colton Lewis 2023-01-26 18:30 ` Ricardo Koller 1 sibling, 1 reply; 19+ messages in thread From: Colton Lewis @ 2023-01-26 17:58 UTC (permalink / raw) To: Ricardo Koller; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton Ricardo Koller <ricarkol@google.com> writes: > nit: reservoir Will fix > Didn't see this before my previous comment. But, I guess it still > applies: isn't it possible to know the number of events to store? to > avoid the "100" obtained via trial and error. That's what I thought I was calculating with sample_pages = sizeof(latency_samples) / pta->guest_page_size. Both values are in bytes so that should give the number of guest pages I need to allocate to hold latency_samples. The 100 is a fudge factor added to the calculation after encountering crashes. Any idea why the math above is incorrect? > reservoir Will fix. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples 2023-01-26 17:58 ` Colton Lewis @ 2023-01-26 18:30 ` Ricardo Koller 0 siblings, 0 replies; 19+ messages in thread From: Ricardo Koller @ 2023-01-26 18:30 UTC (permalink / raw) To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton On Thu, Jan 26, 2023 at 9:58 AM Colton Lewis <coltonlewis@google.com> wrote: > > Ricardo Koller <ricarkol@google.com> writes: > > > > nit: reservoir > > > Will fix > > > > Didn't see this before my previous comment. But, I guess it still > > applies: isn't it possible to know the number of events to store? to > > avoid the "100" obtained via trial and error. > > > That's what I thought I was calculating with sample_pages = > sizeof(latency_samples) / pta->guest_page_size. Both values are in bytes > so that should give the number of guest pages I need to allocate to hold > latency_samples. The 100 is a fudge factor added to the calculation > after encountering crashes. Any idea why the math above is incorrect? > Mm, I don't know to be honest. But I do think it's required to calculate the exact number of bytes needed. Otherwise, 100 might not be enough for a bigger VM with more samples. > > > reservoir > > > Will fix. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples 2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis 2023-01-17 20:43 ` Ricardo Koller @ 2023-01-17 20:48 ` Ricardo Koller 2023-01-26 17:59 ` Colton Lewis 1 sibling, 1 reply; 19+ messages in thread From: Ricardo Koller @ 2023-01-17 20:48 UTC (permalink / raw) To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton On Tue, Nov 15, 2022 at 05:32:57PM +0000, Colton Lewis wrote: > Collect memory access latency measured in clock cycles. > > This introduces a dependency on the timers for ARM and x86. No other > architectures are implemented and their samples will all be 0. > > Because keeping all samples is impractical due to the space required > in some cases (pooled memory w/ 64 vcpus would be 64 GB/vcpu * 64 > vcpus * 250,000 samples/GB * 8 bytes/sample ~ 8 Gb extra memory just > for samples), resevior sampling is used to only keep a small number of > samples per vcpu (1000 samples in this patch). > > Resevoir sampling means despite keeping only a small number of Could you add a reference? I'm trying to understand the algorithm, but I'm not even sure what's the version implemented ("Optimal: Algorithm L"?). > samples, each sample has an equal chance of making it to the > resevoir. Simple proofs of this can be found online. This makes the > resevoir a good representation of the distribution of samples and > enables calculation of reasonably accurate percentiles. > > All samples are stored in a statically allocated flat array for ease > of combining them later. Samples are stored at an offset in this array > calculated by the vcpu index (so vcpu 5 sample 10 would be stored at > address sample_times + 5 * vcpu_idx + 10). > > Signed-off-by: Colton Lewis <coltonlewis@google.com> > --- > .../selftests/kvm/lib/perf_test_util.c | 34 +++++++++++++++++-- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c > index a48904b64e19..0311da76bae0 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -4,6 +4,9 @@ > */ > #include <inttypes.h> > > +#if defined(__aarch64__) > +#include "aarch64/arch_timer.h" > +#endif > #include "kvm_util.h" > #include "perf_test_util.h" > #include "processor.h" > @@ -44,6 +47,18 @@ static struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; > /* Store all samples in a flat array so they can be easily sorted later. */ > uint64_t latency_samples[SAMPLE_CAPACITY]; > > +static uint64_t perf_test_timer_read(void) > +{ > +#if defined(__aarch64__) > + return timer_get_cntct(VIRTUAL); > +#elif defined(__x86_64__) > + return rdtsc(); > +#else > +#warn __func__ " is not implemented for this architecture, will return 0" > + return 0; > +#endif > +} > + > /* > * Continuously write to the first 8 bytes of each page in the > * specified region. > @@ -59,6 +74,10 @@ void perf_test_guest_code(uint32_t vcpu_idx) > int i; > struct guest_random_state rand_state = > new_guest_random_state(pta->random_seed + vcpu_idx); > + uint64_t *latency_samples_offset = latency_samples + SAMPLES_PER_VCPU * vcpu_idx; > + uint64_t count_before; > + uint64_t count_after; > + uint32_t maybe_sample; > > gva = vcpu_args->gva; > pages = vcpu_args->pages; > @@ -75,10 +94,21 @@ void perf_test_guest_code(uint32_t vcpu_idx) > > addr = gva + (page * pta->guest_page_size); > > - if (guest_random_u32(&rand_state) % 100 < pta->write_percent) > + if (guest_random_u32(&rand_state) % 100 < pta->write_percent) { > + count_before = perf_test_timer_read(); > *(uint64_t *)addr = 0x0123456789ABCDEF; > - else > + count_after = perf_test_timer_read(); > + } else { > + count_before = perf_test_timer_read(); > READ_ONCE(*(uint64_t *)addr); > + count_after = perf_test_timer_read(); > + } > + > + maybe_sample = guest_random_u32(&rand_state) % (i + 1); > + if (i < SAMPLES_PER_VCPU) > + latency_samples_offset[i] = count_after - count_before; > + else if (maybe_sample < SAMPLES_PER_VCPU) > + latency_samples_offset[maybe_sample] = count_after - count_before; > } > > GUEST_SYNC(1); > -- > 2.38.1.431.g37b22c650d-goog > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 2/3] KVM: selftests: Collect memory access latency samples 2023-01-17 20:48 ` Ricardo Koller @ 2023-01-26 17:59 ` Colton Lewis 0 siblings, 0 replies; 19+ messages in thread From: Colton Lewis @ 2023-01-26 17:59 UTC (permalink / raw) To: Ricardo Koller; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton Ricardo Koller <ricarkol@google.com> writes: >> Resevoir sampling means despite keeping only a small number of > Could you add a reference? I'm trying to understand the algorithm, but > I'm not even sure what's the version implemented ("Optimal: Algorithm > L"?). I can add an external link to Wikipedia. That was the source I used. The version implemented here is the simple version that Wikipedia names "Algorithm R". ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution 2022-11-15 17:32 [PATCH 0/3] Calculate memory access latency stats Colton Lewis 2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis 2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis @ 2022-11-15 17:32 ` Colton Lewis 2023-01-17 20:45 ` Ricardo Koller 2023-01-18 16:43 ` Sean Christopherson 2 siblings, 2 replies; 19+ messages in thread From: Colton Lewis @ 2022-11-15 17:32 UTC (permalink / raw) To: kvm Cc: pbonzini, maz, dmatlack, seanjc, bgardon, oupton, ricarkol, Colton Lewis Print summary stats of the memory latency distribution in nanoseconds. For every iteration, this prints the minimum, the maximum, and the 50th, 90th, and 99th percentiles. Stats are calculated by sorting the samples taken from all vcpus and picking from the index corresponding with each percentile. The conversion to nanoseconds needs the frequency of the Intel timestamp counter, which is estimated by reading the counter before and after sleeping for 1 second. This is not a pretty trick, but it also exists in vmx_nested_tsc_scaling_test.c Signed-off-by: Colton Lewis <coltonlewis@google.com> --- .../selftests/kvm/dirty_log_perf_test.c | 2 + .../selftests/kvm/include/perf_test_util.h | 2 + .../selftests/kvm/lib/perf_test_util.c | 62 +++++++++++++++++++ 3 files changed, 66 insertions(+) diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c index 202f38a72851..2bc066bba460 100644 --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c @@ -274,6 +274,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) ts_diff = timespec_elapsed(start); pr_info("Populate memory time: %ld.%.9lds\n", ts_diff.tv_sec, ts_diff.tv_nsec); + perf_test_print_percentiles(vm, nr_vcpus); /* Enable dirty logging */ clock_gettime(CLOCK_MONOTONIC, &start); @@ -304,6 +305,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff); pr_info("Iteration %d dirty memory time: %ld.%.9lds\n", iteration, ts_diff.tv_sec, ts_diff.tv_nsec); + perf_test_print_percentiles(vm, nr_vcpus); clock_gettime(CLOCK_MONOTONIC, &start); get_dirty_log(vm, bitmaps, p->slots); diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h index 3d0b75ea866a..ca378c262f12 100644 --- a/tools/testing/selftests/kvm/include/perf_test_util.h +++ b/tools/testing/selftests/kvm/include/perf_test_util.h @@ -47,6 +47,8 @@ struct perf_test_args { extern struct perf_test_args perf_test_args; +void perf_test_print_percentiles(struct kvm_vm *vm, int nr_vcpus); + struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus, uint64_t vcpu_memory_bytes, int slots, enum vm_mem_backing_src_type backing_src, diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c index 0311da76bae0..927d22421f7c 100644 --- a/tools/testing/selftests/kvm/lib/perf_test_util.c +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c @@ -115,6 +115,68 @@ void perf_test_guest_code(uint32_t vcpu_idx) } } +#if defined(__x86_64__) +/* This could be determined with the right sequence of cpuid + * instructions, but that's oddly complicated. + */ +static uint64_t perf_test_intel_timer_frequency(void) +{ + uint64_t count_before; + uint64_t count_after; + uint64_t measured_freq; + uint64_t adjusted_freq; + + count_before = perf_test_timer_read(); + sleep(1); + count_after = perf_test_timer_read(); + + /* Using 1 second implies our units are in Hz already. */ + measured_freq = count_after - count_before; + /* Truncate to the nearest MHz. Clock frequencies are round numbers. */ + adjusted_freq = measured_freq / 1000000 * 1000000; + + return adjusted_freq; +} +#endif + +static double perf_test_cycles_to_ns(double cycles) +{ +#if defined(__aarch64__) + return cycles * (1e9 / timer_get_cntfrq()); +#elif defined(__x86_64__) + static uint64_t timer_frequency; + + if (timer_frequency == 0) + timer_frequency = perf_test_intel_timer_frequency(); + + return cycles * (1e9 / timer_frequency); +#else +#warn __func__ " is not implemented for this architecture, will return 0" + return 0.0; +#endif +} + +/* compare function for qsort */ +static int perf_test_qcmp(const void *a, const void *b) +{ + return *(int *)a - *(int *)b; +} + +void perf_test_print_percentiles(struct kvm_vm *vm, int nr_vcpus) +{ + uint64_t n_samples = nr_vcpus * SAMPLES_PER_VCPU; + + sync_global_from_guest(vm, latency_samples); + qsort(latency_samples, n_samples, sizeof(uint64_t), &perf_test_qcmp); + + pr_info("Latency distribution (ns) = min:%6.0lf, 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n", + perf_test_cycles_to_ns((double)latency_samples[0]), + perf_test_cycles_to_ns((double)latency_samples[n_samples / 2]), + perf_test_cycles_to_ns((double)latency_samples[n_samples * 9 / 10]), + perf_test_cycles_to_ns((double)latency_samples[n_samples * 99 / 100]), + perf_test_cycles_to_ns((double)latency_samples[n_samples - 1])); +} + void perf_test_setup_vcpus(struct kvm_vm *vm, int nr_vcpus, struct kvm_vcpu *vcpus[], uint64_t vcpu_memory_bytes, -- 2.38.1.431.g37b22c650d-goog ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution 2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis @ 2023-01-17 20:45 ` Ricardo Koller 2023-01-26 17:58 ` Colton Lewis 2023-01-18 16:43 ` Sean Christopherson 1 sibling, 1 reply; 19+ messages in thread From: Ricardo Koller @ 2023-01-17 20:45 UTC (permalink / raw) To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton On Tue, Nov 15, 2022 at 05:32:58PM +0000, Colton Lewis wrote: > Print summary stats of the memory latency distribution in > nanoseconds. For every iteration, this prints the minimum, the > maximum, and the 50th, 90th, and 99th percentiles. > > Stats are calculated by sorting the samples taken from all vcpus and > picking from the index corresponding with each percentile. > > The conversion to nanoseconds needs the frequency of the Intel > timestamp counter, which is estimated by reading the counter before > and after sleeping for 1 second. This is not a pretty trick, but it > also exists in vmx_nested_tsc_scaling_test.c > > Signed-off-by: Colton Lewis <coltonlewis@google.com> > --- > .../selftests/kvm/dirty_log_perf_test.c | 2 + > .../selftests/kvm/include/perf_test_util.h | 2 + > .../selftests/kvm/lib/perf_test_util.c | 62 +++++++++++++++++++ > 3 files changed, 66 insertions(+) > > diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c > index 202f38a72851..2bc066bba460 100644 > --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c > +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c > @@ -274,6 +274,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) > ts_diff = timespec_elapsed(start); > pr_info("Populate memory time: %ld.%.9lds\n", > ts_diff.tv_sec, ts_diff.tv_nsec); > + perf_test_print_percentiles(vm, nr_vcpus); > > /* Enable dirty logging */ > clock_gettime(CLOCK_MONOTONIC, &start); > @@ -304,6 +305,7 @@ static void run_test(enum vm_guest_mode mode, void *arg) > vcpu_dirty_total = timespec_add(vcpu_dirty_total, ts_diff); > pr_info("Iteration %d dirty memory time: %ld.%.9lds\n", > iteration, ts_diff.tv_sec, ts_diff.tv_nsec); > + perf_test_print_percentiles(vm, nr_vcpus); > > clock_gettime(CLOCK_MONOTONIC, &start); > get_dirty_log(vm, bitmaps, p->slots); > diff --git a/tools/testing/selftests/kvm/include/perf_test_util.h b/tools/testing/selftests/kvm/include/perf_test_util.h > index 3d0b75ea866a..ca378c262f12 100644 > --- a/tools/testing/selftests/kvm/include/perf_test_util.h > +++ b/tools/testing/selftests/kvm/include/perf_test_util.h > @@ -47,6 +47,8 @@ struct perf_test_args { > > extern struct perf_test_args perf_test_args; > > +void perf_test_print_percentiles(struct kvm_vm *vm, int nr_vcpus); > + > struct kvm_vm *perf_test_create_vm(enum vm_guest_mode mode, int nr_vcpus, > uint64_t vcpu_memory_bytes, int slots, > enum vm_mem_backing_src_type backing_src, > diff --git a/tools/testing/selftests/kvm/lib/perf_test_util.c b/tools/testing/selftests/kvm/lib/perf_test_util.c > index 0311da76bae0..927d22421f7c 100644 > --- a/tools/testing/selftests/kvm/lib/perf_test_util.c > +++ b/tools/testing/selftests/kvm/lib/perf_test_util.c > @@ -115,6 +115,68 @@ void perf_test_guest_code(uint32_t vcpu_idx) > } > } > > +#if defined(__x86_64__) > +/* This could be determined with the right sequence of cpuid > + * instructions, but that's oddly complicated. > + */ > +static uint64_t perf_test_intel_timer_frequency(void) > +{ > + uint64_t count_before; > + uint64_t count_after; > + uint64_t measured_freq; > + uint64_t adjusted_freq; > + > + count_before = perf_test_timer_read(); > + sleep(1); > + count_after = perf_test_timer_read(); > + > + /* Using 1 second implies our units are in Hz already. */ > + measured_freq = count_after - count_before; > + /* Truncate to the nearest MHz. Clock frequencies are round numbers. */ > + adjusted_freq = measured_freq / 1000000 * 1000000; > + > + return adjusted_freq; > +} > +#endif > + > +static double perf_test_cycles_to_ns(double cycles) > +{ > +#if defined(__aarch64__) > + return cycles * (1e9 / timer_get_cntfrq()); > +#elif defined(__x86_64__) > + static uint64_t timer_frequency; > + > + if (timer_frequency == 0) > + timer_frequency = perf_test_intel_timer_frequency(); > + > + return cycles * (1e9 / timer_frequency); > +#else > +#warn __func__ " is not implemented for this architecture, will return 0" > + return 0.0; > +#endif > +} > + > +/* compare function for qsort */ > +static int perf_test_qcmp(const void *a, const void *b) > +{ > + return *(int *)a - *(int *)b; > +} > + > +void perf_test_print_percentiles(struct kvm_vm *vm, int nr_vcpus) > +{ > + uint64_t n_samples = nr_vcpus * SAMPLES_PER_VCPU; > + > + sync_global_from_guest(vm, latency_samples); > + qsort(latency_samples, n_samples, sizeof(uint64_t), &perf_test_qcmp); > + > + pr_info("Latency distribution (ns) = min:%6.0lf, 50th:%6.0lf, 90th:%6.0lf, 99th:%6.0lf, max:%6.0lf\n", > + perf_test_cycles_to_ns((double)latency_samples[0]), > + perf_test_cycles_to_ns((double)latency_samples[n_samples / 2]), > + perf_test_cycles_to_ns((double)latency_samples[n_samples * 9 / 10]), > + perf_test_cycles_to_ns((double)latency_samples[n_samples * 99 / 100]), > + perf_test_cycles_to_ns((double)latency_samples[n_samples - 1])); > +} Latency distribution (ns) = min: 732, 50th: 792, 90th: 901, 99th: ^^^ nit: would prefer to avoid the spaces > + > void perf_test_setup_vcpus(struct kvm_vm *vm, int nr_vcpus, > struct kvm_vcpu *vcpus[], > uint64_t vcpu_memory_bytes, > -- > 2.38.1.431.g37b22c650d-goog > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution 2023-01-17 20:45 ` Ricardo Koller @ 2023-01-26 17:58 ` Colton Lewis 0 siblings, 0 replies; 19+ messages in thread From: Colton Lewis @ 2023-01-26 17:58 UTC (permalink / raw) To: Ricardo Koller; +Cc: kvm, pbonzini, maz, dmatlack, seanjc, bgardon, oupton Ricardo Koller <ricarkol@google.com> writes: > Latency distribution (ns) = min: 732, 50th: 792, 90th: 901, 99th: > ^^^ > nit: would prefer to avoid the spaces The spaces are there so each number has a fixed width and makes the lines easy to visually compare because the numbers are aligned to the same columns. No way to avoid some extra spaces and preserve this property. What if a number is a different length than expected? I chose a uniform width I was confident would always fit a measurement of the max. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution 2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis 2023-01-17 20:45 ` Ricardo Koller @ 2023-01-18 16:43 ` Sean Christopherson 2023-01-26 17:59 ` Colton Lewis 1 sibling, 1 reply; 19+ messages in thread From: Sean Christopherson @ 2023-01-18 16:43 UTC (permalink / raw) To: Colton Lewis; +Cc: kvm, pbonzini, maz, dmatlack, bgardon, oupton, ricarkol On Tue, Nov 15, 2022, Colton Lewis wrote: > Print summary stats of the memory latency distribution in > nanoseconds. For every iteration, this prints the minimum, the > maximum, and the 50th, 90th, and 99th percentiles. > > Stats are calculated by sorting the samples taken from all vcpus and > picking from the index corresponding with each percentile. > > The conversion to nanoseconds needs the frequency of the Intel > timestamp counter, which is estimated by reading the counter before > and after sleeping for 1 second. This is not a pretty trick, but it > also exists in vmx_nested_tsc_scaling_test.c This test shouldn't need to guesstimate the frequency, just use a VM-scoped KVM_GET_TSC_KHZ, which will provide KVM's default TSC frequency, i.e. the host frequency. For hardware with a constant TSC, which is everything modern, that will be as accurate as we can get. For hardware without a constant TSC, well, buy new hardware :-) vmx_nested_tsc_scaling_test.c does the weird sleep() behavior because it's trying to validate that the guest TSC counts at the correct rate, i.e. it is validating KVM_GET_TSC_KHZ to some extent, and so obviously doesn't fully trust its result. For tests that just want to measure time, there's no reason not to trust KVM. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution 2023-01-18 16:43 ` Sean Christopherson @ 2023-01-26 17:59 ` Colton Lewis 0 siblings, 0 replies; 19+ messages in thread From: Colton Lewis @ 2023-01-26 17:59 UTC (permalink / raw) To: Sean Christopherson Cc: kvm, pbonzini, maz, dmatlack, bgardon, oupton, ricarkol Sean Christopherson <seanjc@google.com> writes: > This test shouldn't need to guesstimate the frequency, just use a > VM-scoped > KVM_GET_TSC_KHZ, which will provide KVM's default TSC frequency, i.e. the > host > frequency. For hardware with a constant TSC, which is everything modern, > that > will be as accurate as we can get. For hardware without a constant TSC, > well, buy > new hardware :-) Thanks. That simplifies things. I encountered that number before but was unsure if I could trust it for reasons I no longer remember. ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2023-01-26 19:08 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-11-15 17:32 [PATCH 0/3] Calculate memory access latency stats Colton Lewis 2022-11-15 17:32 ` [PATCH 1/3] KVM: selftests: Allocate additional space for latency samples Colton Lewis 2023-01-17 20:32 ` Ricardo Koller 2023-01-18 16:49 ` Sean Christopherson 2023-01-26 18:00 ` Colton Lewis 2022-11-15 17:32 ` [PATCH 2/3] KVM: selftests: Collect memory access " Colton Lewis 2023-01-17 20:43 ` Ricardo Koller 2023-01-18 16:32 ` Sean Christopherson 2023-01-26 18:00 ` Colton Lewis 2023-01-26 19:07 ` Sean Christopherson 2023-01-26 17:58 ` Colton Lewis 2023-01-26 18:30 ` Ricardo Koller 2023-01-17 20:48 ` Ricardo Koller 2023-01-26 17:59 ` Colton Lewis 2022-11-15 17:32 ` [PATCH 3/3] KVM: selftests: Print summary stats of memory latency distribution Colton Lewis 2023-01-17 20:45 ` Ricardo Koller 2023-01-26 17:58 ` Colton Lewis 2023-01-18 16:43 ` Sean Christopherson 2023-01-26 17:59 ` Colton Lewis
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.