* [RFC 00/10] KVM: Add TMEM host/guest support
From: Sasha Levin @ 2012-06-06 13:07 UTC
To: avi, mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk
Cc: kvm, Sasha Levin

This patch series adds support for passing TMEM commands between KVM guests
and the host. This opens the possibility to use TMEM cross-guests and
possibly across hosts with RAMster.

Since frontswap was merged in the 3.4 cycle, the kernel now has all facilities
required to work with TMEM. There is no longer a dependency on out of tree
code.

We can split this patch series into two:

 - The guest side, which is basically two shims that proxy mm/cleancache.c
   and mm/frontswap.c requests from the guest back to the host. This is done
   using a new KVM_HC_TMEM hypercall.

 - The host side, which is a rather small shim which connects KVM to zcache.

It's worth noting that this patch series doesn't have any significant logic in
it, and is mostly a collection of shims to pass TMEM commands across hypercalls.

I ran benchmarks using both the "streaming test" proposed by Avi, and some
general fio tests. Since the fio tests showed similar results to the
streaming test, and no anomalies, here is the summary of the streaming tests:

First, trying to stream a 26GB random file without KVM TMEM:
real 7m36.046s
user 0m17.113s
sys 5m23.809s

And with KVM TMEM:
real 7m36.018s
user 0m17.124s
sys 5m28.391s

 - No significant difference.

Now, trying to stream a 16gb file that compresses nicely, first without KVM TMEM:
real 5m10.299s
user 0m11.311s
sys 3m40.139s

And a second run without dropping cache:
real 4m33.951s
user 0m10.869s
sys 3m13.789s

Now, with KVM TMEM:
real 4m55.528s
user 0m11.119s
sys 3m33.243s

And a second run:
real 2m53.713s
user 0m7.971s
sys 2m29.807s

So KVM TMEM shows a nice performance increase once it can store pages on the host.

Sasha Levin (10):
  KVM: reintroduce hc_gpa
  KVM: wire up the TMEM HC
  zcache: export zcache interface
  KVM: add KVM TMEM entries in the appropriate config menu entry
  KVM: bring in general tmem definitions
  zcache: move out client declaration and add a KVM client
  KVM: add KVM TMEM host side interface
  KVM: add KVM TMEM guest support
  KVM: support guest side cleancache
  KVM: support guest side frontswap

 arch/x86/kvm/Kconfig                 |    1 +
 arch/x86/kvm/Makefile                |    2 +
 arch/x86/kvm/tmem/Kconfig            |   43 +++++++++++
 arch/x86/kvm/tmem/Makefile           |    6 ++
 arch/x86/kvm/tmem/cleancache.c       |  120 +++++++++++++++++++++++++++++
 arch/x86/kvm/tmem/frontswap.c        |  139 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/tmem/guest.c            |   95 +++++++++++++++++++++++
 arch/x86/kvm/tmem/guest.h            |   11 +++
 arch/x86/kvm/tmem/host.c             |   78 +++++++++++++++++++
 arch/x86/kvm/tmem/host.h             |   20 +++++
 arch/x86/kvm/tmem/tmem.h             |   62 +++++++++++++++
 arch/x86/kvm/x86.c                   |   13 +++
 drivers/staging/zcache/zcache-main.c |   48 ++++++++++--
 drivers/staging/zcache/zcache.h      |   20 +++++
 include/linux/kvm_para.h             |    1 +
 15 files changed, 652 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/kvm/tmem/Kconfig
 create mode 100644 arch/x86/kvm/tmem/Makefile
 create mode 100644 arch/x86/kvm/tmem/cleancache.c
 create mode 100644 arch/x86/kvm/tmem/frontswap.c
 create mode 100644 arch/x86/kvm/tmem/guest.c
 create mode 100644 arch/x86/kvm/tmem/guest.h
 create mode 100644 arch/x86/kvm/tmem/host.c
 create mode 100644 arch/x86/kvm/tmem/host.h
 create mode 100644 arch/x86/kvm/tmem/tmem.h
 create mode 100644 drivers/staging/zcache/zcache.h

--
1.7.8.6

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
From: Avi Kivity @ 2012-06-06 13:24 UTC
To: Sasha Levin
Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm

On 06/06/2012 04:07 PM, Sasha Levin wrote:
> This patch series adds support for passing TMEM commands between KVM guests
> and the host. This opens the possibility to use TMEM cross-guests and
> possibly across hosts with RAMster.
>
> Since frontswap was merged in the 3.4 cycle, the kernel now has all facilities
> required to work with TMEM. There is no longer a dependency on out of tree
> code.
>
> We can split this patch series into two:
>
>  - The guest side, which is basically two shims that proxy mm/cleancache.c
>    and mm/frontswap.c requests from the guest back to the host. This is done
>    using a new KVM_HC_TMEM hypercall.
>
>  - The host side, which is a rather small shim which connects KVM to zcache.
>
>
> It's worth noting that this patch series doesn't have any significant logic in
> it, and is mostly a collection of shims to pass TMEM commands across hypercalls.
>
> I ran benchmarks using both the "streaming test" proposed by Avi, and some
> general fio tests. Since the fio tests showed similar results to the
> streaming test, and no anomalies, here is the summary of the streaming tests:
>
> First, trying to stream a 26GB random file without KVM TMEM:
> real 7m36.046s
> user 0m17.113s
> sys 5m23.809s
>
> And with KVM TMEM:
> real 7m36.018s
> user 0m17.124s
> sys 5m28.391s

These results give about 47 usec per page system time (seems quite
high), whereas the difference is 0.7 user per page (seems quite low, for
1 or 2 syscalls per page). Can you post a snapshot of kvm_stat while
this is running?

>
>  - No significant difference.
>
> Now, trying to stream a 16gb file that compresses nicely, first without KVM TMEM:
> real 5m10.299s
> user 0m11.311s
> sys 3m40.139s
>
> And a second run without dropping cache:
> real 4m33.951s
> user 0m10.869s
> sys 3m13.789s
>
> Now, with KVM TMEM:
> real 4m55.528s
> user 0m11.119s
> sys 3m33.243s

How is the first run faster? Is it not doing extra work, pushing pages
to the host?

>
> And a second run:
> real 2m53.713s
> user 0m7.971s
> sys 2m29.807s

A nice result, yes.

>
> So KVM TMEM shows a nice performance increase once it can store pages on the host.

How was caching set up? cache=none (in qemu terms) is most
representative, but cache=writeback also allows the host to cache guest
pages, while cache=writeback with cleancache enabled in the host should
give the same effect, but with the extra hypercalls, but with an extra
copy to manage the host pagecache. It would be good to see results for
all three settings.

--
error compiling committee.c: too many arguments to function
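
The per-page figures in the reply above can be reproduced from the first round of timings. A minimal sketch, assuming the "26GB" file is 26 GiB and the guest uses 4 KiB pages (neither is stated in the thread); the variable names are illustrative only:

```python
# Reproduce the per-page system time estimate from the first benchmark round.
# Assumptions (not stated in the thread): "26GB" means 26 GiB, pages are 4 KiB.
PAGE_SIZE = 4096
FILE_BYTES = 26 * 2**30


def to_seconds(minutes, seconds):
    """Convert a `time` style XmY.YYYs reading to seconds."""
    return minutes * 60 + seconds


sys_without_tmem = to_seconds(5, 23.809)  # sys 5m23.809s
sys_with_tmem = to_seconds(5, 28.391)     # sys 5m28.391s

pages = FILE_BYTES // PAGE_SIZE

# Total system time spent per streamed page, without KVM TMEM.
usec_per_page = sys_without_tmem / pages * 1e6

# Extra system time per page when KVM TMEM is enabled.
delta_usec_per_page = (sys_with_tmem - sys_without_tmem) / pages * 1e6

print(f"pages streamed:        {pages}")
print(f"sys time per page:     {usec_per_page:.1f} usec")
print(f"tmem delta per page:   {delta_usec_per_page:.2f} usec")
```

Under those assumptions the two values come out at roughly 47.5 usec and 0.67 usec per page, in line with the "about 47 usec" and "0.7" quoted above.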

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
From: Sasha Levin @ 2012-06-08 13:20 UTC
To: Avi Kivity
Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm

I re-ran benchmarks in a single user environment to get more stable results, increasing the test files to 50gb each.

First, a test of the good case scenario for KVM TMEM - we'll try streaming a file which compresses well but is bigger than the host RAM size:

First, no KVM TMEM, caching=none:

sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 116.309 s, 73.9 MB/s

real 1m56.349s
user 0m0.015s
sys 0m15.671s
sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 116.191 s, 73.9 MB/s

real 1m56.255s
user 0m0.018s
sys 0m15.504s

Now, no KVM TMEM, caching=writeback:

sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 122.894 s, 69.9 MB/s

real 2m2.965s
user 0m0.015s
sys 0m11.025s
sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 110.915 s, 77.4 MB/s

real 1m50.968s
user 0m0.011s
sys 0m10.108s

And finally, KVM TMEM on, caching=none:

sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s

real 1m59.123s
user 0m0.020s
sys 0m29.336s

sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
2048+0 records in
2048+0 records out
8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s

real 0m36.950s
user 0m0.005s
sys 0m35.308s

This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:

kvm statistics

 kvm_exit              1952342   36037
 kvm_entry             1952334   36034
 kvm_hypercall         1710568   33948
 kvm_apic               109027    1319
 kvm_emulate_insn        63745     673
 kvm_mmio                63483     669
 kvm_inj_virq            45899     654
 kvm_apic_accept_irq     45809     654
 kvm_pio                 18445      52
 kvm_set_irq             19102      50
 kvm_msi_set_irq         17809      47
 kvm_fpu                   244      18
 kvm_apic_ipi              368       8
 kvm_cr                     70       6
 kvm_userspace_exit        897       5
 kvm_cpuid                  48       5
 vcpu_match_mmio           257       3
 kvm_pic_set_irq          1293       3
 kvm_ioapic_set_irq       1293       3
 kvm_ack_irq                84       1
 kvm_page_fault          60538       0


Now, for the worst case "streaming test". I've tried streaming two files, one which has good compression (zeros), and one full with random bits. Doing two runs for each.

First, the baseline - no KVM TMEM, caching=none:

Zero file:
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 703.502 s, 76.3 MB/s

real 11m43.583s
user 0m0.106s
sys 1m42.075s
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 691.208 s, 77.7 MB/s

real 11m31.284s
user 0m0.100s
sys 1m41.235s

Random file:
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 655.778 s, 80.6 MB/s

real 10m55.847s
user 0m0.107s
sys 1m39.852s
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 652.668 s, 80.9 MB/s

real 10m52.739s
user 0m0.120s
sys 1m39.712s

Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:

Zeros:
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s

real 11m44.536s
user 0m0.088s
sys 2m0.639s
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s

real 11m30.561s
user 0m0.088s
sys 1m57.637s

Random:
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 656.436 s, 80.5 MB/s

real 10m56.480s
user 0m0.034s
sys 3m18.750s
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 658.446 s, 80.2 MB/s

real 10m58.499s
user 0m0.046s
sys 3m23.678s

Next, with KVM TMEM enabled, caching=none:

Zeros:
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 711.754 s, 75.4 MB/s

real 11m51.916s
user 0m0.081s
sys 2m59.952s
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 690.958 s, 77.7 MB/s

real 11m31.102s
user 0m0.082s
sys 3m6.500s

Random:
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 656.378 s, 80.5 MB/s

real 10m56.445s
user 0m0.062s
sys 5m53.236s
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 653.353 s, 80.9 MB/s

real 10m53.404s
user 0m0.066s
sys 5m57.087s


This is a snapshot of kvm_stats while this test was running:

kvm statistics

 kvm_entry              168179   20729
 kvm_exit               168179   20728
 kvm_hypercall          131808   16409
 kvm_apic                17305    2006
 kvm_mmio                10877    1259
 kvm_emulate_insn        10974    1258
 kvm_page_fault           6270     866
 kvm_inj_virq             6532     751
 kvm_apic_accept_irq      6516     751
 kvm_set_irq              4888     536
 kvm_msi_set_irq          4471     536
 kvm_pio                  4714     529
 kvm_userspace_exit        300       2
 vcpu_match_mmio            83       2
 kvm_apic_ipi               69       2
 kvm_pic_set_irq           417       0
 kvm_ioapic_set_irq        417       0
 kvm_fpu                    76       0
 kvm_ack_irq                27       0
 kvm_cr                     24       0
 kvm_cpuid                  16       0


And finally, KVM TMEM enabled, with caching=writeback:

Zeros:
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 710.62 s, 75.5 MB/s

real 11m50.698s
user 0m0.078s
sys 3m29.920s
12800+0 records in
12800+0 records out
53687091200 bytes (54 GB) copied, 686.286 s, 78.2 MB/s

real 11m26.321s
user 0m0.088s
sys 3m25.931s

Random:
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 673.831 s, 78.4 MB/s

real 11m13.883s
user 0m0.047s
sys 4m5.569s
12594+1 records in
12594+1 records out
52824875008 bytes (53 GB) copied, 673.594 s, 78.4 MB/s

real 11m13.619s
user 0m0.056s
sys 4m12.134s

* RE: [RFC 00/10] KVM: Add TMEM host/guest support
From: Dan Magenheimer @ 2012-06-08 16:06 UTC
To: Sasha Levin, Avi Kivity
Cc: mtosatti, gregkh, sjenning, Konrad Wilk, kvm

> From: Sasha Levin [mailto:levinsasha928@gmail.com]
> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>
> I re-ran benchmarks in a single user environment to get more stable results, increasing the test files
> to 50gb each.

Nice results Sasha!

The non-increase in real and the significant increase in sys
demonstrate that tmem should have little or no impact as long
as there are sufficient unused CPU cycles.... since tmem is
most active on I/O bound workloads when there tends to be
lots of idle cpu time, tmem is usually "free".

But if KVM perfectly load balances across the sum of all guests
so that there is little or no cpu idle time (rare but possible),
there will be a measurable impact.

For a true worst case analysis, try running cpus=1. (One can argue
that anyone who runs KVM on a single cpu system deserves what
they get ;-) But, the "WasActive" patch[1] (if adapted slightly
for the KVM-TMEM patch) should eliminate the negative impact on
systime of streaming workloads even on cpus=1.

> From: Avi Kivity [mailto:avi@redhat.com]
> <this comment was on Sasha's first round of benchmarking>
> These results give about 47 usec per page system time (seems quite
> high), whereas the difference is 0.7 user per page (seems quite low, for
> 1 or 2 syscalls per page). Can you post a snapshot of kvm_stat while
> this is running?

Note that the userspace difference is likely all noise.
No tmem/zcache activities should be done in userspace. All
the activities result from either a page fault or kswapd.

Since each streamed page (assuming no WasActive patch) should
result in one hypercall and one lzo1x page compression, I suspect
that 47usec is a good estimate of the sum of those on Sasha's machine.

[1] https://lkml.org/lkml/2012/1/25/300

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
From: Avi Kivity @ 2012-06-11 11:17 UTC
To: Dan Magenheimer
Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On 06/08/2012 07:06 PM, Dan Magenheimer wrote:
>> From: Avi Kivity [mailto:avi@redhat.com]
>> <this comment was on Sasha's first round of benchmarking>
>> These results give about 47 usec per page system time (seems quite
>> high), whereas the difference is 0.7 user per page (seems quite low, for
>> 1 or 2 syscalls per page). Can you post a snapshot of kvm_stat while
>> this is running?
>
> Note that the userspace difference is likely all noise.
> No tmem/zcache activities should be done in userspace. All
> the activities result from either a page fault or kswapd.

s/user/usec/...

> Since each streamed page (assuming no WasActive patch) should
> result in one hypercall and one lzo1x page compression, I suspect
> that 47usec is a good estimate of the sum of those on Sasha's machine.

It's a huge number for a page. The newer results give lower numbers,
but still quite high.

--
error compiling committee.c: too many arguments to function

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
From: Avi Kivity @ 2012-06-11 8:09 UTC
To: Sasha Levin
Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm

On 06/08/2012 04:20 PM, Sasha Levin wrote:
> I re-ran benchmarks in a single user environment to get more stable results, increasing the test files to 50gb each.
>
> First, a test of the good case scenario for KVM TMEM - we'll try streaming a file which compresses well but is bigger than the host RAM size:
>
> First, no KVM TMEM, caching=none:
>
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 116.309 s, 73.9 MB/s
>
> real 1m56.349s
> user 0m0.015s
> sys 0m15.671s
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 116.191 s, 73.9 MB/s
>
> real 1m56.255s
> user 0m0.018s
> sys 0m15.504s
>
> Now, no KVM TMEM, caching=writeback:
>
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 122.894 s, 69.9 MB/s
>
> real 2m2.965s
> user 0m0.015s
> sys 0m11.025s
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 110.915 s, 77.4 MB/s
>
> real 1m50.968s
> user 0m0.011s
> sys 0m10.108s

Strange that system time is lower with cache=writeback.

>
> And finally, KVM TMEM on, caching=none:
>
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
>
> real 1m59.123s
> user 0m0.020s
> sys 0m29.336s
>
> sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> 2048+0 records in
> 2048+0 records out
> 8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
>
> real 0m36.950s
> user 0m0.005s
> sys 0m35.308s

So system time more than doubled compared to non-tmem cache=none. The
overhead per page is 17s / (8589934592/4096) = 8.1usec. Seems quite high.

'perf top' while this is running would be interesting.

>
> This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:
>
> kvm statistics
>
> kvm_exit 1952342 36037
> kvm_entry 1952334 36034
> kvm_hypercall 1710568 33948

In that test, 56k pages/sec were transferred. Why are we seeing only
33k hypercalls/sec? Shouldn't we have two hypercalls/page (one when
evicting a page to make some room, one to read the new page from tmem)?

>
>
> Now, for the worst case "streaming test". I've tried streaming two files, one which has good compression (zeros), and one full with random bits. Doing two runs for each.
>
> First, the baseline - no KVM TMEM, caching=none:
>
> Zero file:
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 703.502 s, 76.3 MB/s
>
> real 11m43.583s
> user 0m0.106s
> sys 1m42.075s
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 691.208 s, 77.7 MB/s
>
> real 11m31.284s
> user 0m0.100s
> sys 1m41.235s
>
> Random file:
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 655.778 s, 80.6 MB/s
>
> real 10m55.847s
> user 0m0.107s
> sys 1m39.852s
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 652.668 s, 80.9 MB/s
>
> real 10m52.739s
> user 0m0.120s
> sys 1m39.712s
>
> Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
>
> Zeros:
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
>
> real 11m44.536s
> user 0m0.088s
> sys 2m0.639s
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
>
> real 11m30.561s
> user 0m0.088s
> sys 1m57.637s

zcache appears not to be helping at all; it's just adding overhead. Is
even the compressed file too large?

overhead = 1.4 usec/page.

>
> Random:
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 656.436 s, 80.5 MB/s
>
> real 10m56.480s
> user 0m0.034s
> sys 3m18.750s
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 658.446 s, 80.2 MB/s
>
> real 10m58.499s
> user 0m0.046s
> sys 3m23.678s

Overhead grows to 7.6 usec/page.

>
> Next, with KVM TMEM enabled, caching=none:
>
> Zeros:
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 711.754 s, 75.4 MB/s
>
> real 11m51.916s
> user 0m0.081s
> sys 2m59.952s
> 12800+0 records in
> 12800+0 records out
> 53687091200 bytes (54 GB) copied, 690.958 s, 77.7 MB/s
>
> real 11m31.102s
> user 0m0.082s
> sys 3m6.500s

Overhead = 6.6 usec/page.

>
> Random:
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 656.378 s, 80.5 MB/s
>
> real 10m56.445s
> user 0m0.062s
> sys 5m53.236s
> 12594+1 records in
> 12594+1 records out
> 52824875008 bytes (53 GB) copied, 653.353 s, 80.9 MB/s
>
> real 10m53.404s
> user 0m0.066s
> sys 5m57.087s

Overhead = 19 usec/page.

This is pretty steep. We have flash storage doing a million iops/sec,
and here you add 19 microseconds to that.

>
>
> This is a snapshot of kvm_stats while this test was running:
>
> kvm statistics
>
> kvm_entry 168179 20729
> kvm_exit 168179 20728
> kvm_hypercall 131808 16409

The last test was running 19k pages/sec, doesn't quite fit with this
measurement. Is the measurement stable or fluctuating?

>
> And finally, KVM TMEM enabled, with caching=writeback:

I'm not sure what the point of this is? You have two host-caching
mechanisms running in parallel, are you trying to increase overhead
while reducing effective cache size?

My conclusion is that the overhead is quite high, but please double
check my numbers, maybe I missed something obvious.

--
error compiling committee.c: too many arguments to function
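
The per-page overhead estimates in the reply above can be re-derived from the quoted `time` and `dd` output. A minimal sketch, assuming 4 KiB pages and pairing the first run of each configuration with the first run of the matching baseline (the thread does not say which runs were paired, so the results land near, not exactly on, the quoted figures); the dictionary layout and names are illustrative only:

```python
# Re-derive per-page sys-time overhead from the worst-case streaming runs.
# Assumptions: 4 KiB pages; first run of each configuration is compared
# against the first baseline run for the same file.
PAGE_SIZE = 4096


def to_seconds(minutes, seconds):
    """Convert a `time` style XmY.YYYs reading to seconds."""
    return minutes * 60 + seconds


# Bytes copied, taken from the dd output for each file.
file_bytes = {"zeros": 53687091200, "random": 52824875008}

# Baseline sys time: no KVM TMEM, caching=none, first run.
baseline_sys = {
    "zeros": to_seconds(1, 42.075),
    "random": to_seconds(1, 39.852),
}

# Sys time with a cache layer enabled, caching=none, first run.
measured_sys = {
    ("guest zcache", "zeros"): to_seconds(2, 0.639),
    ("guest zcache", "random"): to_seconds(3, 18.750),
    ("KVM TMEM", "zeros"): to_seconds(2, 59.952),
    ("KVM TMEM", "random"): to_seconds(5, 53.236),
}

overhead_usec = {}
for (config, data), sys_time in measured_sys.items():
    pages = file_bytes[data] / PAGE_SIZE
    overhead_usec[(config, data)] = (sys_time - baseline_sys[data]) / pages * 1e6

# The good-case estimate uses the 17 s figure exactly as stated in the reply.
good_case_usec = 17.0 / (8589934592 / PAGE_SIZE) * 1e6

for key, value in overhead_usec.items():
    print(f"{key[0]:>12} / {key[1]:<6}: {value:5.2f} usec/page")
print(f"good case (17 s over 8 GiB): {good_case_usec:.2f} usec/page")
```

With this pairing the guest zcache cases give about 1.4 and 7.7 usec/page and the KVM TMEM cases about 5.9 and 19.6 usec/page; pairing the second runs instead shifts each value by a few tenths of a microsecond.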

* Re: [RFC 00/10] KVM: Add TMEM host/guest support
From: Sasha Levin @ 2012-06-11 10:26 UTC
To: Avi Kivity
Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm

On Mon, 2012-06-11 at 11:09 +0300, Avi Kivity wrote:
> On 06/08/2012 04:20 PM, Sasha Levin wrote:
> > I re-ran benchmarks in a single user environment to get more stable results, increasing the test files to 50gb each.
> >
> > First, a test of the good case scenario for KVM TMEM - we'll try streaming a file which compresses well but is bigger than the host RAM size:
> >
> > First, no KVM TMEM, caching=none:
> >
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 116.309 s, 73.9 MB/s
> >
> > real 1m56.349s
> > user 0m0.015s
> > sys 0m15.671s
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 116.191 s, 73.9 MB/s
> >
> > real 1m56.255s
> > user 0m0.018s
> > sys 0m15.504s
> >
> > Now, no KVM TMEM, caching=writeback:
> >
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 122.894 s, 69.9 MB/s
> >
> > real 2m2.965s
> > user 0m0.015s
> > sys 0m11.025s
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 110.915 s, 77.4 MB/s
> >
> > real 1m50.968s
> > user 0m0.011s
> > sys 0m10.108s
>
> Strange that system time is lower with cache=writeback.

Maybe because these pages don't get written out immediately? I don't
have a better guess.

> > And finally, KVM TMEM on, caching=none:
> >
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s
> >
> > real 1m59.123s
> > user 0m0.020s
> > sys 0m29.336s
> >
> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048
> > 2048+0 records in
> > 2048+0 records out
> > 8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s
> >
> > real 0m36.950s
> > user 0m0.005s
> > sys 0m35.308s
>
> So system time more than doubled compared to non-tmem cache=none. The
> overhead per page is 17s / (8589934592/4096) = 8.1usec. Seems quite high.

Right, but consider it didn't increase real time at all.

> 'perf top' while this is running would be interesting.

I'll update later with this.

> > This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going:
> >
> > kvm statistics
> >
> > kvm_exit 1952342 36037
> > kvm_entry 1952334 36034
> > kvm_hypercall 1710568 33948
>
> In that test, 56k pages/sec were transferred. Why are we seeing only
> 33k hypercalls/sec? Shouldn't we have two hypercalls/page (one when
> evicting a page to make some room, one to read the new page from tmem)?

The guest doesn't do eviction at all, in fact - it doesn't know how big
the cache is so even if it wanted to, it couldn't evict pages (the only
thing it does is invalidate pages which have changed in the guest).

This means it only takes one hypercall/page instead of two.

> >
> >
> > Now, for the worst case "streaming test". I've tried streaming two files, one which has good compression (zeros), and one full with random bits. Doing two runs for each.
> >
> > First, the baseline - no KVM TMEM, caching=none:
> >
> > Zero file:
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 703.502 s, 76.3 MB/s
> >
> > real 11m43.583s
> > user 0m0.106s
> > sys 1m42.075s
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 691.208 s, 77.7 MB/s
> >
> > real 11m31.284s
> > user 0m0.100s
> > sys 1m41.235s
> >
> > Random file:
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 655.778 s, 80.6 MB/s
> >
> > real 10m55.847s
> > user 0m0.107s
> > sys 1m39.852s
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 652.668 s, 80.9 MB/s
> >
> > real 10m52.739s
> > user 0m0.120s
> > sys 1m39.712s
> >
> > Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none:
> >
> > Zeros:
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s
> >
> > real 11m44.536s
> > user 0m0.088s
> > sys 2m0.639s
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s
> >
> > real 11m30.561s
> > user 0m0.088s
> > sys 1m57.637s
>
> zcache appears not to be helping at all; it's just adding overhead. Is
> even the compressed file too large?
>
> overhead = 1.4 usec/page.

Correct, I've had to further increase the size of this file so that
zcache would fail here as well. The good case was tested before, here I
wanted to see what will happen with files that wouldn't have much
benefit from both regular caching and zcache.

> >
> > Random:
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 656.436 s, 80.5 MB/s
> >
> > real 10m56.480s
> > user 0m0.034s
> > sys 3m18.750s
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 658.446 s, 80.2 MB/s
> >
> > real 10m58.499s
> > user 0m0.046s
> > sys 3m23.678s
>
> Overhead grows to 7.6 usec/page.
>
> >
> > Next, with KVM TMEM enabled, caching=none:
> >
> > Zeros:
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 711.754 s, 75.4 MB/s
> >
> > real 11m51.916s
> > user 0m0.081s
> > sys 2m59.952s
> > 12800+0 records in
> > 12800+0 records out
> > 53687091200 bytes (54 GB) copied, 690.958 s, 77.7 MB/s
> >
> > real 11m31.102s
> > user 0m0.082s
> > sys 3m6.500s
>
> Overhead = 6.6 usec/page.
>
> >
> > Random:
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 656.378 s, 80.5 MB/s
> >
> > real 10m56.445s
> > user 0m0.062s
> > sys 5m53.236s
> > 12594+1 records in
> > 12594+1 records out
> > 52824875008 bytes (53 GB) copied, 653.353 s, 80.9 MB/s
> >
> > real 10m53.404s
> > user 0m0.066s
> > sys 5m57.087s
> >
> Overhead = 19 usec/page.
>
> This is pretty steep. We have flash storage doing a million iops/sec,
> and here you add 19 microseconds to that.

Might be interesting to test it with flash storage as well...

> >
> >
> > This is a snapshot of kvm_stats while this test was running:
> >
> > kvm statistics
> >
> > kvm_entry 168179 20729
> > kvm_exit 168179 20728
> > kvm_hypercall 131808 16409
>
> The last test was running 19k pages/sec, doesn't quite fit with this
> measurement. Is the measurement stable or fluctuating?

It's pretty stable when running the "zero" pages, but when switching to
random files it somewhat fluctuates.

> >
> > And finally, KVM TMEM enabled, with caching=writeback:
>
> I'm not sure what the point of this is? You have two host-caching
> mechanisms running in parallel, are you trying to increase overhead
> while reducing effective cache size?

I thought that you've asked for this test:

On Wed, 2012-06-06 at 16:24 +0300, Avi Kivity wrote:
> while cache=writeback with cleancache enabled in the host should
> give the same effect, but with the extra hypercalls, but with an extra
> copy to manage the host pagecache. It would be good to see results for all three settings.

> My conclusion is that the overhead is quite high, but please double
> check my numbers, maybe I missed something obvious.

I'm not sure what options I have to lower the overhead here, should I be
using something other than hypercalls to communicate with the host?

I know that there are several things being worked on from zcache
perspective (WasActive, batching, etc), but is there something that
could be done within the scope of kvm-tmem?

It would be interesting to see results for Xen/TMEM and compare
them to these results.
* Re: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-11 10:26 ` Sasha Levin @ 2012-06-11 11:45 ` Avi Kivity 2012-06-11 15:44 ` Dan Magenheimer 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2012-06-11 11:45 UTC (permalink / raw) To: Sasha Levin; +Cc: mtosatti, gregkh, sjenning, dan.magenheimer, konrad.wilk, kvm On 06/11/2012 01:26 PM, Sasha Levin wrote: >> >> Strange that system time is lower with cache=writeback. > > Maybe because these pages don't get written out immediately? I don't > have a better guess. >From the guest point of view, it's the same flow. btw, this is a read, so the difference would be readahead, not write-behind, but the difference in system time is still unexplained. > >> > And finally, KVM TMEM on, caching=none: >> > >> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048 >> > 2048+0 records in >> > 2048+0 records out >> > 8589934592 bytes (8.6 GB) copied, 119.024 s, 72.2 MB/s >> > >> > real 1m59.123s >> > user 0m0.020s >> > sys 0m29.336s >> > >> > sh-4.2# time dd if=test/zero of=/dev/null bs=4M count=2048 >> > 2048+0 records in >> > 2048+0 records out >> > 8589934592 bytes (8.6 GB) copied, 36.8798 s, 233 MB/s >> > >> > real 0m36.950s >> > user 0m0.005s >> > sys 0m35.308s >> >> So system time more than doubled compared to non-tmem cache=none. The >> overhead per page is 17s / (8589934592/4096) = 8.1usec. Seems quite high. > > Right, but consider it didn't increase real time at all. Real time is bounded by disk bandwidth. It's a consideration of course, and all forms of caching increase cpu utilization for the cache miss case, but in this case the overhead is excessive due to the lack of batching and due to compression overhead. > >> 'perf top' while this is running would be interesting. > > I'll update later with this. 
> >> > This is a snapshot of kvm_stats while the 2nd run in the KVM TMEM test was going: >> > >> > kvm statistics >> > >> > kvm_exit 1952342 36037 >> > kvm_entry 1952334 36034 >> > kvm_hypercall 1710568 33948 >> >> In that test, 56k pages/sec were transferred. Why are we seeing only >> 33k hypercalls/sec? Shouldn't we have two hypercalls/page (one when >> evicting a page to make some room, one to read the new page from tmem)? > > The guest doesn't do eviction at all, in fact - it doesn't know how big > the cache is so even if it wanted to, it couldn't evict pages (the only > thing it does is invalidate pages which have changed in the guest). IIUC, when the guest reads a page, it first has to make room in its own pagecache; before dropping a clean page it calls cleancache to dispose of it, which calls a hypercall which compresses and stores it on the host. Next a page is allocated and a cleancache hypercall is made to see if it is in host tmem. So two hypercalls per page, once guest pagecache is full. > > This means it only takes one hypercall/page instead of two. >> > >> > Now, this is with zcache enabled in the guest (not going through KVM TMEM), caching=none: >> > >> > Zeros: >> > 12800+0 records in >> > 12800+0 records out >> > 53687091200 bytes (54 GB) copied, 704.479 s, 76.2 MB/s >> > >> > real 11m44.536s >> > user 0m0.088s >> > sys 2m0.639s >> > 12800+0 records in >> > 12800+0 records out >> > 53687091200 bytes (54 GB) copied, 690.501 s, 77.8 MB/s >> > >> > real 11m30.561s >> > user 0m0.088s >> > sys 1m57.637s >> >> zcache appears not to be helping at all; it's just adding overhead. Is >> even the compressed file too large? >> >> overhead = 1.4 usec/page. > > Correct, I've had to further increase the size of this file so that > zcache would fail here as well. The good case was tested before, here I > wanted to see what will happen with files that wouldn't have much > benefit from both regular caching and zcache. 
Well, zeroes is not a good test for this since it minimizes zcache allocation overhead. >> >> >> Overhead = 19 usec/page. >> >> This is pretty steep. We have flash storage doing a million iops/sec, >> and here you add 19 microseconds to that. > > Might be interesting to test it with flash storage as well... Try http://sg.danny.cz/sg/sdebug26.html. You can use it to emulate a large fast block device without needing tons of RAM (but you can still populate it with nonzero data). If using qemu, try ,aio=native to minimize overhead further. > >> > >> > >> > This is a snapshot of kvm_stats while this test was running: >> > >> > kvm statistics >> > >> > kvm_entry 168179 20729 >> > kvm_exit 168179 20728 >> > kvm_hypercall 131808 16409 >> >> The last test was running 19k pages/sec, doesn't quite fit with this >> measurement. Is the measurement stable or fluctuating? > > It's pretty stable when running the "zero" pages, but when switching to > random files it somewhat fluctuates. Well, weird. > >> > >> > And finally, KVM TMEM enabled, with caching=writeback: >> >> I'm not sure what the point of this is? You have two host-caching >> mechanisms running in parallel, are you trying to increase overhead >> while reducing effective cache size? > > I thought that you've asked for this test: > > On Wed, 2012-06-06 at 16:24 +0300, Avi Kivity wrote: >> while cache=writeback with cleancache enabled in the host should >> give the same effect, but with the extra hypercalls, but with an extra >> copy to manage the host pagecache. It would be good to see results for all three settings. > Ah, so it's a worser worst case. But somehow it's better than cache=none? >> My conclusion is that the overhead is quite high, but please double >> check my numbers, maybe I missed something obvious. > > I'm not sure what options I have to lower the overhead here, should I be > using something other than hypercalls to communicate with the host? 
> > I know that there are several things being worked on from zcache > perspective (WasActive, batching, etc), but is there something that > could be done within the scope of kvm-tmem? > > It would be interesting in seeing results for Xen/TMEM and comparing > them to these results. Batching will drastically reduce the number of hypercalls. A different alternative is to use ballooning to feed the guest free memory so it doesn't need to hypercall at all. Deciding how to divide free memory among the guests is hard (but then so is deciding how to divide tmem memory among guests), and adding dedup on top of that is also hard (ksm? zksm?). IMO letting the guest have the memory and manage it on its own will be much simpler and faster compared to the constant chatting that has to go on if the host manages this memory. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 20+ messages in thread
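As a cross-check of the per-page figures quoted in the message above, here is a minimal Python sketch of the arithmetic. The 17 s of extra system time, the 8589934592-byte transfer, the 233 MB/s rate and the 33948 hypercalls/sec sample all come from the thread; the function names and the PAGE_SIZE constant are illustrative only and not part of the patch series.

```python
PAGE_SIZE = 4096  # bytes per page, matching the 8589934592/4096 division in the thread


def per_page_overhead_usec(extra_sys_seconds, bytes_transferred, page_size=PAGE_SIZE):
    """Extra system time spent per page, in microseconds."""
    pages = bytes_transferred / page_size
    return extra_sys_seconds / pages * 1e6


def pages_per_second(bytes_per_second, page_size=PAGE_SIZE):
    """Number of pages moved per second at a given throughput."""
    return bytes_per_second / page_size


def hypercalls_per_page(hypercalls_per_second, bytes_per_second, page_size=PAGE_SIZE):
    """Observed hypercalls per transferred page."""
    return hypercalls_per_second / pages_per_second(bytes_per_second, page_size)


if __name__ == "__main__":
    # 17 s of extra system time over the 8 GiB dd run
    print(f"{per_page_overhead_usec(17, 8589934592):.1f} usec/page")
    # dd reports MB as 10**6 bytes, so 233 MB/s is 233e6 bytes/s
    print(f"{pages_per_second(233e6):.0f} pages/sec")
    # kvm_stat sampled 33948 hypercalls/sec during the same run
    print(f"{hypercalls_per_page(33948, 233e6):.2f} hypercalls/page")
```

Under these inputs the observed ratio comes out well below the two hypercalls per page that the eviction-plus-lookup flow would suggest, which is the discrepancy being asked about.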
* RE: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-11 11:45 ` Avi Kivity @ 2012-06-11 15:44 ` Dan Magenheimer 2012-06-11 17:06 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Dan Magenheimer @ 2012-06-11 15:44 UTC (permalink / raw) To: Avi Kivity, Sasha Levin; +Cc: mtosatti, gregkh, sjenning, Konrad Wilk, kvm > From: Avi Kivity [mailto:avi@redhat.com] > > > > The guest doesn't do eviction at all, in fact - it doesn't know how big > > the cache is so even if it wanted to, it couldn't evict pages (the only > > thing it does is invalidate pages which have changed in the guest). > > IIUC, when the guest reads a page, it first has to make room in its own > pagecache; before dropping a clean page it calls cleancache to dispose > of it, which calls a hypercall which compresses and stores it on the > host. Next a page is allocated and a cleancache hypercall is made to > see if it is in host tmem. So two hypercalls per page, once guest > pagecache is full. Yes, Avi is correct here. > >> This is pretty steep. We have flash storage doing a million iops/sec, > >> and here you add 19 microseconds to that. > > > > Might be interesting to test it with flash storage as well... Well, to be fair, you are comparing a device that costs many thousands of $US to a software solution that uses idle CPU cycles and no additional RAM. > Batching will drastically reduce the number of hypercalls. For the record, batching CAN be implemented... ramster is essentially an implementation of batching where the local system is the "guest" and the remote system is the "host". But with ramster the overhead to move the data (whether batched or not) is much MUCH worse than a hypercall and ramster still shows performance advantage. So, IMHO, one step at a time. Get the foundation code in place and tune it later if a batching implementation can be demonstrated to improve performance sufficiently. 
> A different > alternative is to use ballooning to feed the guest free memory so it > doesn't need to hypercall at all. Deciding how to divide free memory > among the guests is hard (but then so is deciding how to divide tmem > memory among guests), and adding dedup on top of that is also hard (ksm? > zksm?). IMO letting the guest have the memory and manage it on its own > will be much simpler and faster compared to the constant chatting that > has to go on if the host manages this memory. Here we disagree, maybe violently. All existing solutions that try to do manage memory across multiple tenants from an "external memory manager policy" fail miserably. Tmem is at least trying something new by actively involving both the host and the guest in the policy (guest decides which pages, host decided how many) and without the massive changes required for something like IBM's solution (forgot what it was called). Yes, tmem has overhead but since the overhead only occurs where pages would otherwise have to be read/written from disk, the overhead is well "hidden". BTW, dedup in zcache is fairly easy to implement because the pages can only be read/written as an entire page and only through a well-defined API. Xen does it (with optional compression), zcache could also, but it never made much sense for zcache when there was only one tenant. KVM of course benefits from KSM, but IIUC KSM only works on anonymous pages. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-11 15:44 ` Dan Magenheimer @ 2012-06-11 17:06 ` Avi Kivity 2012-06-11 19:25 ` Sasha Levin 2012-06-12 1:18 ` Dan Magenheimer 0 siblings, 2 replies; 20+ messages in thread From: Avi Kivity @ 2012-06-11 17:06 UTC (permalink / raw) To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm On 06/11/2012 06:44 PM, Dan Magenheimer wrote: > > >> This is pretty steep. We have flash storage doing a million iops/sec, > > >> and here you add 19 microseconds to that. > > > > > > Might be interesting to test it with flash storage as well... > > Well, to be fair, you are comparing a device that costs many > thousands of $US to a software solution that uses idle CPU > cycles and no additional RAM. You don't know that those cycles are idle. And when in fact you have no additional RAM, those cycles are wasted to no benefit. The fact that I/O is being performed doesn't mean that we can waste cpu. Those cpu cycles can be utilized by other processes on the same guest or by other guests. > > > Batching will drastically reduce the number of hypercalls. > > For the record, batching CAN be implemented... ramster is essentially > an implementation of batching where the local system is the "guest" > and the remote system is the "host". But with ramster the > overhead to move the data (whether batched or not) is much MUCH > worse than a hypercall and ramster still shows performance advantage. Sure, you can buffer pages in memory but then you add yet another copy. I know you think copies are cheap but I disagree. > So, IMHO, one step at a time. Get the foundation code in > place and tune it later if a batching implementation can > be demonstrated to improve performance sufficiently. Sorry, no, first demonstrate no performance regressions, then we can talk about performance improvements. 
> > A different > > alternative is to use ballooning to feed the guest free memory so it > > doesn't need to hypercall at all. Deciding how to divide free memory > > among the guests is hard (but then so is deciding how to divide tmem > > memory among guests), and adding dedup on top of that is also hard (ksm? > > zksm?). IMO letting the guest have the memory and manage it on its own > > will be much simpler and faster compared to the constant chatting that > > has to go on if the host manages this memory. > > Here we disagree, maybe violently. All existing solutions that > try to do manage memory across multiple tenants from an "external > memory manager policy" fail miserably. Tmem is at least trying > something new by actively involving both the host and the guest > in the policy (guest decides which pages, host decided how many) > and without the massive changes required for something like > IBM's solution (forgot what it was called). cmm2 > Yes, tmem has > overhead but since the overhead only occurs where pages > would otherwise have to be read/written from disk, the > overhead is well "hidden". The overhead is NOT hidden. We spent many efforts to tune virtio-blk to reduce its overhead, and now you add 6-20 microseconds per page. A guest may easily be reading a quarter million pages per second, this adds up very fast - at the upper end you're consuming 5 vcpus just for tmem. Note that you don't even have to issue I/O to get a tmem hypercall invoked. Allocate a ton of memory and you get cleancache calls for each page that passes through the tail of the LRU. Again with the upper end, allocating a gigabyte can now take a few seconds extra. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 20+ messages in thread
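The "5 vcpus" and "a few seconds extra" estimates in the message above follow from simple multiplication. The sketch below reproduces them, taking the 6-20 microseconds per page and the quarter million pages per second from the thread. The function names are illustrative, and reading "a gigabyte" as 2**30 bytes with 4096-byte pages is an assumption.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def vcpus_consumed(pages_per_second, overhead_usec_per_page):
    """CPU-seconds burned per wall-clock second, i.e. equivalent busy vcpus."""
    return pages_per_second * overhead_usec_per_page / 1e6


def extra_seconds_for_allocation(bytes_allocated, overhead_usec_per_page,
                                 page_size=PAGE_SIZE):
    """Added time if every page of the allocation triggers one tmem call."""
    pages = bytes_allocated / page_size
    return pages * overhead_usec_per_page / 1e6


if __name__ == "__main__":
    for usec in (6, 20):  # lower and upper per-page overhead quoted in the thread
        print(f"{usec} usec/page: "
              f"{vcpus_consumed(250_000, usec):.1f} vcpus, "
              f"{extra_seconds_for_allocation(2**30, usec):.2f} s per GiB allocated")
```

This assumes exactly one tmem call per page; with a put and a get per page the figures would roughly double.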
* Re: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-11 17:06 ` Avi Kivity @ 2012-06-11 19:25 ` Sasha Levin 2012-06-11 19:56 ` Sasha Levin 2012-06-12 10:12 ` Avi Kivity 2012-06-12 1:18 ` Dan Magenheimer 1 sibling, 2 replies; 20+ messages in thread From: Sasha Levin @ 2012-06-11 19:25 UTC (permalink / raw) To: Avi Kivity; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm On Mon, 2012-06-11 at 20:06 +0300, Avi Kivity wrote: > Sorry, no, first demonstrate no performance regressions, then we can > talk about performance improvements. No performance regressions? For caching? How would that work? Or even if you meant just the kvm-tmem interface overhead, I don't see how that would work. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-11 19:25 ` Sasha Levin @ 2012-06-11 19:56 ` Sasha Levin 2012-06-12 11:46 ` Avi Kivity 2012-06-12 10:12 ` Avi Kivity 1 sibling, 1 reply; 20+ messages in thread From: Sasha Levin @ 2012-06-11 19:56 UTC (permalink / raw) To: Avi Kivity; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm On Mon, 2012-06-11 at 21:25 +0200, Sasha Levin wrote: > On Mon, 2012-06-11 at 20:06 +0300, Avi Kivity wrote: > > Sorry, no, first demonstrate no performance regressions, then we can > > talk about performance improvements. > > No performance regressions? For caching? How would that work? > > Or even if you meant just the kvm-tmem interface overhead, I don't see > how that would work. btw, so far we've been poking on half of the code here. What about frontswap over kvm-tmem? are there any specific tests you'd like to see there? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-11 19:56 ` Sasha Levin @ 2012-06-12 11:46 ` Avi Kivity 2012-06-12 11:58 ` Gleb Natapov 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2012-06-12 11:46 UTC (permalink / raw) To: Sasha Levin; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm On 06/11/2012 10:56 PM, Sasha Levin wrote: > > btw, so far we've been poking on half of the code here. > > What about frontswap over kvm-tmem? are there any specific tests you'd > like to see there? hmm. On one hand, no one swaps these days so there aren't any good benchmarks for it. On the other hand, with swapping, at least we're guaranteed the page will be read in the future (unlike cache, where it's quite possible it won't be). I don't know. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-12 11:46 ` Avi Kivity @ 2012-06-12 11:58 ` Gleb Natapov 2012-06-12 12:01 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Gleb Natapov @ 2012-06-12 11:58 UTC (permalink / raw) To: Avi Kivity Cc: Sasha Levin, Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm On Tue, Jun 12, 2012 at 02:46:38PM +0300, Avi Kivity wrote: > On 06/11/2012 10:56 PM, Sasha Levin wrote: > > > > btw, so far we've been poking on half of the code here. > > > > What about frontswap over kvm-tmem? are there any specific tests you'd > > like to see there? > > hmm. On one hand, no one swaps these days so there aren't any good > benchmarks for it. On the other hand, with swapping, at least we're > guaranteed the page will be read in the future (unlike cache, where it's > quite possible it won't be). I don't know. > > Swapped page can be discarded without reading too. -- Gleb. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-12 11:58 ` Gleb Natapov @ 2012-06-12 12:01 ` Avi Kivity 0 siblings, 0 replies; 20+ messages in thread From: Avi Kivity @ 2012-06-12 12:01 UTC (permalink / raw) To: Gleb Natapov Cc: Sasha Levin, Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm On 06/12/2012 02:58 PM, Gleb Natapov wrote: > On Tue, Jun 12, 2012 at 02:46:38PM +0300, Avi Kivity wrote: >> On 06/11/2012 10:56 PM, Sasha Levin wrote: >> > >> > btw, so far we've been poking on half of the code here. >> > >> > What about frontswap over kvm-tmem? are there any specific tests you'd >> > like to see there? >> >> hmm. On one hand, no one swaps these days so there aren't any good >> benchmarks for it. On the other hand, with swapping, at least we're >> guaranteed the page will be read in the future (unlike cache, where it's >> quite possible it won't be). I don't know. >> >> > Swapped page can be discarded without reading too. Right. The effects of frontswap can be achieved by swapping to a block device that sets cache=writeback, more or less (esp. with trim support, you can discard pages that you won't be needing again before they hit disk). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-11 19:25 ` Sasha Levin 2012-06-11 19:56 ` Sasha Levin @ 2012-06-12 10:12 ` Avi Kivity 1 sibling, 0 replies; 20+ messages in thread From: Avi Kivity @ 2012-06-12 10:12 UTC (permalink / raw) To: Sasha Levin; +Cc: Dan Magenheimer, mtosatti, gregkh, sjenning, Konrad Wilk, kvm On 06/11/2012 10:25 PM, Sasha Levin wrote: > On Mon, 2012-06-11 at 20:06 +0300, Avi Kivity wrote: >> Sorry, no, first demonstrate no performance regressions, then we can >> talk about performance improvements. > > No performance regressions? For caching? How would that work? A small degradation might be acceptable. 2X cpu consumption is not. IMO "using host memory" is the problem, because it involves copies and hypercalls. Try giving the memory to the guest, either through the balloon or through a pci device that exposes memory that can be withdrawn. That will make everything *much* faster. > > Or even if you meant just the kvm-tmem interface overhead, I don't see > how that would work. > I meant the overall overhead, as seen by users. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 20+ messages in thread
* RE: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-11 17:06 ` Avi Kivity 2012-06-11 19:25 ` Sasha Levin @ 2012-06-12 1:18 ` Dan Magenheimer 2012-06-12 10:09 ` Avi Kivity 1 sibling, 1 reply; 20+ messages in thread From: Dan Magenheimer @ 2012-06-12 1:18 UTC (permalink / raw) To: Avi Kivity; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm > From: Avi Kivity [mailto:avi@redhat.com] > Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support > > On 06/11/2012 06:44 PM, Dan Magenheimer wrote: > > > >> This is pretty steep. We have flash storage doing a million iops/sec, > > > >> and here you add 19 microseconds to that. > > > > > > > > Might be interesting to test it with flash storage as well... > > > > Well, to be fair, you are comparing a device that costs many > > thousands of $US to a software solution that uses idle CPU > > cycles and no additional RAM. > > You don't know that those cycles are idle. And when in fact you have no > additional RAM, those cycles are wasted to no benefit. > > The fact that I/O is being performed doesn't mean that we can waste > cpu. Those cpu cycles can be utilized by other processes on the same > guest or by other guests. You're right of course, so I apologize for oversimplifying... but so are you. Let's take a step back: IMHO, a huge part (majority?) of computer science these days is trying to beat Amdahl's law. On many machines/workloads, especially in virtual environments, RAM is the bottleneck. Tmem's role is, when RAM is the bottleneck, to increase RAM effective size AND, in a multi-tenant environment, flexibility at the cost of CPU cycles. But tmem also is designed to be very dynamically flexible so that it either has low CPU cost when it not being used OR can be dynamically disabled/re-enabled with reasonably low overhead. Why I think you are oversimplifying: "those cpu cycles can be utilized by other processes on the same guest or by other guests" pre-supposes that cpu availability is the bottleneck. 
It would be interesting if it were possible to measure how many systems (with modern processors) for which this is true. I'm not arguing that they don't exist but I suspect they are fairly rare these days, even for KVM systems. > > > Batching will drastically reduce the number of hypercalls. > > > > For the record, batching CAN be implemented... ramster is essentially > > an implementation of batching where the local system is the "guest" > > and the remote system is the "host". But with ramster the > > overhead to move the data (whether batched or not) is much MUCH > > worse than a hypercall and ramster still shows performance advantage. > > Sure, you can buffer pages in memory but then you add yet another copy. > I know you think copies are cheap but I disagree. I only think copies are *relatively* cheap. Orders of magnitude cheaper than some alternatives. So if it takes two page copies or even ten to replace a disk access, yes I think copies are cheap. (But I do understand your point.) > > So, IMHO, one step at a time. Get the foundation code in > > place and tune it later if a batching implementation can > > be demonstrated to improve performance sufficiently. > > Sorry, no, first demonstrate no performance regressions, then we can > talk about performance improvements. Well that's an awfully hard bar to clear, even with any of the many changes being merged every release into the core Linux mm subsystem. Any change to memory management will have some positive impacts on some workloads and some negative impacts on others. > > > A different > > > alternative is to use ballooning to feed the guest free memory so it > > > doesn't need to hypercall at all. Deciding how to divide free memory > > > among the guests is hard (but then so is deciding how to divide tmem > > > memory among guests), and adding dedup on top of that is also hard (ksm? > > > zksm?). 
IMO letting the guest have the memory and manage it on its own > > > will be much simpler and faster compared to the constant chatting that > > > has to go on if the host manages this memory. > > > > Here we disagree, maybe violently. All existing solutions that > > try to do manage memory across multiple tenants from an "external > > memory manager policy" fail miserably. Tmem is at least trying > > something new by actively involving both the host and the guest > > in the policy (guest decides which pages, host decided how many) > > and without the massive changes required for something like > > IBM's solution (forgot what it was called). > > cmm2 That's the one. Thanks for the reminder! > > Yes, tmem has > > overhead but since the overhead only occurs where pages > > would otherwise have to be read/written from disk, the > > overhead is well "hidden". > > The overhead is NOT hidden. We spent many efforts to tune virtio-blk to > reduce its overhead, and now you add 6-20 microseconds per page. A > guest may easily be reading a quarter million pages per second, this > adds up very fast - at the upper end you're consuming 5 vcpus just for tmem. > > Note that you don't even have to issue I/O to get a tmem hypercall > invoked. Allocate a ton of memory and you get cleancache calls for > each page that passes through the tail of the LRU. Again with the upper > end, allocating a gigabyte can now take a few seconds extra. Though not precisely so, we are arguing throughput vs latency here and the two can't always be mixed. And if, in allocating a GB of memory, you are tossing out useful pagecache pages, and those pagecache pages can instead be preserved by tmem thus saving N page faults and order(N) disk accesses, your savings are false economy. I think Sasha's numbers demonstrate that nicely. Anyway, as I've said all along, let's look at the numbers. I've always admitted that tmem on an old uniprocessor should be disabled.
If no performance degradation in that environment is a requirement for KVM-tmem to be merged, that is certainly your choice. And if "more CPU cycles used" is a metric, definitely, tmem is not going to pass because that's exactly what it's doing: trading more CPU cycles for better RAM efficiency == less disk accesses. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-12 1:18 ` Dan Magenheimer @ 2012-06-12 10:09 ` Avi Kivity 2012-06-12 16:40 ` Dan Magenheimer 0 siblings, 1 reply; 20+ messages in thread From: Avi Kivity @ 2012-06-12 10:09 UTC (permalink / raw) To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm On 06/12/2012 04:18 AM, Dan Magenheimer wrote: >> From: Avi Kivity [mailto:avi@redhat.com] >> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support >> >> On 06/11/2012 06:44 PM, Dan Magenheimer wrote: >> > > >> This is pretty steep. We have flash storage doing a million iops/sec, >> > > >> and here you add 19 microseconds to that. >> > > > >> > > > Might be interesting to test it with flash storage as well... >> > >> > Well, to be fair, you are comparing a device that costs many >> > thousands of $US to a software solution that uses idle CPU >> > cycles and no additional RAM. >> >> You don't know that those cycles are idle. And when in fact you have no >> additional RAM, those cycles are wasted to no benefit. >> >> The fact that I/O is being performed doesn't mean that we can waste >> cpu. Those cpu cycles can be utilized by other processes on the same >> guest or by other guests. > > You're right of course, so I apologize for oversimplifying... but > so are you. Let's take a step back: > > IMHO, a huge part (majority?) of computer science these days is > trying to beat Amdahl's law. On many machines/workloads, > especially in virtual environments, RAM is the bottleneck. > Tmem's role is, when RAM is the bottleneck, to increase RAM > effective size AND, in a multi-tenant environment, flexibility > at the cost of CPU cycles. But tmem also is designed to be very > dynamically flexible so that it either has low CPU cost when it > not being used OR can be dynamically disabled/re-enabled with > reasonably low overhead. 
> > Why I think you are oversimplifying: "those cpu cycles can be > utilized by other processes on the same guest or by other > guests" pre-supposes that cpu availability is the bottleneck. > It would be interesting if it were possible to measure how > many systems (with modern processors) for which this is true. > I'm not arguing that they don't exist but I suspect they are > fairly rare these days, even for KVM systems. In a given host, either cpu or memory is the bottleneck. If you have both free memory and free cycles, you pack more guests on that machine. During off-peak you may have both, but we need to see what happens during the peak; off-peak we're doing okay. So on such a host, during peak, either the cpu is churning away and we can't spare those cycles for tmem, or memory is packed full of guests and tmem won't provide much benefit (but still consume those cycles). > >> > > Batching will drastically reduce the number of hypercalls. >> > >> > For the record, batching CAN be implemented... ramster is essentially >> > an implementation of batching where the local system is the "guest" >> > and the remote system is the "host". But with ramster the >> > overhead to move the data (whether batched or not) is much MUCH >> > worse than a hypercall and ramster still shows performance advantage. >> >> Sure, you can buffer pages in memory but then you add yet another copy. >> I know you think copies are cheap but I disagree. > > I only think copies are *relatively* cheap. Orders of magnitude > cheaper than some alternatives. So if it takes two page copies > or even ten to replace a disk access, yes I think copies are cheap. > (But I do understand your point.) The copies are cheaper than a disk access, yes, but you need to factor in the probability of a disk access being saved. cleancache already works on the tail end of the lru, we're dumping those pages because they have low access frequency, so the probability starts out low.
If many guests are active (so we need the cpu resources), then they also compete for tmem resources, and per-guest it becomes less effective as well. > >> > So, IMHO, one step at a time. Get the foundation code in >> > place and tune it later if a batching implementation can >> > be demonstrated to improve performance sufficiently. >> >> Sorry, no, first demonstrate no performance regressions, then we can >> talk about performance improvements. > > Well that's an awfully hard bar to clear, even with any of the many > changes being merged every release into the core Linux mm subsystem. > Any change to memory management will have some positive impacts on some > workloads and some negative impacts on others. Right, that's too harsh. But these benchmarks show a doubling (or even more) of cpu overhead, and that is whether the cache is effective or not. That is simply way too much to consider. Look at the block, vfs, and mm layers. Huge pains have been taken to batch everything and avoid per-page work -- 20 years of not having enough cycles. And here you throw all this out of the window with per-page crossing of the guest/host boundary. > >> > Yes, tmem has >> > overhead but since the overhead only occurs where pages >> > would otherwise have to be read/written from disk, the >> > overhead is well "hidden". >> >> The overhead is NOT hidden. We spent many efforts to tune virtio-blk to >> reduce its overhead, and now you add 6-20 microseconds per page. A >> guest may easily be reading a quarter million pages per second, this >> adds up very fast - at the upper end you're consuming 5 vcpus just for tmem. >> >> Note that you don't even have to issue I/O to get a tmem hypercall >> invoked. Allocate a ton of memory and you get cleancache calls for >> each page that passes through the tail of the LRU. Again with the upper >> end, allocating a gigabyte can now take a few seconds extra.
> > Though not precisely so, we are arguing throughput vs latency here > and the two can't always be mixed. > > And if, in allocating a GB of memory, you are tossing out useful > pagecache pages, and those pagecache pages can instead be preserved > by tmem thus saving N page faults and order(N) disk accesses, > your savings are false economy. I think Sasha's numbers > demonstrate that nicely. It depends. If you have an 8GB guest, then saving the tail end of an 8GB LRU may improve your caching or it may not. But the impact on that allocation is certain. You're trading off possible marginal improvement for unconditional performance degradation. > > Anyway, as I've said all along, let's look at the numbers. > I've always admitted that tmem on an old uniprocessor should > be disabled. If no performance degradation in that environment > is a requirement for KVM-tmem to be merged, that is certainly > your choice. And if "more CPU cycles used" is a metric, > definitely, tmem is not going to pass because that's exactly > what it's doing: trading more CPU cycles for better RAM > efficiency == less disk accesses. Again, the cpu cycles spent are certain, and double the effort needed to get those pages in the first place. Disk accesses saved will depend on the workload, and on host memory availability. Turning tmem on will certainly generate performance regressions as well as improvements. Maybe on Xen the tradeoff is different (hypercalls ought to be faster on xenpv), but the numbers I saw on kvm aren't good. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 20+ messages in thread
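One rough way to frame the disagreement above (a certain per-page cost against a probabilistic disk access saved) is a break-even hit probability. The sketch below is only a toy model under stated assumptions: it treats CPU time and disk wait time as interchangeable, which is itself one of the points in dispute, and the disk latencies used are hypothetical values chosen for illustration, not measurements from this thread. Only the 20 microseconds per page upper bound comes from the discussion.

```python
def break_even_hit_probability(overhead_usec_per_page, disk_access_usec):
    """Fraction of cached pages that must be re-read for tmem to pay off.

    Every page pays the overhead; only a hit saves a disk access.
    """
    return overhead_usec_per_page / disk_access_usec


def expected_net_saving_usec(hit_probability, overhead_usec_per_page, disk_access_usec):
    """Expected time saved per page (negative means a net loss)."""
    return hit_probability * disk_access_usec - overhead_usec_per_page


if __name__ == "__main__":
    overhead = 20  # usec/page, upper bound quoted in the thread
    # Hypothetical device latencies, for illustration only.
    for name, latency_usec in (("rotating disk", 5000), ("flash", 100)):
        p = break_even_hit_probability(overhead, latency_usec)
        print(f"{name}: break-even hit probability {p:.3f}")
```

In this model, the faster the backing device and the lower the re-read probability at the tail of the LRU, the harder it is for the per-page cost to pay for itself.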
* RE: [RFC 00/10] KVM: Add TMEM host/guest support 2012-06-12 10:09 ` Avi Kivity @ 2012-06-12 16:40 ` Dan Magenheimer 2012-06-12 17:54 ` Avi Kivity 0 siblings, 1 reply; 20+ messages in thread From: Dan Magenheimer @ 2012-06-12 16:40 UTC (permalink / raw) To: Avi Kivity; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm > From: Avi Kivity [mailto:avi@redhat.com] > Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support I started off with a point-by-point comment on most of your responses about the tradeoffs of how tmem works, but decided it best to simply say we disagree and kvm-tmem will need to prove who is right. > >> Sorry, no, first demonstrate no performance regressions, then we can > >> talk about performance improvements. > > > > Well that's an awfully hard bar to clear, even with any of the many > > changes being merged every release into the core Linux mm subsystem. > > Any change to memory management will have some positive impacts on some > > workloads and some negative impacts on others. > > Right, that's too harsh. But these benchmarks show a doubling (or even > more) of cpu overhead, and that is whether the cache is effective or > not. That is simply way too much to consider. One point here... remember you have contrived a worst case scenario. The one case Sasha provided outside of that contrived worst case, as you commented, looks very nice. So the costs/benefits remain to be seen over a wider set of workloads. Also, even that contrived case should look quite a bit better with WasActive properly implemented. > Look at the block, vfs, and mm layers. Huge pains have been taken to > batch everything and avoid per-page work -- 20 years of not having > enough cycles. And here you throw all this out of the window with > per-page crossing of the guest/host boundary. Well, to be fair, those 20 years of effort were because (1) disk seeks are a million times slower than an in-RAM page copy and (2) SMP systems were rare and expensive. 
The world changes... ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC 00/10] KVM: Add TMEM host/guest support
  2012-06-12 16:40 ` Dan Magenheimer
@ 2012-06-12 17:54 ` Avi Kivity
  0 siblings, 0 replies; 20+ messages in thread
From: Avi Kivity @ 2012-06-12 17:54 UTC
To: Dan Magenheimer; +Cc: Sasha Levin, mtosatti, gregkh, sjenning, Konrad Wilk, kvm

On 06/12/2012 07:40 PM, Dan Magenheimer wrote:
> > From: Avi Kivity [mailto:avi@redhat.com]
> > Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>
> I started off with a point-by-point comment on most of your
> responses about the tradeoffs of how tmem works, but decided
> it best to simply say we disagree and kvm-tmem will need to
> prove who is right.

That is why I am asking for benchmarks.

> > >> Sorry, no, first demonstrate no performance regressions, then we can
> > >> talk about performance improvements.
> > >
> > > Well that's an awfully hard bar to clear, even with any of the many
> > > changes being merged every release into the core Linux mm subsystem.
> > > Any change to memory management will have some positive impacts on some
> > > workloads and some negative impacts on others.
> >
> > Right, that's too harsh.  But these benchmarks show a doubling (or even
> > more) of cpu overhead, and that is whether the cache is effective or
> > not.  That is simply way too much to consider.
>
> One point here... remember you have contrived a worst case
> scenario.  The one case Sasha provided outside of that contrived
> worst case, as you commented, looks very nice.  So the costs/benefits
> remain to be seen over a wider set of workloads.

While the workload is contrived, decreasing benefits with increasing
cache size is nothing new.  And here tmem is increasing the cost of all
caching, without guaranteeing any return.

> Also, even that contrived case should look quite a bit better
> with WasActive properly implemented.

I'll be happy to see benchmarks of improved code.

> > Look at the block, vfs, and mm layers.  Huge pains have been taken to
> > batch everything and avoid per-page work -- 20 years of not having
> > enough cycles.  And here you throw all this out of the window with
> > per-page crossing of the guest/host boundary.
>
> Well, to be fair, those 20 years of effort were because
> (1) disk seeks are a million times slower than an in-RAM page
> copy and (2) SMP systems were rare and expensive.  The
> world changes...

I don't see how SMP matters here.  You have more cores, you put more
work on them, you don't expect the OS or hypervisor to consume them for
you.  In any case you're consuming this cpu on the same core as the
guest, so you're reducing throughput (if caching is ineffective).

Disks are still slow, even fast flash arrays, but tmem is not the only
solution to that problem.  You say ballooning has not proven itself in
this area, but that doesn't mean it has been proven not to work; and it
doesn't suffer from the inefficiency of crossing the guest/host boundary.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
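Editorial sketch, not part of the original message: the batching argument above comes down to how many guest/host transitions a given number of pages costs. The Python below only counts crossings. It is a toy model under stated assumptions: 4 KiB pages, a 16 GiB working set chosen to resemble the streaming test in the cover letter, and a hypothetical batch size of 32. None of the function names correspond to anything in the patch series, and no per-crossing cost is taken from the thread.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def boundary_crossings(pages: int, pages_per_call: int) -> int:
    """Guest/host transitions needed to hand `pages` pages to the host.

    `pages_per_call` is 1 for a per-page interface; larger values model a
    hypothetical batched interface.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if pages_per_call < 1:
        raise ValueError("pages_per_call must be at least 1")
    # Integer ceiling division: a partially filled batch still costs one call.
    return (pages + pages_per_call - 1) // pages_per_call


def crossing_overhead_seconds(pages: int, pages_per_call: int,
                              seconds_per_crossing: float) -> float:
    """Total transition overhead for a per-crossing cost the caller measured."""
    return boundary_crossings(pages, pages_per_call) * seconds_per_crossing


if __name__ == "__main__":
    pages = (16 * 2**30) // PAGE_SIZE  # illustrative 16 GiB working set
    for batch in (1, 32):              # 32 is an arbitrary illustrative batch
        calls = boundary_crossings(pages, batch)
        print(f"{batch:>2} page(s) per call: {calls} crossings")
```

Any real overhead figure depends on a measured per-crossing cost, which `crossing_overhead_seconds` takes as a parameter rather than assuming; the model also ignores the copy and compression work done on the host side.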
end of thread, other threads:[~2012-06-12 17:55 UTC]

Thread overview: 20+ messages
2012-06-06 13:07 [RFC 00/10] KVM: Add TMEM host/guest support Sasha Levin
2012-06-06 13:24 ` Avi Kivity
2012-06-08 13:20 ` Sasha Levin
2012-06-08 16:06 ` Dan Magenheimer
2012-06-11 11:17 ` Avi Kivity
2012-06-11  8:09 ` Avi Kivity
2012-06-11 10:26 ` Sasha Levin
2012-06-11 11:45 ` Avi Kivity
2012-06-11 15:44 ` Dan Magenheimer
2012-06-11 17:06 ` Avi Kivity
2012-06-11 19:25 ` Sasha Levin
2012-06-11 19:56 ` Sasha Levin
2012-06-12 11:46 ` Avi Kivity
2012-06-12 11:58 ` Gleb Natapov
2012-06-12 12:01 ` Avi Kivity
2012-06-12 10:12 ` Avi Kivity
2012-06-12  1:18 ` Dan Magenheimer
2012-06-12 10:09 ` Avi Kivity
2012-06-12 16:40 ` Dan Magenheimer
2012-06-12 17:54 ` Avi Kivity