From: Dan Magenheimer <dan.magenheimer@oracle.com>
To: Avi Kivity <avi@redhat.com>
Cc: Sasha Levin <levinsasha928@gmail.com>,
mtosatti@redhat.com, gregkh@linuxfoundation.org,
sjenning@linux.vnet.ibm.com, Konrad Wilk <konrad.wilk@oracle.com>,
kvm@vger.kernel.org
Subject: RE: [RFC 00/10] KVM: Add TMEM host/guest support
Date: Mon, 11 Jun 2012 18:18:12 -0700 (PDT)
Message-ID: <022f701e-d40f-4b44-b960-effa0d320d4a@default>
In-Reply-To: <4FD625A7.5020707@redhat.com>
> From: Avi Kivity [mailto:avi@redhat.com]
> Subject: Re: [RFC 00/10] KVM: Add TMEM host/guest support
>
> On 06/11/2012 06:44 PM, Dan Magenheimer wrote:
> > > >> This is pretty steep. We have flash storage doing a million iops/sec,
> > > >> and here you add 19 microseconds to that.
> > > >
> > > > Might be interesting to test it with flash storage as well...
> >
> > Well, to be fair, you are comparing a device that costs many
> > thousands of $US to a software solution that uses idle CPU
> > cycles and no additional RAM.
>
> You don't know that those cycles are idle. And when in fact you have no
> additional RAM, those cycles are wasted to no benefit.
>
> The fact that I/O is being performed doesn't mean that we can waste
> cpu. Those cpu cycles can be utilized by other processes on the same
> guest or by other guests.
You're right of course, so I apologize for oversimplifying... but
so are you. Let's take a step back:
IMHO, a huge part (majority?) of computer science these days is
trying to beat Amdahl's law. On many machines/workloads,
especially in virtual environments, RAM is the bottleneck.
Tmem's role is, when RAM is the bottleneck, to increase RAM
effective size AND, in a multi-tenant environment, flexibility
at the cost of CPU cycles. But tmem is also designed to be
dynamically flexible, so that it either has low CPU cost when it
is not being used OR can be dynamically disabled/re-enabled with
reasonably low overhead.
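To make the "guest decides which pages, host decides how many" split concrete, here is a toy user-space model of an ephemeral tmem pool. This is illustrative only -- the names and semantics are simplified, not the actual kernel cleancache API: the guest puts clean pagecache pages as they fall off the LRU, the host keeps them only while it has spare RAM, and it may silently drop any page, so a later get can miss and the guest falls back to a disk read.

```python
# Toy model of tmem's "ephemeral" pool semantics (illustrative,
# hypothetical names; not the kernel cleancache interface).

class EphemeralPool:
    def __init__(self, host_page_limit):
        self.limit = host_page_limit   # host decides how many pages to hold
        self.store = {}                # (inode, index) -> page contents

    def put(self, key, page):
        """Guest offers an evicted clean page; host may silently drop one."""
        if len(self.store) >= self.limit:
            # Host under memory pressure: discard an arbitrary resident page.
            self.store.pop(next(iter(self.store)))
        self.store[key] = page

    def get(self, key):
        """Guest asks for the page back on a refault; None means a miss,
        and the guest must read the page from disk instead."""
        # Ephemeral semantics: a successful get also removes the page.
        return self.store.pop(key, None)

pool = EphemeralPool(host_page_limit=2)
pool.put(("inode1", 0), b"A" * 4096)
pool.put(("inode1", 1), b"B" * 4096)
hit = pool.get(("inode1", 0))      # refault served from host RAM, no disk I/O
miss = pool.get(("inode2", 7))     # never put (or dropped): disk read needed
```

The key property the model shows is that the host can shrink the pool at any time without coordinating with the guest -- a miss is always a legal answer.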
Why I think you are oversimplifying: "those cpu cycles can be
utilized by other processes on the same guest or by other
guests" pre-supposes that cpu availability is the bottleneck.
It would be interesting to measure, on systems with modern
processors, how often this is actually true.
I'm not arguing that they don't exist but I suspect they are
fairly rare these days, even for KVM systems.
> > > Batching will drastically reduce the number of hypercalls.
> >
> > For the record, batching CAN be implemented... ramster is essentially
> > an implementation of batching where the local system is the "guest"
> > and the remote system is the "host". But with ramster the
> > overhead to move the data (whether batched or not) is much MUCH
> > worse than a hypercall and ramster still shows performance advantage.
>
> Sure, you can buffer pages in memory but then you add yet another copy.
> I know you think copies are cheap but I disagree.
I only think copies are *relatively* cheap. Orders of magnitude
cheaper than some alternatives. So if it takes two page copies
or even ten to replace a disk access, yes I think copies are cheap.
(But I do understand your point.)
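The "copies are *relatively* cheap" claim is just arithmetic. The latencies below are illustrative assumptions (roughly 1 us to copy a 4 KiB page, a few ms for a seek-bound disk read), not measurements from this thread:

```python
# Back-of-envelope cost comparison: N page copies vs. one disk access.
# Both latency figures are illustrative assumptions, not measured data.

PAGE_COPY_US = 1.0        # assumed cost to copy one 4 KiB page
DISK_ACCESS_US = 5000.0   # assumed cost of one seek-bound disk read (~5 ms)

def copies_vs_disk(n_copies):
    """How many times cheaper are n page copies than one disk access?"""
    return DISK_ACCESS_US / (n_copies * PAGE_COPY_US)

print(copies_vs_disk(2))   # two copies replacing one disk read
print(copies_vs_disk(10))  # even ten copies still win comfortably
```

Under these assumptions, two copies are ~2500x cheaper than the disk access they replace, and ten copies are still ~500x cheaper -- which is the sense in which copies are cheap *relative to the alternative*, not cheap in absolute terms.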
> > So, IMHO, one step at a time. Get the foundation code in
> > place and tune it later if a batching implementation can
> > be demonstrated to improve performance sufficiently.
>
> Sorry, no, first demonstrate no performance regressions, then we can
> talk about performance improvements.
Well, that's an awfully hard bar to clear; even many of the
changes merged every release into the core Linux mm subsystem
couldn't pass it.
Any change to memory management will have some positive impacts on some
workloads and some negative impacts on others.
> > > A different
> > > alternative is to use ballooning to feed the guest free memory so it
> > > doesn't need to hypercall at all. Deciding how to divide free memory
> > > among the guests is hard (but then so is deciding how to divide tmem
> > > memory among guests), and adding dedup on top of that is also hard (ksm?
> > > zksm?). IMO letting the guest have the memory and manage it on its own
> > > will be much simpler and faster compared to the constant chatting that
> > > has to go on if the host manages this memory.
> >
> > Here we disagree, maybe violently. All existing solutions that
> > try to manage memory across multiple tenants from an "external
> > memory manager policy" fail miserably. Tmem is at least trying
> > something new by actively involving both the host and the guest
> > in the policy (guest decides which pages, host decides how many)
> > and without the massive changes required for something like
> > IBM's solution (forgot what it was called).
>
> cmm2
That's the one. Thanks for the reminder!
> > Yes, tmem has
> > overhead but since the overhead only occurs where pages
> > would otherwise have to be read/written from disk, the
> > overhead is well "hidden".
>
> The overhead is NOT hidden. We spent many efforts to tune virtio-blk to
> reduce its overhead, and now you add 6-20 microseconds per page. A
> guest may easily be reading a quarter million pages per second, this
> adds up very fast - at the upper end you're consuming 5 vcpus just for tmem.
>
> Note that you don't even have to issue I/O to get a tmem hypercall
> invoked. Allocate a ton of memory and you get cleancache calls for
> each page that passes through the tail of the LRU. Again with the upper
> end, allocating a gigabyte can now take a few seconds extra.
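The figures in the quoted paragraph check out as straightforward arithmetic (taking the upper-end 20 us per-page overhead from earlier in the thread):

```python
# Sanity check of the quoted worst-case numbers: 20 us of tmem overhead
# per page, a guest touching a quarter million pages per second, and a
# 1 GiB allocation pushing pages through the tail of the LRU.

OVERHEAD_US = 20          # upper-end per-page cost cited in the thread
PAGES_PER_SEC = 250_000   # "a quarter million pages per second"
PAGE_SIZE = 4096          # bytes per page

# CPU-seconds consumed per wall-clock second == number of vcpus burned.
vcpus = PAGES_PER_SEC * OVERHEAD_US / 1_000_000

# Extra time to push 1 GiB of pagecache through cleancache callbacks.
gb_pages = (1 << 30) // PAGE_SIZE
gb_extra_sec = gb_pages * OVERHEAD_US / 1_000_000

print(vcpus)         # matches "consuming 5 vcpus just for tmem"
print(gb_extra_sec)  # matches "a few seconds extra" for a gigabyte
```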
Though not precisely so, we are arguing throughput vs latency here
and the two can't always be mixed.
And if, in allocating a GB of memory, you are tossing out useful
pagecache pages, and those pagecache pages can instead be preserved
by tmem thus saving N page faults and order(N) disk accesses,
your savings are false economy. I think Sasha's numbers
demonstrate that nicely.
Anyway, as I've said all along, let's look at the numbers.
I've always admitted that tmem on an old uniprocessor should
be disabled. If no performance degradation in that environment
is a requirement for KVM-tmem to be merged, that is certainly
your choice. And if "more CPU cycles used" is a metric,
tmem definitely won't pass, because that's exactly
what it does: trade more CPU cycles for better RAM
efficiency == fewer disk accesses.