linux-kernel.vger.kernel.org archive mirror
From: Avi Kivity <avi@redhat.com>
To: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	Pavel Machek <pavel@ucw.cz>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hugh.dickins@tiscali.co.uk, ngupta@vflare.org,
	JBeulich@novell.com, chris.mason@oracle.com,
	kurt.hackel@oracle.com, dave.mccracken@oracle.com,
	npiggin@suse.de, akpm@linux-foundation.org, riel@redhat.com
Subject: Re: Frontswap [PATCH 0/4] (was Transcendent Memory): overview
Date: Sun, 02 May 2010 18:35:47 +0300
Message-ID: <4BDD9BD3.2080301@redhat.com>
In-Reply-To: <3a62a058-7976-48d7-acd2-8c6a8312f10f@default>

On 05/01/2010 08:10 PM, Dan Magenheimer wrote:
>> Eventually you'll have to swap frontswap pages, or kill uncooperative
>> guests.  At which point all of the simplicity is gone.
>>      
> OK, now I think I see the crux of the disagreement.
>    

Alas, I think we're pretty far from that.

> NO!  Frontswap on Xen+tmem never *never* _never_ NEVER results
> in host swapping.

That's a bug.  You're giving the guest memory without the means to take 
it back.  The result is that you have to _undercommit_ your memory 
resources.

Consider a machine running a guest, with most of its memory free.  You 
give the memory via frontswap to the guest.  The guest happily swaps to 
frontswap, and uses the freed memory for something unswappable, like 
mlock()ed memory or hugetlbfs.

Now the second node dies and you need memory to migrate your guests 
into.  But you can't, and the hypervisor is at the mercy of the guest 
for getting its memory back; and the guest can't do it (at least not 
quickly).

> Host swapping is evil.  Host swapping is
> the root of most of the bad reputation that memory overcommit
> has gotten from VMware customers.  Host swapping can't be
> avoided with some memory overcommit technologies (such as page
> sharing), but frontswap on Xen+tmem CAN and DOES avoid it.
>    

In this case the guest expects that swapped out memory will be slow 
(since it was freed via the swap API; it will be slow if the host happened 
to run out of tmem).  So by storing this memory on disk you aren't 
reducing performance beyond what you promised to the guest.

Swapping guest RAM will indeed cause a performance hit, but sometimes 
you need to do it.

> So, to summarize:
>
> 1) You agreed that a synchronous interface for frontswap makes
>     sense for swap-to-in-kernel-compressed-RAM because it is
>     truly swapping to RAM.
>    

Because the interface is internal to the kernel.

> 2) You have pointed out that an asynchronous interface for
>     frontswap makes more sense for KVM than a synchronous
>     interface, because KVM does host swapping.

kvm's host swapping is unrelated.  Host swapping swaps guest-owned 
memory; that's not what we want here.  We want to cache guest swap in 
RAM, and that's easily done by having a virtual disk cached in main 
memory.  We're simply presenting a disk with a large write-back cache to 
the guest.

You could just as easily cache a block device in free RAM with Xen.  
Have a tmem domain behave as the backend for your swap device.  Use 
ballooning to force tmem to disk, or to allow more cache when memory is 
free.

Voila: you no longer depend on guests (you depend on the tmem domain, 
but that's part of the host code), you don't need guest modifications, 
so it works across a wider range of guests.
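
To make the write-back-cache argument concrete, here is a minimal sketch (not code from this thread, and not how Xen, tmem, or KVM implement anything).  The class name WriteBackSwapDisk, its methods, and the oldest-first eviction policy are all hypothetical.  It only illustrates the point that the host can shrink the cache to disk at any time without asking the guest:

```python
class WriteBackSwapDisk:
    """Hypothetical model of a guest swap disk backed by a host-RAM write-back cache."""

    def __init__(self, cache_limit):
        self.cache_limit = cache_limit  # max pages the host lets the cache hold
        self.cache = {}                 # pages currently held in host RAM: sector -> data
        self.disk = {}                  # host backing store: sector -> data

    def write(self, sector, data):
        # Guest swap-out: the page lands in host RAM first.
        self.cache[sector] = data
        self._shrink_to(self.cache_limit)

    def read(self, sector):
        # Guest swap-in: served from RAM if cached, otherwise from disk.
        if sector in self.cache:
            return self.cache[sector]
        return self.disk[sector]

    def set_cache_limit(self, pages):
        # Host-side policy decision (assumed to be driven by something like
        # ballooning the cache domain); the guest is not involved.
        self.cache_limit = pages
        self._shrink_to(pages)

    def _shrink_to(self, pages):
        # Write back the oldest-inserted cached pages until the cache fits.
        while len(self.cache) > pages:
            sector = next(iter(self.cache))
            self.disk[sector] = self.cache.pop(sector)


disk = WriteBackSwapDisk(cache_limit=4)
for sector in range(4):
    disk.write(sector, f"page{sector}")

disk.set_cache_limit(0)      # host needs its memory back: flush everything
print(len(disk.cache), len(disk.disk))
```

In this model the guest only ever sees a disk that is sometimes fast and sometimes slow, which is the contract the swap API already gives it.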

>    Then you said
>     if you have an asynchronous interface anyway, the existing
>     swap code works just fine with no changes so frontswap
>     is not needed at all... for KVM.
>    

For any hypervisor which implements virtual disks with write-back cache 
in host memory.

> 3) You have suggested that if Xen were more like KVM and required
>     host-swapping, then Xen doesn't need frontswap either.
>    

Host swapping is not a requirement.

> BUT frontswap on Xen+tmem always truly swaps to RAM.
>    

AND that's a problem because it puts the hypervisor at the mercy of the 
guest.

> So there are two users of frontswap for which the synchronous
> interface makes sense.

I believe there is only one.  See below.

> I believe there may be more in the
> future and you disagree but, as Jeremy said, "a general Linux
> principle is not to overdesign interfaces for hypothetical users,
> only for real needs."  We have demonstrated there is a need
> with at least two users so the debate is only whether the
> number of users is two or more than two.
>
> Frontswap is a very non-invasive patch and is very cleanly
> layered so that if it is not in the presence of either of
> the intended "users", it can be turned off in many different
> ways with zero overhead (CONFIG'ed off) or extremely small overhead
> (frontswap_ops is never set; or frontswap_ops is set but the
> underlying hypervisor doesn't support it so frontswap_poolid
> never gets set).
>    

The problem is not the complexity of the patch itself.  It's the fact 
that it introduces a new external API.  If we refactor swapping, that 
stands in the way.

How much, that's up to the mm maintainers to say.  If it isn't a problem 
for them, fine (but I still think 
swap-to-RAM-without-hypervisor-decommit is a bad idea).

> So... KVM doesn't need it and won't use it.  Do you, Avi, have
> any other objections as to why the frontswap patch shouldn't be
> accepted as is for the users that DO need it and WILL use it?
>    

Even ignoring the problems above (which are really hypervisor problems 
and the guest, which is what we're discussing here, shouldn't care if 
the hypervisor paints itself into an oom), a synchronous single-page DMA 
API is a bad idea.  Look at the Xen network and block code: while they 
eventually do a memory copy for every page they see, they try to batch 
multiple pages into an exit, and make the response asynchronous.

As an example, with a batched API you could save/restore the fpu context 
and use sse for copying the memory, while with a single page API you'd 
probably lose out.  Synchronous DMA, even for emulated hardware, is out 
of place in 2010.
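
As a rough illustration of the batching half of this argument, the sketch below counts "exits" for a per-page interface versus a batched one.  It is a toy model under stated assumptions: CountingBackend, put_page, put_batch, and the batch size are hypothetical names and values, not the frontswap or Xen API, and the asynchronous-completion side is not modelled at all.

```python
class CountingBackend:
    """Hypothetical stand-in for a hypervisor backend that counts guest exits."""

    def __init__(self):
        self.exits = 0
        self.store = {}

    def put_page(self, key, page):
        # Single-page call: every page costs one exit.
        self.exits += 1
        self.store[key] = page

    def put_batch(self, pages):
        # Batched call: one exit covers every page in the batch, so any
        # fixed per-exit setup cost is paid once per batch.
        self.exits += 1
        self.store.update(pages)


def swap_out_single(backend, pages):
    for key, page in pages.items():
        backend.put_page(key, page)


def swap_out_batched(backend, pages, batch_size):
    items = list(pages.items())
    for start in range(0, len(items), batch_size):
        backend.put_batch(dict(items[start:start + batch_size]))


pages = {n: f"page{n}" for n in range(64)}

single = CountingBackend()
swap_out_single(single, pages)

batched = CountingBackend()
swap_out_batched(batched, pages, batch_size=16)

print(single.exits, batched.exits)
```

Both backends end up storing the same pages; only the number of exits differs, and that is where a fixed per-exit cost such as saving and restoring the fpu context would be amortized.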

-- 
error compiling committee.c: too many arguments to function

