[Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Anthony Liguori <anthony@codemonkey.ws>
To: Juan Quintela <quintela@redhat.com>
Cc: qemu-devel@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com>
Subject: [Qemu-devel] Re: [PATCH 02/10] Add buffered_file_internal constant
Date: Tue, 30 Nov 2010 08:02:56 -0600	[thread overview]
Message-ID: <4CF50410.3080305@codemonkey.ws> (raw)
In-Reply-To: <m3hbezvxcw.fsf@trasno.mitica>

On 11/30/2010 05:56 AM, Juan Quintela wrote:
> No, I benchmarked against two workloads:
> a- idle guest (because it was faster to test)
> b- busy guest (each test takes forever, that is the reason that I tested
> last).
>
> So, I don't agree with that.
>    

But in both cases, it's a large memory guest where the RSS size is <<< 
than the allocated memory size.  This is simply an unrealistic 
scenario.  Migrating immediately after launch may be common among 
developers but users typically run useful workloads in their guests and 
run migration after the guest has been running for quite some time.

So the scenario I'm much more interested in, is trying to migrate a 
400GB guest after the guest has been doing useful work and has brought 
in most of it's RSS.  Also with a box that big, you're going to be on 
10gbit.

That's going to introduce a whole other set of problems that this series 
is potentially just going to exacerbate even more.

>> There are three fundamental problems: 1) kvm.ko dirty bit tracking
>> doesn't scale
>>      
> Fully agree, but this patch don't took that.
>
>    
>> 2) we lose flow control information because of the
>> multiple levels of buffering which means we move more data than we
>> should move
>>      
> Fully agree here, but this is a "massive change" to fix it correctly.
>    

It's really not massive.

>> 3) migration prevents a guest from executing the device
>> model because of qemu_mutex.
>>      
> This is a different problem.
>    

No, this is really the fundamental one.

>> Those are the problems to fix.
>>      
> This still don't fix the stalls on the main_loop.
>
> So, you are telling me, there are this list of problems that you need to
> fix.  They are not enough to fix the problem, and their imply massive
> changes.
>
> In the middle time, everybody in stable and 0.14 is not going to be able
> to use migration with more than 2GB/4GB guest.
>    

That's simply not true.  You started this series with a statement that 
migration is broken.  It's not, it works perfectly fine.  Migration with 
8GB guests work perfectly fine.

You've identified a corner case for which we have suboptimal behavior, 
and are now declaring that migration is "totally broken".

>>   Sprinkling the code with returns in
>> semi-random places because it benchmarked well for one particular test
>> case is something we'll deeply regret down the road.
>>      
> This was mean :(
>    

It wasn't intended to be mean but it is the truth.  We need to approach 
these sort of problems in a more systematic way.  Systematic means 
identifying what the fundamental problems are and fixing them in a 
proper way.

Throwing a magic number in the iteration path after which we let the 
main loop run for a little bit is simply not a solution.  You're still 
going to get main loop starvation.

> There are two returns and one heuristic.
>
> - return a) we try to migrate when we know that there is no space,
>    obvious optimazation/bug (deppends on how to look at it).
>
> - return b) we don't need to handle TLB bitmap code for kvm.  I fully
>    agree that we need to split the bitmaps in something more sensible,
>    but change is quite invasible, and simple fix works for the while.
>
> - heuristic:  if you really think that an io_handler should be able to
>    stall the main loop for almost 4 seconds, sorry, I don't agree.
>    

But fundamentally, why would an iteration of migration take 4 seconds?  
This is really my fundamental object, the migration loop should, at most 
take as much time as it takes to fill up an empty socket buffer or until 
it hits the bandwidth limit.

The bandwidth limit offers a fine-grain control over exactly how long it 
should take.

If we're burning excess CPU walking a 100MB bitmap, then let's fix that 
problem.  Stopping every 1MB worth of the bitmap to do other work just 
papers over the real problem (that we're walking 100MB bitmap).

>    Again, we can fix this much better, but it needs lots of changes:
>     * I asked you if there is a "maximum" value that one io_handler can
>       hog the main loop.  Answer: dunno/deppends.
>     * Long term: migration should be its own thread, have I told thread?
>       This is qemu, no thread.
>     * qemu_mutex: this is the holy grail, if we drop qemu_mutex in
>       ram_save_live(), we left the guest cpu's to continue running.
>       But, and this is a big but, we still stuck the main_loop for 4
>       seconds, so this solution is at least partial.
>    

It really shouldn't be that hard at all to add a multi-level tree to 
kvm.ko and it fixes a bunch of problems at once.  Likewise, accounting 
zero pages is very simple and it makes bandwidth limiting more effective.

> And that is it.  To improve things, I receive complains left and right
> that exporting buffered_file_period/timeout is BAD, very BAD.  Because
> that exposes buffered_file.c internal state.
>
> But I am given the suggestion that I should just create a
> buffered_file_write_nop() that just increases the number of bytes
> transferred for normal pages.  I agree that this "could" also work, but
> that is indeed worse in the sense that we are exposing yet more
> internals of buffered_file.c.  In the 1st case, we are only exposing the
> periof of a timer, in the second, we are hijaking how
> qemu_file_rate_limit() works + how we account for written memory + how
> it calculates the ammount of staff that it have sent, and with this we
> "knownly" lie about how much stuff we have write.
>
> How this can be "cleaner" than a timeout of 50ms is beyond me.
>    

If the migration loop is running for 50ms than we've already failed if 
the user requests a 30ms downtime for migration.  The 50ms timeout means 
that a VCPU can be blocked for 50ms at a time.

Regards,

Anthony Liguori

> On the other hand, this is the typical case where you need a lot of
> testing for each change that you did.  I thought several times that I
> had found the "guilty" bit for the stalls, and no luck, there was yet
> another problem.
>
> I also thought at points, ok, I can now drop the previous attempts, and
> no, that didn't work either.  This was the minimal set of patches able
> to fix the stalls.
>
> This is just very depressing.
>
> Later, Juan.
>

next prev parent reply	other threads:[~2010-12-01  4:46 UTC|newest]

Thread overview: 92+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-23 23:02 [Qemu-devel] [PATCH 00/10] Fix migration with lots of memory Juan Quintela
2010-11-23 23:02 ` [Qemu-devel] [PATCH 01/10] Add spent time to migration Juan Quintela
2010-11-23 23:02 ` [Qemu-devel] [PATCH 02/10] Add buffered_file_internal constant Juan Quintela
2010-11-24 10:40   ` [Qemu-devel] " Michael S. Tsirkin
2010-11-24 10:52     ` Juan Quintela
2010-11-24 11:04       ` Michael S. Tsirkin
2010-11-24 11:13         ` Juan Quintela
2010-11-24 11:19           ` Michael S. Tsirkin
     [not found]       ` <4CF46012.2060804@codemonkey.ws>
2010-11-30 11:56         ` Juan Quintela
2010-11-30 14:02           ` Anthony Liguori [this message]
2010-11-30 14:11             ` Michael S. Tsirkin
2010-11-30 14:22               ` Anthony Liguori
2010-11-30 15:40             ` Juan Quintela
2010-11-30 16:10               ` Michael S. Tsirkin
2010-11-30 16:32                 ` Juan Quintela
2010-11-30 16:44                   ` Anthony Liguori
2010-11-30 18:04                     ` Juan Quintela
2010-11-30 18:54                       ` Anthony Liguori
2010-11-30 19:15                         ` Juan Quintela
2010-11-30 20:23                           ` Anthony Liguori
2010-11-30 20:56                             ` Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 03/10] Add printf debug to savevm Juan Quintela
     [not found]   ` <4CF45AB2.7050506@codemonkey.ws>
2010-11-30 10:36     ` Stefan Hajnoczi
2010-11-30 22:40       ` [Qemu-devel] " Juan Quintela
2010-12-01  7:50         ` Stefan Hajnoczi
2010-11-23 23:03 ` [Qemu-devel] [PATCH 04/10] No need to iterate if we already are over the limit Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 05/10] KVM don't care about TLB handling Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 06/10] Only calculate expected_time for stage 2 Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 07/10] ram_save_remaining() returns an uint64_t Juan Quintela
     [not found]   ` <4CF45C0C.705@codemonkey.ws>
2010-11-30  7:21     ` [Qemu-devel] " Paolo Bonzini
2010-11-30 13:44       ` Anthony Liguori
2010-11-30 14:38     ` Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 08/10] Count nanoseconds with uint64_t not doubles Juan Quintela
2010-11-30  7:17   ` [Qemu-devel] " Paolo Bonzini
     [not found]   ` <4CF45C5B.9080507@codemonkey.ws>
2010-11-30 14:40     ` Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 09/10] Exit loop if we have been there too long Juan Quintela
2010-11-24 10:40   ` [Qemu-devel] " Michael S. Tsirkin
2010-11-24 11:01     ` Juan Quintela
2010-11-24 11:14       ` Michael S. Tsirkin
2010-11-24 15:16         ` Paolo Bonzini
2010-11-24 15:59           ` Michael S. Tsirkin
     [not found]           ` <4CF45E3F.4040609@codemonkey.ws>
2010-11-30  8:10             ` Paolo Bonzini
2010-11-30 13:26             ` Juan Quintela
     [not found]   ` <4CF45D67.5010906@codemonkey.ws>
2010-11-30  7:15     ` Paolo Bonzini
2010-11-30 13:47       ` Anthony Liguori
2010-11-30 13:47         ` [Qemu-devel] " Anthony Liguori
2010-11-30 13:58         ` Avi Kivity
2010-11-30 13:58           ` [Qemu-devel] " Avi Kivity
2010-11-30 14:17           ` Anthony Liguori
2010-11-30 14:17             ` [Qemu-devel] " Anthony Liguori
2010-11-30 14:27             ` Avi Kivity
2010-11-30 14:27               ` [Qemu-devel] " Avi Kivity
2010-11-30 14:50               ` Anthony Liguori
2010-11-30 14:50                 ` [Qemu-devel] " Anthony Liguori
2010-12-01 12:40                 ` Avi Kivity
2010-12-01 12:40                   ` [Qemu-devel] " Avi Kivity
2010-11-30 17:43               ` Juan Quintela
2010-11-30 17:43                 ` [Qemu-devel] " Juan Quintela
2010-12-01  1:20               ` Takuya Yoshikawa
2010-12-01  1:20                 ` [Qemu-devel] " Takuya Yoshikawa
2010-12-01  1:52                 ` Juan Quintela
2010-12-01  1:52                   ` [Qemu-devel] " Juan Quintela
2010-12-01  2:22                   ` Takuya Yoshikawa
2010-12-01  2:22                     ` [Qemu-devel] " Takuya Yoshikawa
2010-12-01 12:35                   ` Avi Kivity
2010-12-01 12:35                     ` [Qemu-devel] " Avi Kivity
2010-12-01 13:45                     ` Juan Quintela
2010-12-01 13:45                       ` [Qemu-devel] " Juan Quintela
2010-12-02  1:31                     ` Takuya Yoshikawa
2010-12-02  1:31                       ` [Qemu-devel] " Takuya Yoshikawa
2010-12-02  8:37                       ` Avi Kivity
2010-12-02  8:37                         ` [Qemu-devel] " Avi Kivity
2010-11-30 14:12         ` Paolo Bonzini
2010-11-30 14:12           ` [Qemu-devel] " Paolo Bonzini
2010-11-30 15:00           ` Anthony Liguori
2010-11-30 15:00             ` [Qemu-devel] " Anthony Liguori
2010-11-30 17:59             ` Juan Quintela
2010-11-30 17:59               ` [Qemu-devel] " Juan Quintela
2010-11-23 23:03 ` [Qemu-devel] [PATCH 10/10] Maintaing number of dirty pages Juan Quintela
     [not found]   ` <4CF45DE0.8020701@codemonkey.ws>
2010-11-30 14:46     ` [Qemu-devel] " Juan Quintela
2010-12-01 14:46       ` Avi Kivity
2010-12-01 15:51         ` Juan Quintela
2010-12-01 15:55           ` Anthony Liguori
2010-12-01 16:25             ` Juan Quintela
2010-12-01 16:33               ` Anthony Liguori
2010-12-01 16:43                 ` Avi Kivity
2010-12-01 16:49                   ` Anthony Liguori
2010-12-01 16:52                     ` Avi Kivity
2010-12-01 16:56                       ` Anthony Liguori
2010-12-01 17:01                         ` Avi Kivity
2010-12-01 17:05                           ` Anthony Liguori
2010-12-01 18:51                             ` Juan Quintela

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4CF50410.3080305@codemonkey.ws \
    --to=anthony@codemonkey.ws \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=quintela@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.