From: Jagane Sundar <jagane@sundar.org>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Jes Sorensen <Jes.Sorensen@redhat.com>,
	Anthony Liguori <aliguori@us.ibm.com>,
	Badari Pulavarty <pbadari@us.ibm.com>
Subject: Re: A Live Backup feature for KVM
Date: Mon, 25 Apr 2011 20:31:17 -0700
Message-ID: <4DB63C85.4020904@sundar.org>
In-Reply-To: <BANLkTimYvQ-SbE6ASE5gBMcB9wqi+dbO3w@mail.gmail.com>

On 4/25/2011 6:34 AM, Stefan Hajnoczi wrote:
> On Mon, Apr 25, 2011 at 9:16 AM, Jagane Sundar<jagane@sundar.org>  wrote:
>> The direction that I chose to go is slightly different. In both of the
>> proposals you pointed me at, the original virtual disk is made
>> read-only and the VM writes to a different COW file. After backup
>> of the original virtual disk file is complete, the COW file is merged
>> with the original vdisk file.
>>
>> Instead, I create an Original-Blocks-COW-file to store the original
>> blocks that are overwritten by the VM every time the VM performs
>> a write while the backup is in progress. Livebackup copies these
>> underlying blocks from the original virtual disk file before the VM's
>> write to the original virtual disk file is scheduled. The advantage of
>> this is that there is no merge necessary at the end of the backup, we
>> can simply delete the Original-Blocks-COW-file.
> The advantage of the approach that redirects writes to a new file
> instead is that the heavy work of copying data is done asynchronously
> during the merge operation instead of in the write path which will
> impact guest performance.
>
> Here's what I understand:
>
> 1. User takes a snapshot of the disk, QEMU creates old-disk.img backed
> by the current-disk.img.
> 2. Guest issues a write A.
> 3. QEMU reads B from current-disk.img.
> 4. QEMU writes B to old-disk.img.
> 5. QEMU writes A to current-disk.img.
> 6. Guest receives write completion A.
>
> The tricky thing is what happens if there is a failure after Step 5.
> If writes A and B were unstable writes (no fsync()) then no ordering
> is guaranteed and perhaps write A reached current-disk.img but write B
> did not reach old-disk.img.  In this case we no longer have a
> consistent old-disk.img snapshot - we're left with an updated
> current-disk.img and old-disk.img does not have a copy of the old
> data.

Both approaches perform the same number of I/O operations per guest write:

WRITES_TO_NEW_FILE_APPROACH
<<Create snapshot>>
- As new writes from the VM come in:
1. Write to new-disk.img
Asynchronously:
a. Read from new-disk.img
b. Write into old-disk.img
<<Delete snapshot>>

WRITES_TO_CURRENT_FILE_APPROACH
<<Create snapshot>>
- As new writes from the VM come in:
1. Read old block from current-disk.img
2. Write old block to old-disk.img
3. Write new block to current-disk.img
<<Delete snapshot>>

The number of I/O operations is 2 writes and 1 read, in both cases.
The critical factor, then, is the duration for which the VM must
maintain the snapshot.
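
To make the count concrete, here is a minimal Python sketch that models both
write paths and tallies reads and writes for a single guest write. It is an
illustration only: in-memory dicts stand in for the image files, and the
names CountingDisk, writes_to_new_file and writes_to_current_file are
hypothetical, not taken from the livebackup code.

```python
from collections import Counter


class CountingDisk:
    """In-memory stand-in for an image file that tallies I/O operations."""

    def __init__(self, ops, blocks=None):
        self.blocks = dict(blocks or {})
        self.ops = ops  # shared Counter with "read" and "write" keys

    def read(self, n):
        self.ops["read"] += 1
        return self.blocks.get(n, b"\0")

    def write(self, n, data):
        self.ops["write"] += 1
        self.blocks[n] = data


def writes_to_new_file(block, data):
    """Redirect the guest write to new-disk.img, then merge it back later."""
    ops = Counter()
    old_disk = CountingDisk(ops, {block: b"old"})
    new_disk = CountingDisk(ops)
    new_disk.write(block, data)                  # 1. guest write path
    old_disk.write(block, new_disk.read(block))  # a, b. asynchronous merge
    return ops, old_disk.blocks[block]


def writes_to_current_file(block, data):
    """Save the original block to old-disk.img, then write in place."""
    ops = Counter()
    current_disk = CountingDisk(ops, {block: b"old"})
    old_disk = CountingDisk(ops)
    old_disk.write(block, current_disk.read(block))  # 1, 2. save old block
    current_disk.write(block, data)                  # 3. guest write
    return ops, current_disk.blocks[block], old_disk.blocks[block]
```

Both functions report one read and two writes. The difference is which of
those operations sit in the guest's write path and for how long the snapshot
has to exist.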

> The solution is to fsync() after Step 4 and before Step 5 but this
> will hurt performance.  We now have an extra read, write, and fsync()
> on every write.
>
I agree - fsync() defeats the whole purpose of building a super-efficient
live backup mechanism. I'm not planning to introduce fsync()s.
However, I want to treat the snapshot as a limited snapshot, only for backup
purposes. In my proposal, old-disk.img is valid only while the
livebackup client is connected to qemu and transferring the blocks for
that backup. If the disk suffers an intermittent failure after Step 5,
the snapshot is deemed inconsistent and discarded.
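
A rough sketch of that "limited snapshot" policy, in Python for brevity. The
class LimitedSnapshot and the helper next_backup_kind are hypothetical names
used only for illustration; the sketch assumes that any detected failure
simply invalidates the snapshot and that the next backup is then a full one.

```python
class LimitedSnapshot:
    """Backup-only snapshot that is thrown away on any failure."""

    def __init__(self):
        self.original_blocks = {}  # block number -> data before overwrite
        self.valid = True

    def save_original(self, block, data):
        # Only the first overwrite of a block matters: later overwrites
        # would save data that is newer than the snapshot point.
        if self.valid:
            self.original_blocks.setdefault(block, data)

    def discard(self):
        # Called when a failure makes the saved blocks untrustworthy.
        self.valid = False
        self.original_blocks.clear()


def next_backup_kind(snapshot):
    """An invalid snapshot forces the client back to a full backup."""
    return "incremental" if snapshot.valid else "full"
```

The running VM's disk is never touched by discard(); only the saved original
blocks are dropped.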

>> I have some reasons to believe that the Original-Blocks-COW-file
>> design that I am putting forth might work better. I have listed them
>> below. (It's past midnight here, so pardon me if it sounds garbled -- I
>> will try to clarify more in a writeup on wiki.qemu.org).
>> Let me know what your thoughts are..
>>
>> I feel that the livebackup mechanism will impact the running VM
>> less. For example, if something goes wrong with the backup process,
>> then we can simply delete the Original-Blocks-COW-file and force
>> the backup client to do a full backup the next time around. The
>> running VM or its virtual disks are not impacted at all.
> Abandoning snapshots is not okay.  Snapshots will be used in scenarios
> beyond backup and I don't think we can make them
> unreliable/throw-away.
>
My proposal is to treat the snapshot as an entity specific to livebackup
that exists only for the duration of the livebackup_client's connection to
qemu to transfer the blocks over. At other times there is no snapshot, just
a dirty blocks bitmap indicating which blocks were modified since the last
backup was taken.
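
The bitmap handling can be sketched as follows. This is a minimal Python
illustration, not the livebackup data structure: DirtyBitmap and BackupState
are hypothetical names, one bit per block is an assumption, and the waiting
for outstanding async I/O that precedes the swap is left out.

```python
class DirtyBitmap:
    """One bit per block; a set bit means the block changed since the last backup."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 7) // 8)

    def mark(self, block):
        self.bits[block // 8] |= 1 << (block % 8)

    def is_dirty(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def dirty_blocks(self):
        return [b for b in range(self.num_blocks) if self.is_dirty(b)]


class BackupState:
    """Per-disk state: a live bitmap, plus a frozen one while a backup runs."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.live = DirtyBitmap(num_blocks)
        self.snapshot = None

    def guest_write(self, block):
        self.live.mark(block)

    def create_snapshot(self):
        # Freeze the current bitmap for the client and start a fresh one.
        self.snapshot = self.live
        self.live = DirtyBitmap(self.num_blocks)
        return self.snapshot.dirty_blocks()

    def destroy_snapshot(self):
        self.snapshot = None
```

Writes that arrive while the client is transferring blocks land in the fresh
bitmap, so they show up in the next incremental backup rather than the
current one.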

Consider the use case of daily incremental backups:

WRITES_TO_NEW_FILE_APPROACH
- 1:00 AM Create snapshot A
<<<24 hours go by. All writes by the VM
during this time are stored in the new-disk.img file.>>>
- 1 AM next day, the backup program starts copying its
   incremental backup blocks, i.e. the blocks that were modified
   in the last 24 hours, and are all stored in new-disk.img
- 1:15 AM Merge snapshot A
<<< The asynchronous process now kicks in, and starts merging
the blocks from new-disk.img into the old-disk.img>>>
- 1:15 AM Create snapshot B

WRITES_TO_CURRENT_FILE_APPROACH
- 1:00 AM livebackup_client connects to qemu and creates snapshot
- livebackup_client starts transferring blocks modified by VM
   in the last 24 hours over the network to the backup server.
   Let's say that this takes about 15 minutes.
- While livebackup_client is transferring blocks, any new writes
   sent by the VM cause qemu to read a copy of the old block that
   is going to be overwritten, and write it out to old-disk.img
- 1:15 AM livebackup_client sends destroy_snapshot cmd to qemu.
   qemu deletes old-disk.img.


So, my understanding is that the snapshot lasts for 24 hours in
the case of the WRITES_TO_NEW_FILE_APPROACH, and the snapshot
lasts for 15 to 30 minutes in the case of the
WRITES_TO_CURRENT_FILE_APPROACH.

The effect is this: the WRITES_TO_CURRENT_FILE_APPROACH needs
to do the ReadOldBlock-WriteOldBlockToCOW-WriteNewBlockToBaseFile
sequence for about 15 to 30 minutes of the day.

In contrast, the WRITES_TO_NEW_FILE_APPROACH needs to do the
WriteNewBlockToCOWFile-ReadNewBlockFromCOWFile-WriteNewBlockToBaseFile
sequence for almost the entire duration of the daily backup schedule.
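
As a back-of-the-envelope check, the share of the day spent with a snapshot
active can be computed directly. The sketch below is Python and uses only the
figures above; treating the new-file snapshot as lasting the full 24 hours is
an upper-bound assumption, and snapshot_fraction is a hypothetical helper.

```python
MINUTES_PER_DAY = 24 * 60


def snapshot_fraction(snapshot_minutes):
    """Fraction of a day during which the snapshot is active."""
    return snapshot_minutes / MINUTES_PER_DAY


cases = [
    ("WRITES_TO_CURRENT_FILE_APPROACH, 15 min", 15),
    ("WRITES_TO_CURRENT_FILE_APPROACH, 30 min", 30),
    ("WRITES_TO_NEW_FILE_APPROACH, 24 h", MINUTES_PER_DAY),
]
for label, minutes in cases:
    print(f"{label}: {snapshot_fraction(minutes):.1%} of the day")
```

This prints 1.0%, 2.1% and 100.0% respectively, so the extra read and write
per guest write apply to roughly 1 to 2 percent of the day in one approach
and to nearly all of it in the other.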


>> Livebackup includes a rudimentary network protocol to transfer
>> the modified blocks to a livebackup_client. It supports incremental
>> backups. Also, livebackup treats a backup as containing all the virtual
>> disks of a VM. Hence a snapshot in livebackup terms refers to a
>> snapshot of all the virtual disks.
>>
>> The approximate sequence of operation is as follows:
>> 1. VM boots up. When bdrv_open_common opens any file backed
>>     virtual disk, it checks for a file called <base_file>.livebackupconf.
>>     If such a file exists, then the virtual disk is part of the backup set,
>>     and a chunk of memory is allocated to keep track of dirty blocks.
>> 2. qemu starts up a livebackup thread that listens on a specified port
>>     (e.g. port 7900) for connections from the livebackup client.
>> 3. The livebackup_client connects to qemu at port 7900.
>> 4. livebackup_client sends a 'do snapshot' command.
>> 5. qemu waits 30 seconds for outstanding asynchronous I/O to complete.
>> 6. When there are no more outstanding async I/O requests, qemu
>>     copies the dirty_bitmap to its snapshot structure and starts a new dirty
>>     bitmap.
>> 7. livebackup_client starts iterating through the list of dirty blocks, and
>>     starts saving these blocks to the backup image
>> 8. When all blocks have been backed up, then the backup_client sends a
>>     destroy snapshot command; the server simply deletes the
>>     Original-Blocks-COW-files for each of the virtual disks and frees the
>>     calloc'd memory holding the dirty blocks list.
> I think there's a benefit to just pointing at
> Original-Blocks-COW-files and letting the client access it directly.
> This even works with shared storage where the actual backup work is
> performed on another host via access to a shared network filesystem or
> LUN.  It may not be desirable to send everything over the network.
>
>
> Perhaps you made a custom network client because you are writing a
> full-blown backup solution for KVM?  In that case it's your job to
> move the data around and get it backed up.  But from QEMU's point of
> view we just need to provide the data and it's up to the backup
> software to send it over the network and do its magic.
>
My thoughts on using a built-in backup protocol versus just exposing
the snapshot as a file are as follows.
You are correct, I am looking to build a full backup solution.
I also want one that is very cloud friendly. I want OpenStack to
use livebackup to provide many more types of VMs in an
IaaS cloud: Highly Available VMs using external storage,
ephemeral VMs using non-backed-up local disk, and good-enough
High Availability using an hourly backed-up VM. I will readily
admit that my focus is not on pushing part of the backup task
off to an external storage device; it is more on making a
self-contained, highly efficient, scriptable backup solution
that can be managed by the cloud controller.

There are advantages and disadvantages to inventing a
custom backup protocol. I investigated NDMP, which seemed
like overkill and a poor fit for my purpose. If we
end up going with the custom backup protocol, I will need
to add authentication and encryption.

>> I have pushed my code to the following git tree.
>> git://github.com/jagane/qemu-kvm-livebackup.git
>>
>> It started as a clone of the linux kvm tree at:
>>
>> git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
>>
>> If you want to look at the code, see livebackup.[ch] and livebackup_client.c
> In terms of submitting patches it's best to target qemu.git instead of
> qemu-kvm.git since this feature really isn't Linux KVM-specific.
> There are also efforts to merge qemu-kvm.git into qemu.git, so we
> shouldn't increase the patch delta.
>
This is very good. I did not realize that qemu-kvm had grown so close to
qemu. My preference would, of course, be to contribute to qemu and
have that flow to qemu-kvm.
>> This is very much a work in progress, and I expect to do a lot of
>> testing/debugging over the next few weeks. I will also create a
>> detailed proposal on wiki.qemu.org, with much more information.
> Excellent.  Let's get Jes to join the discussion since he's been most
> hands-on with block device snapshots.  It's Easter holiday time so
> perhaps later this week or next week more people will be around.
>
Would love to get Jes's feedback, Stefan.

Thanks,
Jagane
