From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Juan Quintela <quintela@redhat.com>,
Eric Blake <eblake@redhat.com>,
Markus Armbruster <armbru@redhat.com>,
Samuel Ortiz <sameo@linux.intel.com>,
Sebastien Boeuf <sebastien.boeuf@intel.com>,
"James O . D . Hunt" <james.o.hunt@intel.com>,
Xu Wang <gnawux@gmail.com>, Peng Tao <bergwolf@gmail.com>,
Xiao Guangrong <xiaoguangrong@tencent.com>,
Xiao Guangrong <xiaoguangrong.eric@gmail.com>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH V5] migration: add capability to bypass the shared memory
Date: Thu, 19 Apr 2018 17:38:52 +0100 [thread overview]
Message-ID: <20180419163852.GE2730@work-vm> (raw)
In-Reply-To: <20180416150011.55916-1-jiangshanlai@gmail.com>
* Lai Jiangshan (jiangshanlai@gmail.com) wrote:
> 1) What's this
>
> When the migration capability 'bypass-shared-memory'
> is set, shared memory is bypassed during migration.
>
> It is the key feature enabling several advanced features in
> qemu, such as qemu-local-migration, qemu-live-update,
> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
> yet-another-post-copy-migration, etc.
>
> The philosophy behind this key feature, and the advanced
> features built on it, is that part of the memory management
> is separated out from qemu, so that other toolkits
> such as libvirt, kata-containers (https://github.com/kata-containers),
> runv (https://github.com/hyperhq/runv/) or several cooperating
> qemu commands can directly access it, manage it, and provide features on it.
>
> 2) Status in real world
>
> hyperhq (http://hyper.sh http://hypercontainer.io/)
> introduced the feature vm-template (vm-fast-live-clone)
> to the hyper container several years ago, and it works perfectly.
> (see https://github.com/hyperhq/runv/pull/297).
>
> The vm-template feature allows containers (VMs) to be
> started in 130 ms and saves 80 MB of memory for every
> container (VM), so hyper containers are as fast and
> as dense as normal containers.
>
> The kata-containers project (https://github.com/kata-containers),
> which was launched by hyper, intel and friends and which descended
> from runv (and clear-containers), should have this feature enabled.
> Unfortunately, due to code conflicts between runv & cc,
> the feature was temporarily disabled; it is being brought
> back by the hyper and intel teams.
>
> 3) How to use and bring up advanced features.
>
> With the current qemu command line, shared memory has
> to be configured via a memory-backend object.
>
> a) feature: qemu-local-migration, qemu-live-update
> Set mem-path on tmpfs and set share=on when starting
> the vm. Example:
> -object \
> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
> -numa node,nodeid=0,cpus=0-7,memdev=mem
>
> When you want to migrate the vm locally (after fixing a security bug
> in the qemu binary, or for other reasons), you can start a new qemu with
> the same command line plus -incoming, then migrate the
> vm from the old qemu to the new qemu with the migration capability
> 'bypass-shared-memory' set. The migration will migrate the device state
> *ONLY*; the memory remains the original memory backed by the tmpfs file.
>
> b) feature: extremely-fast-save-restore
> The same as above, but with the mem-path on a persistent file system.
>
> c) feature: vm-template, vm-fast-live-clone
> The template vm is started as in a), and paused when the guest reaches
> the template point (for example: the guest app is ready), then the template
> vm is saved. (The qemu process of the template can be killed now, because
> only the memory and the device-state files (in tmpfs) are needed.)
>
> Then we can launch one or more VMs based on the template vm state.
> The new VMs are started without "share=on"; all the new VMs share
> the initial memory from the memory file, which saves a lot of memory.
> All the new VMs start from the template point, so the guest app can get
> to work quickly.
>
> A new VM booted from the template vm can't become a template again;
> if you need this unusual chained-template feature, you could write
> a cloneable-tmpfs kernel module for it.
>
I've just tried doing something similar with this patch; it's really
interesting. I used LVM snapshotting for the RAM:
cd /dev/shm
fallocate -l 20G backingfile
losetup -f ./backingfile
pvcreate /dev/loop0
vgcreate ram /dev/loop0
lvcreate -L4G -nram1 ram /dev/loop0
qemu -M pc,accel=kvm -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ram1,share=on -numa node,memdev=mem -vnc :0 -drive file=my.qcow2,id=d,cache=none -monitor stdio
Boot the VM, and in the monitor do:
migrate_set_capability bypass-shared-memory on
migrate_set_speed 10G
migrate "exec:cat > migstream1"
q
then:
lvcreate -n ramsnap1 -s ram/ram1 -L4G
qemu -M pc,accel=kvm -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ramsnap1,share=on -numa node,memdev=mem -vnc :0 -drive file=my.qcow2,id=d,cache=none -monitor stdio -snapshot -incoming "exec:cat migstream1"
lvcreate -n ramsnap2 -s ram/ram1 -L4G
qemu -M pc,accel=kvm -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ramsnap2,share=on -numa node,memdev=mem -vnc :1 -drive file=my.qcow2,id=d,cache=none -monitor stdio -snapshot -incoming "exec:cat migstream1"
and I've got two separate instances of qemu restored from that stream.
It seems to work; I wonder if we ever need things like msync() or
similar?
I've not tried creating a 2nd template with this.
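For reference, the tmpfs-based flow from c) above could be sketched as a plain transcript (the file names, the wrapper paths and the elided qemu options are illustrative, not from the patch):

```shell
# --- Save the template (sketch; paths are illustrative) ---
# Start the template VM with its RAM backed by a shared tmpfs file:
qemu -M pc,accel=kvm -m 4G \
  -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm/template-ram,share=on \
  -numa node,memdev=mem -monitor stdio ...
# In the monitor, once the guest reaches the template point:
#   migrate_set_capability bypass-shared-memory on
#   migrate "exec:cat > /dev/shm/template-devstate"
# The template qemu can now be killed; only the RAM file and the
# device-state file, both in tmpfs, are needed from here on.

# --- Clone a new VM from the template ---
# Each clone drops share=on, so its RAM file mapping is private:
# writes go to anonymous memory, unmodified pages keep sharing the
# tmpfs page cache with every other clone.
qemu -M pc,accel=kvm -m 4G \
  -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm/template-ram \
  -numa node,memdev=mem -monitor stdio \
  -incoming "exec:cat /dev/shm/template-devstate" ...
```

The private (non-shared) mapping on the clones is where the per-container memory saving quoted above comes from; it is also why a clone cannot itself become a template without extra kernel support.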
> The libvirt toolkit can't manage vm-template currently; in
> hyperhq/runv, we use a qemu wrapper script to do it. I hope someone adds
> a "libvirt-managed template" feature to libvirt.
>
> d) feature: yet-another-post-copy-migration
> It is a possible feature; no toolkit can do it well yet.
> Using an nbd server/client on the memory file is just about OK but
> inconvenient. A special feature for tmpfs might be needed to
> fully realize this feature.
> No one needs yet another post-copy migration method today,
> but it is possible should someone ever need it.
>
> Cc: Juan Quintela <quintela@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Samuel Ortiz <sameo@linux.intel.com>
> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: James O. D. Hunt <james.o.hunt@intel.com>
> Cc: Xu Wang <gnawux@gmail.com>
> Cc: Peng Tao <bergwolf@gmail.com>
> Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
> Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
> Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
> ---
>
> Changes in V5:
> check cappability conflict in migrate_caps_check()
>
> Changes in V4:
> fixes checkpatch.pl errors
>
> Changes in V3:
> rebased on upstream master
> update the available version of the capability to
> v2.13
>
> Changes in V2:
> rebased on 2.11.1
>
> migration/migration.c | 22 ++++++++++++++++++++++
> migration/migration.h | 1 +
> migration/ram.c | 27 ++++++++++++++++++---------
> qapi/migration.json | 6 +++++-
> 4 files changed, 46 insertions(+), 10 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 52a5092add..110b40f6d4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -736,6 +736,19 @@ static bool migrate_caps_check(bool *cap_list,
> return false;
> }
>
> + if (cap_list[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY]) {
> + /* Bypass and postcopy are conflicting ways
> + * to get memory to the destination, and there
> + * is currently no code to discriminate between them
> + * and handle the conflict. That should be possible
> + * to fix, but it is generally useless to use both
> + * ways together anyway.
> + */
> + error_setg(errp, "Bypass is not currently compatible "
> + "with postcopy");
> + return false;
> + }
> +
Good.
> /* This check is reasonably expensive, so only when it's being
> * set the first time, also it's only the destination that needs
> * special support.
> @@ -1509,6 +1522,15 @@ bool migrate_release_ram(void)
> return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
> }
>
> +bool migrate_bypass_shared_memory(void)
> +{
> + MigrationState *s;
> +
> + s = migrate_get_current();
> +
> + return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
> +}
> +
> bool migrate_postcopy_ram(void)
> {
> MigrationState *s;
> diff --git a/migration/migration.h b/migration/migration.h
> index 8d2f320c48..cfd2513ef0 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
>
> bool migrate_postcopy(void);
>
> +bool migrate_bypass_shared_memory(void);
> bool migrate_release_ram(void);
> bool migrate_postcopy_ram(void);
> bool migrate_zero_blocks(void);
> diff --git a/migration/ram.c b/migration/ram.c
> index 0e90efa092..bca170c386 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
> unsigned long *bitmap = rb->bmap;
> unsigned long next;
>
> + /* This ramblock is being bypassed; it has no dirty bitmap */
> + if (!bitmap) {
> + return size;
> + }
> +
> if (rs->ram_bulk_stage && start > 0) {
> next = start + 1;
> } else {
> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
> qemu_mutex_lock(&rs->bitmap_mutex);
> rcu_read_lock();
> RAMBLOCK_FOREACH(block) {
> - migration_bitmap_sync_range(rs, block, 0, block->used_length);
> + if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
> + migration_bitmap_sync_range(rs, block, 0, block->used_length);
> + }
> }
> rcu_read_unlock();
> qemu_mutex_unlock(&rs->bitmap_mutex);
> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
> qemu_mutex_init(&(*rsp)->src_page_req_mutex);
> QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
>
> - /*
> - * Count the total number of pages used by ram blocks not including any
> - * gaps due to alignment or unplugs.
> - */
> - (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
> -
> ram_state_reset(*rsp);
>
> return 0;
> }
>
> -static void ram_list_init_bitmaps(void)
> +static void ram_list_init_bitmaps(RAMState *rs)
> {
> RAMBlock *block;
> unsigned long pages;
> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
> /* Skip setting bitmap if there is no RAM */
> if (ram_bytes_total()) {
> QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> + if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
> + continue;
> + }
> pages = block->max_length >> TARGET_PAGE_BITS;
> block->bmap = bitmap_new(pages);
> bitmap_set(block->bmap, 0, pages);
> + /*
> + * Count the total number of pages used by ram blocks not
> + * including any gaps due to alignment or unplugs.
> + */
> + rs->migration_dirty_pages += pages;
> if (migrate_postcopy_ram()) {
> block->unsentmap = bitmap_new(pages);
> bitmap_set(block->unsentmap, 0, pages);
Can you please rework this to combine with Cédric Le Goater's
'discard non-migratable RAMBlocks' series - it's quite similar to what
you're trying to do, but for a different reason. If you look at the v2
from April 13, I think you can just find somewhere to clear the
RAM_MIGRATABLE flag.
One thing I noticed: in my world I've got some code that checks whether
we ever do a RAM iteration that finds no dirty pages but still has
migration_dirty_pages non-zero; and with this patch I'm seeing that
check trigger:
ram_find_and_save_block: no page found, yet dirty_pages=480
It doesn't seem to trigger without the patch.
Dave
> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
> qemu_mutex_lock_ramlist();
> rcu_read_lock();
>
> - ram_list_init_bitmaps();
> + ram_list_init_bitmaps(rs);
> memory_global_dirty_log_start();
> migration_bitmap_sync(rs);
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 9d0bf82cf4..45326480bd 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -357,13 +357,17 @@
> # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
> # (since 2.12)
> #
> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
> +# This feature allows the memory region to be reused by new qemu(s)
> +# or be migrated separately. (since 2.13)
> +#
> # Since: 1.2
> ##
> { 'enum': 'MigrationCapability',
> 'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> 'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> 'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> - 'dirty-bitmaps' ] }
> + 'dirty-bitmaps', 'bypass-shared-memory' ] }
>
> ##
> # @MigrationCapabilityStatus:
> --
> 2.15.1 (Apple Git-101)
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK