From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Juan Quintela <quintela@redhat.com>,
Eric Blake <eblake@redhat.com>,
Markus Armbruster <armbru@redhat.com>,
Samuel Ortiz <sameo@linux.intel.com>,
Sebastien Boeuf <sebastien.boeuf@intel.com>,
"James O . D . Hunt" <james.o.hunt@intel.com>,
Xu Wang <gnawux@gmail.com>, Peng Tao <bergwolf@gmail.com>,
Xiao Guangrong <xiaoguangrong@tencent.com>,
Xiao Guangrong <xiaoguangrong.eric@gmail.com>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH V5] migration: add capability to bypass the shared memory
Date: Thu, 19 Apr 2018 17:38:52 +0100 [thread overview]
Message-ID: <20180419163852.GE2730@work-vm> (raw)
In-Reply-To: <20180416150011.55916-1-jiangshanlai@gmail.com>
* Lai Jiangshan (jiangshanlai@gmail.com) wrote:
> 1) What's this
>
> When the migration capability 'bypass-shared-memory'
> is set, shared memory is bypassed during migration.
>
> It is the key feature enabling several advanced features in
> qemu, such as qemu-local-migration, qemu-live-update,
> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
> yet-another-post-copy-migration, etc.
>
> The philosophy behind this key feature, and the advanced
> features built on it, is that part of the memory management
> is separated out from qemu, so that other toolkits
> such as libvirt, kata-containers (https://github.com/kata-containers),
> runv (https://github.com/hyperhq/runv/) or several cooperating
> qemu commands can directly access it, manage it, and provide features on it.
>
> 2) Status in real world
>
> hyperhq (http://hyper.sh http://hypercontainer.io/)
> introduced the feature vm-template (vm-fast-live-clone)
> to the hyper container several years ago, and it works perfectly.
> (see https://github.com/hyperhq/runv/pull/297).
>
> The vm-template feature allows containers (VMs) to be
> started in 130 ms and saves 80 MB of memory for every
> container (VM), so hyper containers are as fast and
> as dense as normal containers.
>
> The kata-containers project (https://github.com/kata-containers),
> which was launched by hyper, intel and friends and which descended
> from runv (and clear-containers), should have this feature enabled.
> Unfortunately, due to code conflicts between runv & cc,
> the feature was temporarily disabled; it is being brought
> back by the hyper and intel teams.
>
> 3) How to use and bring up advanced features.
>
> With the current qemu command line, shared memory has
> to be configured via a memory-backend object.
>
> a) feature: qemu-local-migration, qemu-live-update
> Set mem-path on tmpfs and set share=on when starting
> the vm. Example:
> -object \
> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
> -numa node,nodeid=0,cpus=0-7,memdev=mem
>
> When you want to migrate the vm locally (after fixing a security bug
> in the qemu binary, or for other reasons), you can start a new qemu with
> the same command line plus -incoming, then migrate the
> vm from the old qemu to the new qemu with the migration capability
> 'bypass-shared-memory' set. The migration will migrate the device state
> *ONLY*; the memory remains the original memory backed by the tmpfs file.
>
> b) feature: extremely-fast-save-restore
> The same as above, but with the mem-path on a persistent file system.
>
> c) feature: vm-template, vm-fast-live-clone
> The template vm is started as in a), and paused when the guest reaches
> the template point (for example: the guest app is ready), then the template
> vm is saved. (The qemu process of the template can be killed now, because
> only the memory and the device-state files (in tmpfs) are needed.)
>
> Then we can launch one or more VMs based on the template vm state.
> The new VMs are started without "share=on"; all the new VMs share
> the initial memory from the memory file, which saves a lot of memory.
> All the new VMs start from the template point, so the guest app can get
> to work quickly.
>
> A new VM booted from the template vm can't become a template again;
> if you need this unusual chained-template feature, you could write
> a cloneable-tmpfs kernel module for it.
>
I've just tried doing something similar with this patch; it's really
interesting. I used LVM snapshotting for the RAM:
cd /dev/shm
fallocate -l 20G backingfile
losetup -f ./backingfile
pvcreate /dev/loop0
vgcreate ram /dev/loop0
lvcreate -L4G -nram1 ram /dev/loop0
qemu -M pc,accel=kvm -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ram1,share=on -numa node,memdev=mem -vnc :0 -drive file=my.qcow2,id=d,cache=none -monitor stdio
Boot the VM, and in the monitor do:
migrate_set_capability bypass-shared-memory on
migrate_set_speed 10G
migrate "exec:cat > migstream1"
q
then:
lvcreate -n ramsnap1 -s ram/ram1 -L4G
qemu -M pc,accel=kvm -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ramsnap1,share=on -numa node,memdev=mem -vnc :0 -drive file=my.qcow2,id=d,cache=none -monitor stdio -snapshot -incoming "exec:cat migstream1"
lvcreate -n ramsnap2 -s ram/ram1 -L4G
qemu -M pc,accel=kvm -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ramsnap2,share=on -numa node,memdev=mem -vnc :1 -drive file=my.qcow2,id=d,cache=none -monitor stdio -snapshot -incoming "exec:cat migstream1"
and I've got two separate instances of qemu restored from that stream.
It seems to work; I wonder if we ever need things like msync() or
similar?
I've not tried creating a 2nd template with this.
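For reference, the tmpfs-based flow from c) above could be sketched as a plain transcript (the file names, the wrapper paths and the elided qemu options are illustrative, not from the patch):

```shell
# --- Save the template (sketch; paths are illustrative) ---
# Start the template VM with its RAM backed by a shared tmpfs file:
qemu -M pc,accel=kvm -m 4G \
  -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm/template-ram,share=on \
  -numa node,memdev=mem -monitor stdio ...
# In the monitor, once the guest reaches the template point:
#   migrate_set_capability bypass-shared-memory on
#   migrate "exec:cat > /dev/shm/template-devstate"
# The template qemu can now be killed; only the RAM file and the
# device-state file, both in tmpfs, are needed from here on.

# --- Clone a new VM from the template ---
# Each clone drops share=on, so its RAM file mapping is private:
# writes go to anonymous memory, unmodified pages keep sharing the
# tmpfs page cache with every other clone.
qemu -M pc,accel=kvm -m 4G \
  -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm/template-ram \
  -numa node,memdev=mem -monitor stdio \
  -incoming "exec:cat /dev/shm/template-devstate" ...
```

The private (non-shared) mapping on the clones is where the per-container memory saving quoted above comes from; it is also why a clone cannot itself become a template without extra kernel support.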
> The libvirt toolkit can't manage vm-template currently; in
> hyperhq/runv, we use a qemu wrapper script to do it. I hope someone adds
> a "libvirt-managed template" feature to libvirt.
>
> d) feature: yet-another-post-copy-migration
> It is a possible feature; no toolkit can do it well yet.
> Using an nbd server/client on the memory file is just about OK but
> inconvenient. A special feature for tmpfs might be needed to
> fully realize this feature.
> No one needs yet another post-copy migration method today,
> but it is possible should someone ever need it.
>
> Cc: Juan Quintela <quintela@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Samuel Ortiz <sameo@linux.intel.com>
> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: James O. D. Hunt <james.o.hunt@intel.com>
> Cc: Xu Wang <gnawux@gmail.com>
> Cc: Peng Tao <bergwolf@gmail.com>
> Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
> Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
> Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
> ---
>
> Changes in V5:
> check cappability conflict in migrate_caps_check()
>
> Changes in V4:
> fixes checkpatch.pl errors
>
> Changes in V3:
> rebased on upstream master
> update the available version of the capability to
> v2.13
>
> Changes in V2:
> rebased on 2.11.1
>
> migration/migration.c | 22 ++++++++++++++++++++++
> migration/migration.h | 1 +
> migration/ram.c | 27 ++++++++++++++++++---------
> qapi/migration.json | 6 +++++-
> 4 files changed, 46 insertions(+), 10 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 52a5092add..110b40f6d4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -736,6 +736,19 @@ static bool migrate_caps_check(bool *cap_list,
> return false;
> }
>
> + if (cap_list[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY]) {
> + /* Bypass and postcopy are conflicting ways
> + * to get memory to the destination, and there
> + * is currently no code to discriminate between them
> + * and handle the conflict. That should be possible
> + * to fix, but it is generally useless to use both
> + * ways together anyway.
> + */
> + error_setg(errp, "Bypass is not currently compatible "
> + "with postcopy");
> + return false;
> + }
> +
Good.
> /* This check is reasonably expensive, so only when it's being
> * set the first time, also it's only the destination that needs
> * special support.
> @@ -1509,6 +1522,15 @@ bool migrate_release_ram(void)
> return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
> }
>
> +bool migrate_bypass_shared_memory(void)
> +{
> + MigrationState *s;
> +
> + s = migrate_get_current();
> +
> + return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
> +}
> +
> bool migrate_postcopy_ram(void)
> {
> MigrationState *s;
> diff --git a/migration/migration.h b/migration/migration.h
> index 8d2f320c48..cfd2513ef0 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
>
> bool migrate_postcopy(void);
>
> +bool migrate_bypass_shared_memory(void);
> bool migrate_release_ram(void);
> bool migrate_postcopy_ram(void);
> bool migrate_zero_blocks(void);
> diff --git a/migration/ram.c b/migration/ram.c
> index 0e90efa092..bca170c386 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
> unsigned long *bitmap = rb->bmap;
> unsigned long next;
>
> + /* This ramblock is being bypassed; it has no dirty bitmap */
> + if (!bitmap) {
> + return size;
> + }
> +
> if (rs->ram_bulk_stage && start > 0) {
> next = start + 1;
> } else {
> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
> qemu_mutex_lock(&rs->bitmap_mutex);
> rcu_read_lock();
> RAMBLOCK_FOREACH(block) {
> - migration_bitmap_sync_range(rs, block, 0, block->used_length);
> + if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
> + migration_bitmap_sync_range(rs, block, 0, block->used_length);
> + }
> }
> rcu_read_unlock();
> qemu_mutex_unlock(&rs->bitmap_mutex);
> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
> qemu_mutex_init(&(*rsp)->src_page_req_mutex);
> QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
>
> - /*
> - * Count the total number of pages used by ram blocks not including any
> - * gaps due to alignment or unplugs.
> - */
> - (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
> -
> ram_state_reset(*rsp);
>
> return 0;
> }
>
> -static void ram_list_init_bitmaps(void)
> +static void ram_list_init_bitmaps(RAMState *rs)
> {
> RAMBlock *block;
> unsigned long pages;
> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
> /* Skip setting bitmap if there is no RAM */
> if (ram_bytes_total()) {
> QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> + if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
> + continue;
> + }
> pages = block->max_length >> TARGET_PAGE_BITS;
> block->bmap = bitmap_new(pages);
> bitmap_set(block->bmap, 0, pages);
> + /*
> + * Count the total number of pages used by ram blocks not
> + * including any gaps due to alignment or unplugs.
> + */
> + rs->migration_dirty_pages += pages;
> if (migrate_postcopy_ram()) {
> block->unsentmap = bitmap_new(pages);
> bitmap_set(block->unsentmap, 0, pages);
Can you please rework this to combine with Cédric Le Goater's
'discard non-migratable RAMBlocks' series - it's quite similar to what
you're trying to do, but for a different reason. If you look at the v2
from April 13, I think you can just find somewhere to clear the
RAM_MIGRATABLE flag.
One thing I noticed: in my world I've got some code that checks whether
we ever do a RAM iteration that finds no dirty pages but still has
migration_dirty_pages non-zero; and with this patch I'm seeing that
check trigger:
ram_find_and_save_block: no page found, yet dirty_pages=480
It doesn't seem to trigger without the patch.
Dave
> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
> qemu_mutex_lock_ramlist();
> rcu_read_lock();
>
> - ram_list_init_bitmaps();
> + ram_list_init_bitmaps(rs);
> memory_global_dirty_log_start();
> migration_bitmap_sync(rs);
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 9d0bf82cf4..45326480bd 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -357,13 +357,17 @@
> # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
> # (since 2.12)
> #
> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
> +# This feature allows the memory region to be reused by new qemu(s)
> +# or be migrated separately. (since 2.13)
> +#
> # Since: 1.2
> ##
> { 'enum': 'MigrationCapability',
> 'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> 'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> 'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> - 'dirty-bitmaps' ] }
> + 'dirty-bitmaps', 'bypass-shared-memory' ] }
>
> ##
> # @MigrationCapabilityStatus:
> --
> 2.15.1 (Apple Git-101)
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK