* [Qemu-devel] [PATCH RFC 1/4] Curling: add doc
2013-09-10 3:43 [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance Jules Wang
@ 2013-09-10 3:43 ` Jules Wang
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface Jules Wang
` (3 subsequent siblings)
4 siblings, 0 replies; 20+ messages in thread
From: Jules Wang @ 2013-09-10 3:43 UTC (permalink / raw)
To: qemu-devel; +Cc: quintela, owasserm, Jules Wang, stefanha, pbonzini
Curling provides a fault tolerance mechanism for KVM.
For more info, see 'doc/curling.txt'.
Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
---
docs/curling.txt | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
create mode 100644 docs/curling.txt
diff --git a/docs/curling.txt b/docs/curling.txt
new file mode 100644
index 0000000..dace6db
--- /dev/null
+++ b/docs/curling.txt
@@ -0,0 +1,52 @@
+KVM Fault Tolerance Specification
+=================================
+
+
+Contents:
+=========
+* Introduction
+* Usage
+* Design & Implementation
+* Performance
+
+Introduction
+============
+The goal of Curling (named after the sport) is to provide a fault tolerance
+(ft for short) mechanism for KVM, so that in the event of a hardware failure,
+the virtual machine fails over to the backup in a way that is completely
+transparent to the guest operating system.
+
+
+Usage
+=====
+The steps of curling are the same as those of live migration, except for the
+following:
+1. Start the receiver vm with -incoming curling:tcp:<address>:<port>
+2. Start ft in the qemu monitor of the sender vm with the following commands:
+ > migrate_set_speed <full bandwidth>
+ > migrate curling:tcp:<address>:<port>
+3. Connect to the receiver vm by vnc or spice. The screen of the vm is
+displayed when curling is ready.
+4. Now the sender vm is protected by ft. When it encounters a failure,
+the failover kicks in.
+
+
+
+Design & Implementation
+=======================
+* By leveraging the live migration feature, we do endless live migrations
+between the sender and receiver, so the two virtual machines stay synchronized.
+
+* The receiver does not load the vm state once the migration begins; instead,
+it prefetches the data of one whole migration into a buffer, then loads the vm
+state from that buffer afterwards. This "all or nothing" approach prevents the
+broken-in-the-middle problem Kemari has.
+
+* The sender sleeps a little while after each migration, to ease the
+performance penalty entailed by vm_stop and iothread locks. This is a
+tradeoff between performance and accuracy.
+....
+
+
+Performance
+===========
--
1.8.0.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface
2013-09-10 3:43 [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance Jules Wang
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 1/4] Curling: add doc Jules Wang
@ 2013-09-10 3:43 ` Jules Wang
2013-09-10 13:57 ` Juan Quintela
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 3/4] Curling: the sender Jules Wang
` (2 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Jules Wang @ 2013-09-10 3:43 UTC (permalink / raw)
To: qemu-devel; +Cc: quintela, owasserm, Jules Wang, stefanha, pbonzini
Parse the word 'curling' when an incoming/outgoing migration is
starting, so we know whether to enable fault tolerance or not.
Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
---
include/migration/migration.h | 2 ++
migration.c | 16 ++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 140e6b4..4cbb62f 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -162,4 +162,6 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
ram_addr_t offset, size_t size,
int *bytes_sent);
+bool ft_enabled(void);
+
#endif
diff --git a/migration.c b/migration.c
index 200d404..59c8f32 100644
--- a/migration.c
+++ b/migration.c
@@ -58,6 +58,12 @@ enum {
static NotifierList migration_state_notifiers =
NOTIFIER_LIST_INITIALIZER(migration_state_notifiers);
+static bool ft_mode;
+bool ft_enabled(void)
+{
+ return ft_mode;
+}
+
/* When we add fault tolerance, we could have several
migrations at once. For now we don't need to add
dynamic creation of migration */
@@ -78,6 +84,11 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
{
const char *p;
+ if (strstart(uri, "curling:", &p)) {
+ ft_mode = true;
+ uri = p;
+ }
+
if (strstart(uri, "tcp:", &p))
tcp_start_incoming_migration(p, errp);
#ifdef CONFIG_RDMA
@@ -420,6 +431,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
s = migrate_init(&params);
+ if (strstart(uri, "curling:", &p)) {
+ ft_mode = true;
+ uri = p;
+ }
+
if (strstart(uri, "tcp:", &p)) {
tcp_start_outgoing_migration(s, p, &local_err);
#ifdef CONFIG_RDMA
--
1.8.0.1
* Re: [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface Jules Wang
@ 2013-09-10 13:57 ` Juan Quintela
2013-09-10 13:03 ` Paolo Bonzini
2013-09-11 2:51 ` junqing.wang
0 siblings, 2 replies; 20+ messages in thread
From: Juan Quintela @ 2013-09-10 13:57 UTC (permalink / raw)
To: Jules Wang; +Cc: pbonzini, qemu-devel, stefanha, owasserm
Jules Wang <junqing.wang@cs2c.com.cn> wrote:
> Parse the word 'curling' when incoming/outgoing migration is
> starting. So we know whether to enable fault tolerant or not.
>
> Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
> ---
> include/migration/migration.h | 2 ++
> migration.c | 16 ++++++++++++++++
> 2 files changed, 18 insertions(+)
>
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 140e6b4..4cbb62f 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -162,4 +162,6 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
> ram_addr_t offset, size_t size,
> int *bytes_sent);
>
> +bool ft_enabled(void);
> +
> #endif
> diff --git a/migration.c b/migration.c
> index 200d404..59c8f32 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -58,6 +58,12 @@ enum {
> static NotifierList migration_state_notifiers =
> NOTIFIER_LIST_INITIALIZER(migration_state_notifiers);
>
> +static bool ft_mode;
> +bool ft_enabled(void)
> +{
> + return ft_mode;
Shouldn't this be in migration_state? Just wondering. And yes, I
don't see either a trivial place how to get it. get_current_migration()?
> +}
> +
> /* When we add fault tolerance, we could have several
> migrations at once. For now we don't need to add
> dynamic creation of migration */
> @@ -78,6 +84,11 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
> {
> const char *p;
>
> + if (strstart(uri, "curling:", &p)) {
> + ft_mode = true;
> + uri = p;
> + }
> +
Syntax is at least weird:
curling:tcp:foo:9999
curling+tcp:foo:9999
could be better? Suggestions folks?
notice that we still need more things: tcp+tls should happen at some
time soon. This is not related with this patch.
> if (strstart(uri, "tcp:", &p))
> tcp_start_incoming_migration(p, errp);
> #ifdef CONFIG_RDMA
> @@ -420,6 +431,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>
> s = migrate_init(&params);
>
> + if (strstart(uri, "curling:", &p)) {
> + ft_mode = true;
> + uri = p;
> + }
> +
> if (strstart(uri, "tcp:", &p)) {
> tcp_start_outgoing_migration(s, p, &local_err);
> #ifdef CONFIG_RDMA
* Re: [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface
2013-09-10 13:57 ` Juan Quintela
@ 2013-09-10 13:03 ` Paolo Bonzini
2013-09-10 16:37 ` Juan Quintela
2013-09-11 2:51 ` junqing.wang
1 sibling, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2013-09-10 13:03 UTC (permalink / raw)
To: quintela; +Cc: stefanha, owasserm, Jules Wang, qemu-devel
Il 10/09/2013 15:57, Juan Quintela ha scritto:
>> >
>> > + if (strstart(uri, "curling:", &p)) {
>> > + ft_mode = true;
>> > + uri = p;
>> > + }
>> > +
> Syntax is at least weird:
>
> curling:tcp:foo:9999
>
> curling+tcp:foo:9999
>
> could be better? Suggestions folks?
>
> notice that we still need more things: tcp+tls should happen at some
> time soon. This is not related with this patch.
>
I think for the outgoing side it should just be "migrate -f tcp:foo:9999".
On the incoming side, perhaps you could have a different ID instead of
QEMU_VM_FILE_MAGIC, that triggers fault-tolerance mode automatically?
Then again it would be simply "-incoming tcp:foo:9999".
Paolo
* Re: [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface
2013-09-10 13:03 ` Paolo Bonzini
@ 2013-09-10 16:37 ` Juan Quintela
2013-09-10 14:38 ` Paolo Bonzini
0 siblings, 1 reply; 20+ messages in thread
From: Juan Quintela @ 2013-09-10 16:37 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: stefanha, owasserm, Jules Wang, qemu-devel
Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 10/09/2013 15:57, Juan Quintela ha scritto:
>>> >
>>> > + if (strstart(uri, "curling:", &p)) {
>>> > + ft_mode = true;
>>> > + uri = p;
>>> > + }
>>> > +
>> Syntax is at least weird:
>>
>> curling:tcp:foo:9999
>>
>> curling+tcp:foo:9999
>>
>> could be better? Suggestions folks?
>>
>> notice that we still need more things: tcp+tls should happen at some
>> time soon. This is not related with this patch.
>>
>
> I think for the outgoing side it should just be "migrate -f tcp:foo:9999".
>
> On the incoming side, perhaps you could have a different ID instead of
> QEMU_VM_FILE_MAGIC, that triggers fault-tolerance mode automatically?
> Then again it would be simply "-incoming tcp:foo:9999".
Then how can you distinguish between fault tolerance and simple migration?
You need to differentiate on both sides.
- outgoing side: you need to continue running after sending the whole
state
- incoming side: after receiving a lot, you apply it, and have to wait
for the next one.
It is a different thing to do; we need to tell qemu somehow.
> Paolo
* Re: [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface
2013-09-10 16:37 ` Juan Quintela
@ 2013-09-10 14:38 ` Paolo Bonzini
2013-09-10 15:21 ` Juan Quintela
2013-09-10 15:22 ` Juan Quintela
0 siblings, 2 replies; 20+ messages in thread
From: Paolo Bonzini @ 2013-09-10 14:38 UTC (permalink / raw)
To: quintela; +Cc: stefanha, owasserm, Jules Wang, qemu-devel
Il 10/09/2013 18:37, Juan Quintela ha scritto:
>> I think for the outgoing side it should just be "migrate -f tcp:foo:9999".
>>
>> On the incoming side, perhaps you could have a different ID instead of
>> QEMU_VM_FILE_MAGIC, that triggers fault-tolerance mode automatically?
>> Then again it would be simply "-incoming tcp:foo:9999".
>
> Then how can you distingish between faultolerance and simple migration?
> You need to diferentiate on both sides.
>
> - outgoing side: you need to continue running after sending the whole
> state
> - incoming side: after receivinga lot, you apply it, and have to wait
> for the next one.
>
> It is a different thing to do, we need to tell qemu somehow.
You look at the first 4 bytes in the stream and distinguish the two cases.
Paolo
* Re: [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface
2013-09-10 14:38 ` Paolo Bonzini
@ 2013-09-10 15:21 ` Juan Quintela
2013-09-10 15:22 ` Juan Quintela
1 sibling, 0 replies; 20+ messages in thread
From: Juan Quintela @ 2013-09-10 15:21 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: stefanha, owasserm, Jules Wang, qemu-devel
Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 10/09/2013 18:37, Juan Quintela ha scritto:
>>> I think for the outgoing side it should just be "migrate -f tcp:foo:9999".
>>>
>>> On the incoming side, perhaps you could have a different ID instead of
>>> QEMU_VM_FILE_MAGIC, that triggers fault-tolerance mode automatically?
>>> Then again it would be simply "-incoming tcp:foo:9999".
>>
>> Then how can you distingish between faultolerance and simple migration?
>> You need to diferentiate on both sides.
>>
>> - outgoing side: you need to continue running after sending the whole
>> state
>> - incoming side: after receivinga lot, you apply it, and have to wait
>> for the next one.
>>
>> It is a different thing to do, we need to tell qemu somehow.
>
> You look at the first 4 bytes in the stream and distinguish the two cases.
We need to change how things are handled. Are we sure we don't want
curling over exec/unix/fd?
* Re: [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface
2013-09-10 14:38 ` Paolo Bonzini
2013-09-10 15:21 ` Juan Quintela
@ 2013-09-10 15:22 ` Juan Quintela
1 sibling, 0 replies; 20+ messages in thread
From: Juan Quintela @ 2013-09-10 15:22 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: stefanha, owasserm, Jules Wang, qemu-devel
Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 10/09/2013 18:37, Juan Quintela ha scritto:
>>> I think for the outgoing side it should just be "migrate -f tcp:foo:9999".
>>>
>>> On the incoming side, perhaps you could have a different ID instead of
>>> QEMU_VM_FILE_MAGIC, that triggers fault-tolerance mode automatically?
>>> Then again it would be simply "-incoming tcp:foo:9999".
>>
>> Then how can you distingish between faultolerance and simple migration?
>> You need to diferentiate on both sides.
>>
>> - outgoing side: you need to continue running after sending the whole
>> state
>> - incoming side: after receivinga lot, you apply it, and have to wait
>> for the next one.
>>
>> It is a different thing to do, we need to tell qemu somehow.
>
> You look at the first 4 bytes in the stream and distinguish the two cases.
We need to change how things are handled, but nothing too complicated.
Are we sure we don't want curling over exec/unix/fd?
Later, Juan.
* Re: [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface
2013-09-10 13:57 ` Juan Quintela
2013-09-10 13:03 ` Paolo Bonzini
@ 2013-09-11 2:51 ` junqing.wang
1 sibling, 0 replies; 20+ messages in thread
From: junqing.wang @ 2013-09-11 2:51 UTC (permalink / raw)
To: quintela, pbonzini; +Cc: qemu-devel
> Shouldn't this be in migration_state? Just wondering. And yes, I
> don't see either a trivial place how to get it. get_current_migration()?
That's a better idea; I will put 'ft_enabled' in the MigrationState struct.
> I think for the outgoing side it should just be "migrate -f tcp:foo:9999".
> On the incoming side, perhaps you could have a different ID instead of
> QEMU_VM_FILE_MAGIC, that triggers fault-tolerance mode automatically?
I am OK with this solution; '-f' indicates fault tolerance, right?
Have you decided yet?
* [Qemu-devel] [PATCH RFC 3/4] Curling: the sender
2013-09-10 3:43 [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance Jules Wang
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 1/4] Curling: add doc Jules Wang
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 2/4] Curling: cmdline interface Jules Wang
@ 2013-09-10 3:43 ` Jules Wang
2013-09-10 14:05 ` Juan Quintela
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 4/4] Curling: the receiver Jules Wang
2013-09-10 12:27 ` [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance Orit Wasserman
4 siblings, 1 reply; 20+ messages in thread
From: Jules Wang @ 2013-09-10 3:43 UTC (permalink / raw)
To: qemu-devel; +Cc: quintela, owasserm, Jules Wang, stefanha, pbonzini
By leveraging live migration feature, the sender simply starts a
new migration when the previous migration is completed.
We need to handle the variables related to live migration very
carefully, so that the new migration does not restart from the
very beginning of the migration; instead, it continues the
previous migration.
Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
---
arch_init.c | 18 +++++++++++++-----
migration.c | 23 ++++++++++++++++++++++-
savevm.c | 4 ++++
3 files changed, 39 insertions(+), 6 deletions(-)
diff --git a/arch_init.c b/arch_init.c
index e47e139..5d006f6 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -611,10 +611,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
{
RAMBlock *block;
int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
+ bool create = false;
- migration_bitmap = bitmap_new(ram_pages);
- bitmap_set(migration_bitmap, 0, ram_pages);
- migration_dirty_pages = ram_pages;
+ if (!ft_enabled() || !migration_bitmap) {
+ migration_bitmap = bitmap_new(ram_pages);
+ bitmap_set(migration_bitmap, 0, ram_pages);
+ migration_dirty_pages = ram_pages;
+ create = true;
+ }
mig_throttle_on = false;
dirty_rate_high_cnt = 0;
@@ -634,7 +638,9 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
qemu_mutex_lock_iothread();
qemu_mutex_lock_ramlist();
bytes_transferred = 0;
- reset_ram_globals();
+ if (!ft_enabled() || create) {
+ reset_ram_globals();
+ }
memory_global_dirty_log_start();
migration_bitmap_sync();
@@ -744,7 +750,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
}
ram_control_after_iterate(f, RAM_CONTROL_FINISH);
- migration_end();
+ if (!ft_enabled()) {
+ migration_end();
+ }
qemu_mutex_unlock_ramlist();
qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
diff --git a/migration.c b/migration.c
index 59c8f32..d8a9b2d 100644
--- a/migration.c
+++ b/migration.c
@@ -567,6 +567,7 @@ static void *migration_thread(void *opaque)
int64_t max_size = 0;
int64_t start_time = initial_time;
bool old_vm_running = false;
+ int time_window = 100;
DPRINTF("beginning savevm\n");
qemu_savevm_state_begin(s->file, &s->params);
@@ -578,6 +579,8 @@ static void *migration_thread(void *opaque)
while (s->state == MIG_STATE_ACTIVE) {
int64_t current_time;
+ int64_t time_spent;
+ int64_t migration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
uint64_t pending_size;
if (!qemu_file_rate_limit(s->file)) {
@@ -607,10 +610,28 @@ static void *migration_thread(void *opaque)
break;
}
- if (!qemu_file_get_error(s->file)) {
+ if (!qemu_file_get_error(s->file) && !ft_enabled()) {
migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_COMPLETED);
break;
}
+
+ if (ft_enabled()) {
+ if (old_vm_running) {
+ qemu_mutex_lock_iothread();
+ vm_start();
+ qemu_mutex_unlock_iothread();
+
+ current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+ time_spent = current_time - migration_start_time;
+ DPRINTF("this migration lasts for %" PRId64 "ms\n",
+ time_spent);
+ if (time_spent < time_window) {
+ g_usleep((time_window - time_spent)*1000);
+ initial_time += time_window - time_spent;
+ }
+ }
+ qemu_savevm_state_begin(s->file, &s->params);
+ }
}
}
diff --git a/savevm.c b/savevm.c
index c536aa4..6daf690 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1824,6 +1824,7 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
#define QEMU_VM_SECTION_END 0x03
#define QEMU_VM_SECTION_FULL 0x04
#define QEMU_VM_SUBSECTION 0x05
+#define QEMU_VM_EOF_MAGIC 0xFeedCafe
bool qemu_savevm_state_blocked(Error **errp)
{
@@ -1983,6 +1984,9 @@ void qemu_savevm_state_complete(QEMUFile *f)
}
qemu_put_byte(f, QEMU_VM_EOF);
+ if (ft_enabled()) {
+ qemu_put_be32(f, QEMU_VM_EOF_MAGIC);
+ }
qemu_fflush(f);
}
--
1.8.0.1
* Re: [Qemu-devel] [PATCH RFC 3/4] Curling: the sender
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 3/4] Curling: the sender Jules Wang
@ 2013-09-10 14:05 ` Juan Quintela
2013-09-11 7:31 ` junqing.wang
0 siblings, 1 reply; 20+ messages in thread
From: Juan Quintela @ 2013-09-10 14:05 UTC (permalink / raw)
To: Jules Wang; +Cc: pbonzini, qemu-devel, stefanha, owasserm
Jules Wang <junqing.wang@cs2c.com.cn> wrote:
> By leveraging live migration feature, the sender simply starts a
> new migration when the previous migration is completed.
>
> We need to handle the variables related to live migration very
> carefully. So the new migration does not restart from the very
> begin of the migration, instead, it continues the previous
> migration.
>
> Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
> ---
> arch_init.c | 18 +++++++++++++-----
> migration.c | 23 ++++++++++++++++++++++-
> savevm.c | 4 ++++
> 3 files changed, 39 insertions(+), 6 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index e47e139..5d006f6 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -611,10 +611,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> {
> RAMBlock *block;
> int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
> + bool create = false;
This variable is never set.
>
> - migration_bitmap = bitmap_new(ram_pages);
> - bitmap_set(migration_bitmap, 0, ram_pages);
> - migration_dirty_pages = ram_pages;
> + if (!ft_enabled() || !migration_bitmap) {
> + migration_bitmap = bitmap_new(ram_pages);
Nothing in this patch sets the migration_bitmap to anything.
> + bitmap_set(migration_bitmap, 0, ram_pages);
> + migration_dirty_pages = ram_pages;
> + create = true;
> + }
> mig_throttle_on = false;
> dirty_rate_high_cnt = 0;
> @@ -634,7 +638,9 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> qemu_mutex_lock_iothread();
> qemu_mutex_lock_ramlist();
> bytes_transferred = 0;
> - reset_ram_globals();
> + if (!ft_enabled() || create) {
> + reset_ram_globals();
> + }
>
> memory_global_dirty_log_start();
> migration_bitmap_sync();
> @@ -744,7 +750,9 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> }
>
> ram_control_after_iterate(f, RAM_CONTROL_FINISH);
> - migration_end();
> + if (!ft_enabled()) {
> + migration_end();
> + }
What you want here? My guess is that you want to sent device state
without sending the end of migration command, right?
> qemu_mutex_unlock_ramlist();
> qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> diff --git a/migration.c b/migration.c
> index 59c8f32..d8a9b2d 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -567,6 +567,7 @@ static void *migration_thread(void *opaque)
> int64_t max_size = 0;
> int64_t start_time = initial_time;
> bool old_vm_running = false;
> + int time_window = 100;
>
> DPRINTF("beginning savevm\n");
> qemu_savevm_state_begin(s->file, &s->params);
> @@ -578,6 +579,8 @@ static void *migration_thread(void *opaque)
>
> while (s->state == MIG_STATE_ACTIVE) {
> int64_t current_time;
> + int64_t time_spent;
> + int64_t migration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> uint64_t pending_size;
>
> if (!qemu_file_rate_limit(s->file)) {
> @@ -607,10 +610,28 @@ static void *migration_thread(void *opaque)
> break;
> }
>
> - if (!qemu_file_get_error(s->file)) {
> + if (!qemu_file_get_error(s->file) && !ft_enabled()) {
> migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_COMPLETED);
> break;
> }
> +
> + if (ft_enabled()) {
> + if (old_vm_running) {
> + qemu_mutex_lock_iothread();
> + vm_start();
> + qemu_mutex_unlock_iothread();
> +
> + current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> + time_spent = current_time - migration_start_time;
> + DPRINTF("this migration lasts for %" PRId64 "ms\n",
> + time_spent);
> + if (time_spent < time_window) {
> + g_usleep((time_window - time_spent)*1000);
Why are we waiting here? If we are migrating faster than allowed, why
are we waiting?
> + initial_time += time_window - time_spent;
> + }
> + }
> + qemu_savevm_state_begin(s->file, &s->params);
> + }
> }
> }
>
> diff --git a/savevm.c b/savevm.c
> index c536aa4..6daf690 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -1824,6 +1824,7 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se)
> #define QEMU_VM_SECTION_END 0x03
> #define QEMU_VM_SECTION_FULL 0x04
> #define QEMU_VM_SUBSECTION 0x05
> +#define QEMU_VM_EOF_MAGIC 0xFeedCafe
>
> bool qemu_savevm_state_blocked(Error **errp)
> {
> @@ -1983,6 +1984,9 @@ void qemu_savevm_state_complete(QEMUFile *f)
> }
>
> qemu_put_byte(f, QEMU_VM_EOF);
> + if (ft_enabled()) {
> + qemu_put_be32(f, QEMU_VM_EOF_MAGIC);
> + }
> qemu_fflush(f);
> }
* Re: [Qemu-devel] [PATCH RFC 3/4] Curling: the sender
2013-09-10 14:05 ` Juan Quintela
@ 2013-09-11 7:31 ` junqing.wang
0 siblings, 0 replies; 20+ messages in thread
From: junqing.wang @ 2013-09-11 7:31 UTC (permalink / raw)
To: quintela; +Cc: qemu-devel
hi,
>> + bool create = false;
> >This variable is never set.
It is set in the following 'if' block.
+ create = true; <<=======
>> - migration_bitmap = bitmap_new(ram_pages);
>> - bitmap_set(migration_bitmap, 0, ram_pages);
>> - migration_dirty_pages = ram_pages;
>> + if (!ft_enabled() || !migration_bitmap) {
>> + migration_bitmap = bitmap_new(ram_pages);
>> + bitmap_set(migration_bitmap, 0, ram_pages);
>> + migration_dirty_pages = ram_pages;
>> + create = true; <==========
>> + }
>Nothing in this patch sets the migration_bitmap to anything.
Let me explain all the odd 'if' blocks:
1 >> + if (!ft_enabled() || !migration_bitmap) {
2 >> + if (!ft_enabled() || create) {
3 >> + if (!ft_enabled()) {
As I mentioned in the commit log:
>> We need to handle the variables related to live migration very
>> carefully. So the new migration does not restart from the very
>> begin of the migration, instead, it continues the previous
>> migration.
Some variables should not be reset after one migration, because
the next one needs these variables to continue the migration.
This explains all the "if ft_enabled()" checks.
Besides, some variables need to be initialized at the first migration of
curling. That explains the "if create" and "if !migration_bitmap" checks.
>> + if (ft_enabled()) {
>> + if (old_vm_running) {
>> + qemu_mutex_lock_iothread();
>> + vm_start();
>> + qemu_mutex_unlock_iothread();
>> +
>> + current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>> + time_spent = current_time - migration_start_time;
>> + DPRINTF("this migration lasts for %" PRId64 "ms\n",
>> + time_spent);
>> + if (time_spent < time_window) {
>> + g_usleep((time_window - time_spent)*1000);
>
>Why are we waiting here? If we are migration faster than allowed, why
>we are waiting?
Looping too fast is not good: it means we take the iothread lock and do vm_stop more frequently. Performance drops and the vm user experiences input stalls if we do not sleep.
How to deal with this is a difficult issue; any suggestion is welcome.
THIS IS ONE OF THE TWO MAIN PROBLEMS. The other one is related to the magic number 0xfeedcafe.
* [Qemu-devel] [PATCH RFC 4/4] Curling: the receiver
2013-09-10 3:43 [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance Jules Wang
` (2 preceding siblings ...)
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 3/4] Curling: the sender Jules Wang
@ 2013-09-10 3:43 ` Jules Wang
2013-09-10 14:19 ` Juan Quintela
2013-09-10 12:27 ` [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance Orit Wasserman
4 siblings, 1 reply; 20+ messages in thread
From: Jules Wang @ 2013-09-10 3:43 UTC (permalink / raw)
To: qemu-devel; +Cc: quintela, owasserm, Jules Wang, stefanha, pbonzini
The receiver runs the migration loop until the migration connection is
lost. Then, it is started as a backup.
The receiver does not load the vm state once a migration begins;
instead, it prefetches the data of one whole migration into a buffer,
then loads the vm state from that buffer afterwards.
Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
---
include/migration/qemu-file.h | 1 +
include/sysemu/sysemu.h | 1 +
migration.c | 22 ++++--
savevm.c | 154 ++++++++++++++++++++++++++++++++++++++++--
4 files changed, 168 insertions(+), 10 deletions(-)
diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index 0f757fb..f01ff10 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -92,6 +92,7 @@ typedef struct QEMUFileOps {
QEMURamHookFunc *after_ram_iterate;
QEMURamHookFunc *hook_ram_load;
QEMURamSaveFunc *save_page;
+ QEMUFileGetBufferFunc *get_prefetch_buffer;
} QEMUFileOps;
QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index b1aa059..44f23d0 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -81,6 +81,7 @@ void qemu_savevm_state_complete(QEMUFile *f);
void qemu_savevm_state_cancel(void);
uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size);
int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state_ft(QEMUFile *f);
/* SLIRP */
void do_info_slirp(Monitor *mon);
diff --git a/migration.c b/migration.c
index d8a9b2d..9be22a4 100644
--- a/migration.c
+++ b/migration.c
@@ -19,6 +19,7 @@
#include "monitor/monitor.h"
#include "migration/qemu-file.h"
#include "sysemu/sysemu.h"
+#include "sysemu/cpus.h"
#include "block/block.h"
#include "qemu/sockets.h"
#include "migration/block.h"
@@ -112,13 +113,24 @@ static void process_incoming_migration_co(void *opaque)
{
QEMUFile *f = opaque;
int ret;
+ int count = 0;
- ret = qemu_loadvm_state(f);
- qemu_fclose(f);
- if (ret < 0) {
- fprintf(stderr, "load of migration failed\n");
- exit(EXIT_FAILURE);
+ if (ft_enabled()) {
+ while (qemu_loadvm_state_ft(f) >= 0) {
+ count++;
+ DPRINTF("incoming count %d\r", count);
+ }
+ qemu_fclose(f);
+ fprintf(stderr, "ft connection lost, launching self..\n");
+ } else {
+ ret = qemu_loadvm_state(f);
+ qemu_fclose(f);
+ if (ret < 0) {
+ fprintf(stderr, "load of migration failed\n");
+ exit(EXIT_FAILURE);
+ }
}
+ cpu_synchronize_all_post_init();
qemu_announce_self();
DPRINTF("successfully loaded vm state\n");
diff --git a/savevm.c b/savevm.c
index 6daf690..d5bf153 100644
--- a/savevm.c
+++ b/savevm.c
@@ -52,6 +52,8 @@
#define ARP_PTYPE_IP 0x0800
#define ARP_OP_REQUEST_REV 0x3
+#define PFB_SIZE 0x010000
+
static int announce_self_create(uint8_t *buf,
uint8_t *mac_addr)
{
@@ -135,6 +137,10 @@ struct QEMUFile {
unsigned int iovcnt;
int last_error;
+
+ uint8_t *pfb; /* pfb -> PreFetch Buffer */
+ uint64_t pfb_index;
+ uint64_t pfb_size;
};
typedef struct QEMUFileStdio
@@ -193,6 +199,25 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
return len;
}
+static int socket_get_prefetch_buffer(void *opaque, uint8_t *buf,
+ int64_t pos, int size)
+{
+ QEMUFile *f = opaque;
+
+ if (f->pfb_size - pos <= 0) {
+ return 0;
+ }
+
+ if (f->pfb_size - pos < size) {
+ size = f->pfb_size - pos;
+ }
+
+ memcpy(buf, f->pfb+pos, size);
+
+ return size;
+}
+
+
static int socket_close(void *opaque)
{
QEMUFileSocket *s = opaque;
@@ -440,6 +465,7 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
static const QEMUFileOps socket_read_ops = {
.get_fd = socket_get_fd,
.get_buffer = socket_get_buffer,
+ .get_prefetch_buffer = socket_get_prefetch_buffer,
.close = socket_close
};
@@ -493,7 +519,7 @@ QEMUFile *qemu_fopen(const char *filename, const char *mode)
s->stdio_file = fopen(filename, mode);
if (!s->stdio_file)
goto fail;
-
+
if(mode[0] == 'w') {
s->file = qemu_fopen_ops(s, &stdio_file_write_ops);
} else {
@@ -739,6 +765,11 @@ int qemu_fclose(QEMUFile *f)
if (f->last_error) {
ret = f->last_error;
}
+
+ if (f->pfb) {
+ g_free(f->pfb);
+ }
+
g_free(f);
return ret;
}
@@ -822,6 +853,14 @@ void qemu_put_byte(QEMUFile *f, int v)
static void qemu_file_skip(QEMUFile *f, int size)
{
+ if (f->pfb_index + size <= f->pfb_size) {
+ f->pfb_index += size;
+ return;
+ } else {
+ size -= f->pfb_size - f->pfb_index;
+ f->pfb_index = f->pfb_size;
+ }
+
if (f->buf_index + size <= f->buf_size) {
f->buf_index += size;
}
@@ -831,6 +870,21 @@ static int qemu_peek_buffer(QEMUFile *f, uint8_t *buf, int size, size_t offset)
{
int pending;
int index;
+ int done;
+
+ if (f->ops->get_prefetch_buffer) {
+ if (f->pfb_index + offset < f->pfb_size) {
+ done = f->ops->get_prefetch_buffer(f, buf, f->pfb_index + offset,
+ size);
+ if (done == size) {
+ return size;
+ }
+ size -= done;
+ buf += done;
+ } else {
+ offset -= f->pfb_size - f->pfb_index;
+ }
+ }
assert(!qemu_file_is_writable(f));
@@ -875,7 +929,15 @@ int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size)
static int qemu_peek_byte(QEMUFile *f, int offset)
{
- int index = f->buf_index + offset;
+ int index;
+
+ if (f->pfb_index + offset < f->pfb_size) {
+ return f->pfb[f->pfb_index + offset];
+ } else {
+ offset -= f->pfb_size - f->pfb_index;
+ }
+
+ index = f->buf_index + offset;
assert(!qemu_file_is_writable(f));
@@ -1851,7 +1913,7 @@ void qemu_savevm_state_begin(QEMUFile *f,
}
se->ops->set_params(params, se->opaque);
}
-
+
qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
qemu_put_be32(f, QEMU_VM_FILE_VERSION);
@@ -2294,8 +2356,6 @@ int qemu_loadvm_state(QEMUFile *f)
}
}
- cpu_synchronize_all_post_init();
-
ret = 0;
out:
@@ -2311,6 +2371,89 @@ out:
return ret;
}
+int qemu_loadvm_state_ft(QEMUFile *f)
+{
+ int ret = 0;
+ int i = 0;
+ int j = 0;
+ int done = 0;
+ uint64_t size = 0;
+ uint64_t count = 0;
+ uint8_t *pfb = NULL;
+ uint8_t *buf = NULL;
+
+ uint64_t max_mem = last_ram_offset() * 1.5;
+
+ if (!f->ops->get_prefetch_buffer) {
+ fprintf(stderr, "Fault tolerant is not supported by this protocol.\n");
+ return EINVAL;
+ }
+
+ size = PFB_SIZE;
+ pfb = g_malloc(size);
+
+ while (true) {
+ if (count + TARGET_PAGE_SIZE >= size) {
+ if (size*2 > max_mem) {
+ fprintf(stderr, "qemu_loadvm_state_ft: warning:" \
+ "Prefetch buffer becomes too large.\n" \
+ "Fault tolerant is unstable when you see this,\n" \
+ "please increase the bandwidth or increase " \
+ "the max down time.\n");
+ break;
+ }
+ size = size * 2;
+ buf = g_try_realloc(pfb, size);
+ if (!buf) {
+ error_report("qemu_loadvm_state_ft: out of memory.\n");
+ g_free(pfb);
+ return ENOMEM;
+ }
+
+ pfb = buf;
+ }
+
+ done = qemu_get_buffer(f, pfb + count, TARGET_PAGE_SIZE);
+
+ ret = qemu_file_get_error(f);
+ if (ret != 0) {
+ g_free(pfb);
+ return ret;
+ }
+
+ buf = pfb + count;
+ count += done;
+ for (i = 0; i < done; i++) {
+ if (buf[i] != 0xfe) {
+ continue;
+ }
+ if (buf[i-1] != 0xCa) {
+ continue;
+ }
+ if (buf[i-2] != 0xed) {
+ continue;
+ }
+ if (buf[i-3] == 0xFe) {
+ goto out;
+ }
+ }
+ }
+ out:
+ if (f->pfb) {
+ free(f->pfb);
+ }
+ f->pfb_size = count;
+ f->pfb_index = 0;
+ f->pfb = pfb;
+
+ ret = qemu_loadvm_state(f);
+
+ /* Skip magic number */
+ qemu_get_be32(f);
+
+ return ret;
+}
+
static BlockDriverState *find_vmstate_bs(void)
{
BlockDriverState *bs = NULL;
@@ -2419,6 +2562,7 @@ void do_savevm(Monitor *mon, const QDict *qdict)
goto the_end;
}
ret = qemu_savevm_state(f);
+ cpu_synchronize_all_post_init();
vm_state_size = qemu_ftell(f);
qemu_fclose(f);
if (ret < 0) {
--
1.8.0.1
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 4/4] Curling: the receiver
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 4/4] Curling: the receiver Jules Wang
@ 2013-09-10 14:19 ` Juan Quintela
2013-09-11 8:25 ` junqing.wang
0 siblings, 1 reply; 20+ messages in thread
From: Juan Quintela @ 2013-09-10 14:19 UTC (permalink / raw)
To: Jules Wang; +Cc: pbonzini, qemu-devel, stefanha, owasserm
Jules Wang <junqing.wang@cs2c.com.cn> wrote:
> The receiver runs the migration loop until the migration connection is
> lost. Then, it is started as the backup.
>
> The receiver does not load vm state once a migration begins;
> instead, it prefetches one whole round of migration data into a buffer,
> then loads the vm state from that buffer afterwards.
>
> Signed-off-by: Jules Wang <junqing.wang@cs2c.com.cn>
> ---
> include/migration/qemu-file.h | 1 +
> include/sysemu/sysemu.h | 1 +
> migration.c | 22 ++++--
> savevm.c | 154 ++++++++++++++++++++++++++++++++++++++++--
> 4 files changed, 168 insertions(+), 10 deletions(-)
>
> diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
> index 0f757fb..f01ff10 100644
> --- a/include/migration/qemu-file.h
> +++ b/include/migration/qemu-file.h
> @@ -92,6 +92,7 @@ typedef struct QEMUFileOps {
> QEMURamHookFunc *after_ram_iterate;
> QEMURamHookFunc *hook_ram_load;
> QEMURamSaveFunc *save_page;
> + QEMUFileGetBufferFunc *get_prefetch_buffer;
> } QEMUFileOps;
>
> QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index b1aa059..44f23d0 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -81,6 +81,7 @@ void qemu_savevm_state_complete(QEMUFile *f);
> void qemu_savevm_state_cancel(void);
> uint64_t qemu_savevm_state_pending(QEMUFile *f, uint64_t max_size);
> int qemu_loadvm_state(QEMUFile *f);
> +int qemu_loadvm_state_ft(QEMUFile *f);
>
> /* SLIRP */
> void do_info_slirp(Monitor *mon);
> diff --git a/migration.c b/migration.c
> index d8a9b2d..9be22a4 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -19,6 +19,7 @@
> #include "monitor/monitor.h"
> #include "migration/qemu-file.h"
> #include "sysemu/sysemu.h"
> +#include "sysemu/cpus.h"
> #include "block/block.h"
> #include "qemu/sockets.h"
> #include "migration/block.h"
> @@ -112,13 +113,24 @@ static void process_incoming_migration_co(void *opaque)
> {
> QEMUFile *f = opaque;
> int ret;
> + int count = 0;
>
> - ret = qemu_loadvm_state(f);
> - qemu_fclose(f);
> - if (ret < 0) {
> - fprintf(stderr, "load of migration failed\n");
> - exit(EXIT_FAILURE);
> + if (ft_enabled()) {
> + while (qemu_loadvm_state_ft(f) >= 0) {
> + count++;
> + DPRINTF("incoming count %d\r", count);
> + }
> + qemu_fclose(f);
> + fprintf(stderr, "ft connection lost, launching self..\n");
Obviously, here we need something more than an fprintf, right?
We are also not checking whether it is an error.
> + } else {
> + ret = qemu_loadvm_state(f);
> + qemu_fclose(f);
> + if (ret < 0) {
> + fprintf(stderr, "load of migration failed\n");
> + exit(EXIT_FAILURE);
> + }
> }
> + cpu_synchronize_all_post_init();
> qemu_announce_self();
> DPRINTF("successfully loaded vm state\n");
>
> diff --git a/savevm.c b/savevm.c
> index 6daf690..d5bf153 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -52,6 +52,8 @@
> #define ARP_PTYPE_IP 0x0800
> #define ARP_OP_REQUEST_REV 0x3
>
> +#define PFB_SIZE 0x010000
> +
> static int announce_self_create(uint8_t *buf,
> uint8_t *mac_addr)
> {
> @@ -135,6 +137,10 @@ struct QEMUFile {
> unsigned int iovcnt;
>
> int last_error;
> +
> + uint8_t *pfb; /* pfb -> PerFetch Buffer */
s/PerFetch/Prefetch/
prefetch_buffer as a name? It is not used in many places; does that make
things clearer or more convoluted? Other comments?
> +static int socket_get_prefetch_buffer(void *opaque, uint8_t *buf,
> + int64_t pos, int size)
> +{
> + QEMUFile *f = opaque;
> +
> + if (f->pfb_size - pos <= 0) {
> + return 0;
> + }
> +
> + if (f->pfb_size - pos < size) {
> + size = f->pfb_size - pos;
> + }
> +
> + memcpy(buf, f->pfb+pos, size);
> +
> + return size;
> +}
> +
> +
> static int socket_close(void *opaque)
> {
> QEMUFileSocket *s = opaque;
> @@ -440,6 +465,7 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
> static const QEMUFileOps socket_read_ops = {
> .get_fd = socket_get_fd,
> .get_buffer = socket_get_buffer,
> + .get_prefetch_buffer = socket_get_prefetch_buffer,
> .close = socket_close
> };
>
> if (f->last_error) {
> ret = f->last_error;
> }
> +
> + if (f->pfb) {
> + g_free(f->pfb);
g_free(f->pfb);
It already checks for NULL.
> + }
> +
> g_free(f);
> return ret;
> }
> @@ -822,6 +853,14 @@ void qemu_put_byte(QEMUFile *f, int v)
>
> static void qemu_file_skip(QEMUFile *f, int size)
> {
> + if (f->pfb_index + size <= f->pfb_size) {
> + f->pfb_index += size;
> + return;
> + } else {
> + size -= f->pfb_size - f->pfb_index;
> + f->pfb_index = f->pfb_size;
> + }
> +
> if (f->buf_index + size <= f->buf_size) {
> f->buf_index += size;
> }
> @@ -831,6 +870,21 @@ static int qemu_peek_buffer(QEMUFile *f, uint8_t *buf, int size, size_t offset)
> {
> int pending;
> int index;
> + int done;
> +
> + if (f->ops->get_prefetch_buffer) {
> + if (f->pfb_index + offset < f->pfb_size) {
> + done = f->ops->get_prefetch_buffer(f, buf, f->pfb_index + offset,
> + size);
> + if (done == size) {
> + return size;
> + }
> + size -= done;
> + buf += done;
> + } else {
> + offset -= f->pfb_size - f->pfb_index;
> + }
> + }
>
> assert(!qemu_file_is_writable(f));
>
> @@ -875,7 +929,15 @@ int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size)
>
> static int qemu_peek_byte(QEMUFile *f, int offset)
> {
> - int index = f->buf_index + offset;
> + int index;
> +
> + if (f->pfb_index + offset < f->pfb_size) {
> + return f->pfb[f->pfb_index + offset];
> + } else {
> + offset -= f->pfb_size - f->pfb_index;
> + }
> +
> + index = f->buf_index + offset;
>
> assert(!qemu_file_is_writable(f));
>
> @@ -1851,7 +1913,7 @@ void qemu_savevm_state_begin(QEMUFile *f,
> }
> se->ops->set_params(params, se->opaque);
> }
> -
> +
> qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
> qemu_put_be32(f, QEMU_VM_FILE_VERSION);
>
> @@ -2294,8 +2356,6 @@ int qemu_loadvm_state(QEMUFile *f)
> }
> }
>
> - cpu_synchronize_all_post_init();
> -
> ret = 0;
>
> out:
> @@ -2311,6 +2371,89 @@ out:
> return ret;
> }
>
> +int qemu_loadvm_state_ft(QEMUFile *f)
> +{
> + int ret = 0;
> + int i = 0;
> + int j = 0;
> + int done = 0;
> + uint64_t size = 0;
> + uint64_t count = 0;
> + uint8_t *pfb = NULL;
> + uint8_t *buf = NULL;
> +
> + uint64_t max_mem = last_ram_offset() * 1.5;
> +
> + if (!f->ops->get_prefetch_buffer) {
> + fprintf(stderr, "Fault tolerant is not supported by this protocol.\n");
> + return EINVAL;
> + }
> +
> + size = PFB_SIZE;
> + pfb = g_malloc(size);
> +
> + while (true) {
> + if (count + TARGET_PAGE_SIZE >= size) {
> + if (size*2 > max_mem) {
> + fprintf(stderr, "qemu_loadvm_state_ft: warning:" \
> + "Prefetch buffer becomes too large.\n" \
> + "Fault tolerant is unstable when you see this,\n" \
> + "please increase the bandwidth or increase " \
> + "the max down time.\n");
> + break;
> + }
> + size = size * 2;
> + buf = g_try_realloc(pfb, size);
> + if (!buf) {
> + error_report("qemu_loadvm_state_ft: out of memory.\n");
> + g_free(pfb);
> + return ENOMEM;
You are not handling this error in the caller. Notice that qemu
normally
> + }
> +
> + pfb = buf;
> + }
> +
> + done = qemu_get_buffer(f, pfb + count, TARGET_PAGE_SIZE);
> +
> + ret = qemu_file_get_error(f);
> + if (ret != 0) {
> + g_free(pfb);
> + return ret;
> + }
> +
> + buf = pfb + count;
> + count += done;
> + for (i = 0; i < done; i++) {
> + if (buf[i] != 0xfe) {
> + continue;
> + }
> + if (buf[i-1] != 0xCa) {
> + continue;
> + }
> + if (buf[i-2] != 0xed) {
> + continue;
> + }
> + if (buf[i-3] == 0xFe) {
> + goto out;
> + }
Use consistent capitalization here?
Better way to look for the signature? Or, what happens if it just
happens that the data contains that magic constant?
> + }
> + }
> + out:
> + if (f->pfb) {
> + free(f->pfb);
> + }
> + f->pfb_size = count;
> + f->pfb_index = 0;
> + f->pfb = pfb;
> +
> + ret = qemu_loadvm_state(f);
> +
> + /* Skip magic number */
> + qemu_get_be32(f);
> +
> + return ret;
> +}
> +
> static BlockDriverState *find_vmstate_bs(void)
> {
> BlockDriverState *bs = NULL;
> @@ -2419,6 +2562,7 @@ void do_savevm(Monitor *mon, const QDict *qdict)
> goto the_end;
> }
> ret = qemu_savevm_state(f);
> + cpu_synchronize_all_post_init();
> vm_state_size = qemu_ftell(f);
> qemu_fclose(f);
> if (ret < 0) {
* Re: [Qemu-devel] [PATCH RFC 4/4] Curling: the receiver
2013-09-10 14:19 ` Juan Quintela
@ 2013-09-11 8:25 ` junqing.wang
0 siblings, 0 replies; 20+ messages in thread
From: junqing.wang @ 2013-09-11 8:25 UTC (permalink / raw)
To: quintela; +Cc: qemu-devel
hi,
At 2013-09-10 22:19:48,"Juan Quintela" <quintela@redhat.com> wrote:
>> @@ -112,13 +113,24 @@ static void process_incoming_migration_co(void *opaque)
>> {
>> QEMUFile *f = opaque;
>> int ret;
>> + int count = 0;
>>
>> - ret = qemu_loadvm_state(f);
>> - qemu_fclose(f);
>> - if (ret < 0) {
>> - fprintf(stderr, "load of migration failed\n");
>> - exit(EXIT_FAILURE);
>> + if (ft_enabled()) {
>> + while (qemu_loadvm_state_ft(f) >= 0) {
>> + count++;
>> + DPRINTF("incoming count %d\r", count);
>> + }
>> + qemu_fclose(f);
>> + fprintf(stderr, "ft connection lost, launching self..\n");
>
>Obviously, here we need something more than an fprintf, right?
>
>We are also not checking whether it is an error.
Agree.
>> + } else {
>> + ret = qemu_loadvm_state(f);
>> + qemu_fclose(f);
>> + if (ret < 0) {
>> + fprintf(stderr, "load of migration failed\n");
>> + exit(EXIT_FAILURE);
>> + }
>> }
>> + cpu_synchronize_all_post_init();
>> qemu_announce_self();
>> DPRINTF("successfully loaded vm state\n");
>>
>> diff --git a/savevm.c b/savevm.c
>> index 6daf690..d5bf153 100644
>> --- a/savevm.c
>> +++ b/savevm.c
>> @@ -52,6 +52,8 @@
>> #define ARP_PTYPE_IP 0x0800
>> #define ARP_OP_REQUEST_REV 0x3
>>
>> +#define PFB_SIZE 0x010000
>> +
>> static int announce_self_create(uint8_t *buf,
>> uint8_t *mac_addr)
>> {
>> @@ -135,6 +137,10 @@ struct QEMUFile {
>> unsigned int iovcnt;
>>
>> int last_error;
>> +
>> + uint8_t *pfb; /* pfb -> PerFetch Buffer */
>
>s/PerFetch/Prefetch/
>
>prefetch_buffer as a name? It is not used in many places; does that make
>things clearer or more convoluted? Other comments?
>
Agree.
>> +static int socket_get_prefetch_buffer(void *opaque, uint8_t *buf,
>> + int64_t pos, int size)
>> +{
>> + QEMUFile *f = opaque;
>> +
>> + if (f->pfb_size - pos <= 0) {
>> + return 0;
>> + }
>> +
>> + if (f->pfb_size - pos < size) {
>> + size = f->pfb_size - pos;
>> + }
>> +
>> + memcpy(buf, f->pfb+pos, size);
>> +
>> + return size;
>> +}
>> +
>> +
>> static int socket_close(void *opaque)
>> {
>> QEMUFileSocket *s = opaque;
>> @@ -440,6 +465,7 @@ QEMUFile *qemu_fdopen(int fd, const char *mode)
>> static const QEMUFileOps socket_read_ops = {
>> .get_fd = socket_get_fd,
>> .get_buffer = socket_get_buffer,
>> + .get_prefetch_buffer = socket_get_prefetch_buffer,
>> .close = socket_close
>> };
>>
>
>> if (f->last_error) {
>> ret = f->last_error;
>> }
>> +
>> + if (f->pfb) {
>> + g_free(f->pfb);
>
>g_free(f->pfb);
>It already checks for NULL.
Got it.
>> + }
>> +
>> g_free(f);
>> return ret;
>> }
>> @@ -822,6 +853,14 @@ void qemu_put_byte(QEMUFile *f, int v)
>>
>> static void qemu_file_skip(QEMUFile *f, int size)
>> {
>> + if (f->pfb_index + size <= f->pfb_size) {
>> + f->pfb_index += size;
>> + return;
>> + } else {
>> + size -= f->pfb_size - f->pfb_index;
>> + f->pfb_index = f->pfb_size;
>> + }
>> +
>> if (f->buf_index + size <= f->buf_size) {
>> f->buf_index += size;
>> }
>> @@ -831,6 +870,21 @@ static int qemu_peek_buffer(QEMUFile *f, uint8_t *buf, int size, size_t offset)
>> {
>> int pending;
>> int index;
>> + int done;
>> +
>> + if (f->ops->get_prefetch_buffer) {
>> + if (f->pfb_index + offset < f->pfb_size) {
>> + done = f->ops->get_prefetch_buffer(f, buf, f->pfb_index + offset,
>> + size);
>> + if (done == size) {
>> + return size;
>> + }
>> + size -= done;
>> + buf += done;
>> + } else {
>> + offset -= f->pfb_size - f->pfb_index;
>> + }
>> + }
>>
>> assert(!qemu_file_is_writable(f));
>>
>> @@ -875,7 +929,15 @@ int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size)
>>
>> static int qemu_peek_byte(QEMUFile *f, int offset)
>> {
>> - int index = f->buf_index + offset;
>> + int index;
>> +
>> + if (f->pfb_index + offset < f->pfb_size) {
>> + return f->pfb[f->pfb_index + offset];
>> + } else {
>> + offset -= f->pfb_size - f->pfb_index;
>> + }
>> +
>> + index = f->buf_index + offset;
>>
>> assert(!qemu_file_is_writable(f));
>>
>> @@ -1851,7 +1913,7 @@ void qemu_savevm_state_begin(QEMUFile *f,
>> }
>> se->ops->set_params(params, se->opaque);
>> }
>> -
>> +
>> qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
>> qemu_put_be32(f, QEMU_VM_FILE_VERSION);
>>
>> @@ -2294,8 +2356,6 @@ int qemu_loadvm_state(QEMUFile *f)
>> }
>> }
>>
>> - cpu_synchronize_all_post_init();
>> -
>> ret = 0;
>>
>> out:
>> @@ -2311,6 +2371,89 @@ out:
>> return ret;
>> }
>>
>> +int qemu_loadvm_state_ft(QEMUFile *f)
>> +{
>> + int ret = 0;
>> + int i = 0;
>> + int j = 0;
>> + int done = 0;
>> + uint64_t size = 0;
>> + uint64_t count = 0;
>> + uint8_t *pfb = NULL;
>> + uint8_t *buf = NULL;
>> +
>> + uint64_t max_mem = last_ram_offset() * 1.5;
>> +
>> + if (!f->ops->get_prefetch_buffer) {
>> + fprintf(stderr, "Fault tolerant is not supported by this protocol.\n");
>> + return EINVAL;
>> + }
>> +
>> + size = PFB_SIZE;
>> + pfb = g_malloc(size);
>> +
>> + while (true) {
>> + if (count + TARGET_PAGE_SIZE >= size) {
>> + if (size*2 > max_mem) {
>> + fprintf(stderr, "qemu_loadvm_state_ft: warning:" \
>> + "Prefetch buffer becomes too large.\n" \
>> + "Fault tolerant is unstable when you see this,\n" \
>> + "please increase the bandwidth or increase " \
>> + "the max down time.\n");
>> + break;
>> + }
>> + size = size * 2;
>> + buf = g_try_realloc(pfb, size);
>> + if (!buf) {
>> + error_report("qemu_loadvm_state_ft: out of memory.\n");
>> + g_free(pfb);
>> + return ENOMEM;
>
>You are not handling this error in the caller. Notice that qemu
>normally
I am not sure what you mean.
I did find my mistake, though: it should return -ENOMEM and -EINVAL.
>> + }
>> +
>> + pfb = buf;
>> + }
>> +
>> + done = qemu_get_buffer(f, pfb + count, TARGET_PAGE_SIZE);
>> +
>> + ret = qemu_file_get_error(f);
>> + if (ret != 0) {
>> + g_free(pfb);
>> + return ret;
>> + }
>> +
>> + buf = pfb + count;
>> + count += done;
>> + for (i = 0; i < done; i++) {
>> + if (buf[i] != 0xfe) {
>> + continue;
>> + }
>> + if (buf[i-1] != 0xCa) {
>> + continue;
>> + }
>> + if (buf[i-2] != 0xed) {
>> + continue;
>> + }
>> + if (buf[i-3] == 0xFe) {
>> + goto out;
>> + }
>
>Use consistent capitalization here?
>Better way to look for the signature?
This code looks ugly, but runs fast. :)
And since we are looking for a better solution, this piece of code will not
be kept in the final version of curling.
> Or, what happens if it just
>happens that the data contains that magic constant?
THAT IS THE PROBLEM: ft will fail if that happens. I hope for better and faster solutions. Any suggestions?
Besides, I have tried a checksum solution, which is slow. :(
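To make the risk the reviewers are discussing concrete, here is a standalone C sketch of the EOF-marker scan (illustrative names, not the patch's code). On random page data, any 4-byte window matches the marker with probability 2^-32, so roughly one spurious hit is expected per 4 GiB scanned, which is why an in-band magic number alone is not a safe delimiter:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan a stream chunk for the byte sequence fe ed ca fe (the patch's
 * 0xfeedcafe EOF marker in memory order).  Returns the index of the
 * marker's last byte, or -1 if the chunk contains no marker. */
static ptrdiff_t find_eof_marker(const uint8_t *buf, size_t len)
{
    for (size_t i = 3; i < len; i++) {
        if (buf[i] == 0xfe && buf[i - 1] == 0xca &&
            buf[i - 2] == 0xed && buf[i - 3] == 0xfe) {
            return (ptrdiff_t)i;
        }
    }
    return -1;   /* no marker in this chunk */
}
```

The scan itself is cheap; the weakness is purely that guest memory may legitimately contain the same four bytes.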
* Re: [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance
2013-09-10 3:43 [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance Jules Wang
` (3 preceding siblings ...)
2013-09-10 3:43 ` [Qemu-devel] [PATCH RFC 4/4] Curling: the receiver Jules Wang
@ 2013-09-10 12:27 ` Orit Wasserman
2013-09-11 1:54 ` junqing.wang
4 siblings, 1 reply; 20+ messages in thread
From: Orit Wasserman @ 2013-09-10 12:27 UTC (permalink / raw)
To: Jules Wang; +Cc: pbonzini, qemu-devel, stefanha, quintela
On 09/10/2013 06:43 AM, Jules Wang wrote:
> The goal of Curling(sports) is to provide a fault tolerant mechanism for KVM,
> so that in the event of a hardware failure, the virtual machine fails over to
> the backup in a way that is completely transparent to the guest operating system.
>
> Our goal is exactly the same as the goal of Kemari, by which Curling is
> inspired. However, Curling is simpler than Kemari (too simple, I'm afraid):
>
> * By leveraging live migration feature, we do endless live migrations between
> the sender and receiver, so the two virtual machines are synchronized.
>
Hi,
There are two issues I see with your solution,
The first is that if the VM failure happens in the middle of the live migration,
the backup VM state will be inconsistent, which means you can't fail over to it.
Solving this is not simple, as you need some transaction mechanism that will
change the backup VM state only when the transaction completes (the live migration completes).
Kemari has something like that.
The second is that, sadly, live migration doesn't always converge, which means
that the backup VM won't have a consistent state to fail over to.
You need to detect such a case and throttle down the guest to force convergence.
Regards,
Orit
> * The receiver does not load vm state once the migration begins; instead, it
> prefetches one whole round of migration data into a buffer, then loads vm state from that
> buffer afterwards. This "all or nothing" approach prevents the
> broken-in-the-middle problem Kemari has.
>
> * The sender sleeps a little while after each migration, to ease the performance
> penalty entailed by vm_stop and iothread locks. This is a tradeoff between
> performance and accuracy.
>
> Usage:
> The steps of curling are the same as the steps of live migration except the
> following:
> 1. Start the receiver vm with -incoming curling:tcp:<address>:<port>
> 2. Start ft in the qemu monitor of sender vm by following cmdline:
> > migrate_set_speed <full bandwidth>
> > migrate curling:tcp:<address>:<port>
> 3. Connect to the receiver vm by vnc or spice. The screen of the vm is displayed
> when curling is ready.
> 4. Now, the sender vm is protected by ft, When it encounters a failure,
> the failover kicks in.
>
> Problems to be discussed:
> 1. When the receiver is prefetching data, how does it know where the EOF of
> one migration is?
>
> Currently, we use a magic number 0xfeedcafe to indicate the EOF.
> Any better solutions?
>
> 2. How to reduce the overhead entailed by vm_stop and iothread locks?
>
> Any solutions other than sleeping?
>
> --
>
> Jules Wang (4):
> Curling: add doc
> Curling: cmdline interface
> Curling: the sender
> Curling: the receiver
>
> arch_init.c | 18 +++--
> docs/curling.txt | 52 ++++++++++++++
> include/migration/migration.h | 2 +
> include/migration/qemu-file.h | 1 +
> include/sysemu/sysemu.h | 1 +
> migration.c | 61 ++++++++++++++--
> savevm.c | 158 ++++++++++++++++++++++++++++++++++++++++--
> 7 files changed, 277 insertions(+), 16 deletions(-)
> create mode 100644 docs/curling.txt
>
* Re: [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance
2013-09-10 12:27 ` [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance Orit Wasserman
@ 2013-09-11 1:54 ` junqing.wang
2013-09-12 7:37 ` Orit Wasserman
0 siblings, 1 reply; 20+ messages in thread
From: junqing.wang @ 2013-09-11 1:54 UTC (permalink / raw)
To: Orit Wasserman; +Cc: qemu-devel
Hi,
>The first is that if the VM failure happens in the middle of the live migration,
>the backup VM state will be inconsistent, which means you can't fail over to it.
Yes, I have been concerned about this problem. That is why we need the prefetch buffer.
>Solving this is not simple, as you need some transaction mechanism that will
>change the backup VM state only when the transaction completes (the live
>migration completes). Kemari has something like that.
The backup VM state will be loaded only when one whole round of migration data has been prefetched. Otherwise, the VM state will not be loaded, so the backup VM is ensured to have a consistent state, like a checkpoint.
However, how close this checkpoint is to the point of the VM failure depends on the workload and bandwidth.
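The "all or nothing" transaction property described above can be sketched as a tiny standalone C helper (illustrative names, not QEMU code): a snapshot only becomes the loadable state after it has arrived in full, so a connection cut mid-transfer leaves the previous complete checkpoint intact.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical backup-side state: one complete, loadable checkpoint and
 * one snapshot still being prefetched. */
struct backup_state {
    int complete_epoch;   /* last fully received snapshot */
    int partial_epoch;    /* snapshot currently being prefetched */
};

/* Called when a snapshot transfer finishes or aborts; only a complete
 * transfer is committed, partial data is simply discarded. */
static void commit_snapshot(struct backup_state *s, bool transfer_ok)
{
    if (transfer_ok) {
        s->complete_epoch = s->partial_epoch;
    }
}
```

On failover, the backup always loads `complete_epoch`, never a half-received snapshot.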
>The second is that, sadly, live migration doesn't always converge, which means
>that the backup VM won't have a consistent state to fail over to.
>You need to detect such a case and throttle down the guest to force convergence.
Yes, that's a problem. AFAIK, qemu already has an auto-convergence feature.
From another perspective, if many migrations cannot converge, maybe the workload is high and the bandwidth is low, and it is not recommended to use FT in general.
* Re: [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance
2013-09-11 1:54 ` junqing.wang
@ 2013-09-12 7:37 ` Orit Wasserman
2013-09-12 8:17 ` junqing.wang
0 siblings, 1 reply; 20+ messages in thread
From: Orit Wasserman @ 2013-09-12 7:37 UTC (permalink / raw)
To: junqing.wang; +Cc: qemu-devel
On 09/11/2013 04:54 AM, junqing.wang@cs2c.com.cn wrote:
> Hi,
>
>>The first is that if the VM failure happens in the middle of the live migration,
>>the backup VM state will be inconsistent, which means you can't fail over to it.
>
> Yes, I have been concerned about this problem. That is why we need the prefetch buffer.
>
You are right I missed that.
>>Solving this is not simple, as you need some transaction mechanism that will
>>change the backup VM state only when the transaction completes (the live
>>migration completes). Kemari has something like that.
>
> The backup VM state will be loaded only when one whole round of migration data has been prefetched. Otherwise, the VM state will not be loaded, so the backup VM is ensured to have a consistent state, like a checkpoint.
> However, how close this checkpoint is to the point of the VM failure depends on the workload and bandwidth.
>
At the moment, in your implementation, the prefetch buffer can be very large
(several copies of the guest memory size). Are you planning to address this issue?
>>The second is that, sadly, live migration doesn't always converge, which means
>>that the backup VM won't have a consistent state to fail over to.
>>You need to detect such a case and throttle down the guest to force convergence.
>
> Yes, that's a problem. AFAIK, qemu already has an auto-convergence feature.
How about activating it automatically when you do fault tolerance?
> From another perspective, if many migrations cannot converge, maybe the workload is high and the bandwidth is low, and it is not recommended to use FT in general.
>
I agree, but we need some way to notify the user of such a problem.
Regards,
Orit
>
>
* Re: [Qemu-devel] [PATCH RFC 0/4] Curling: KVM Fault Tolerance
2013-09-12 7:37 ` Orit Wasserman
@ 2013-09-12 8:17 ` junqing.wang
0 siblings, 0 replies; 20+ messages in thread
From: junqing.wang @ 2013-09-12 8:17 UTC (permalink / raw)
To: Orit Wasserman; +Cc: qemu-devel
hi,
>At the moment, in your implementation, the prefetch buffer can be very large
>(several copies of the guest memory size). Are you planning to address this issue?
>I agree, but we need some way to notify the user of such a problem.
This issue has been handled (maybe not in the best way). The prefetch buffer size can grow up to 1.5 * vm memory size. When the migration data size is larger than that, the prefetching is stopped with a warning (please refer to the code in patch 4/4) and the loading is started. In this situation, the broken-in-the-middle problem is inevitable.
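The growth policy described here (double the allocation until a hard cap of 1.5x guest RAM) can be sketched as a standalone C helper; the struct and function names are illustrative, not QEMU's:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical prefetch buffer with geometric growth and a hard cap. */
struct prefetch_buf {
    uint8_t *data;
    size_t   size;   /* current allocation */
    size_t   used;   /* bytes prefetched so far */
    size_t   cap;    /* hard upper bound (1.5x guest RAM in the patch) */
};

/* Ensure room for `need` more bytes; return 0 on success, -1 if the
 * cap would be exceeded or allocation fails.  A -1 corresponds to the
 * patch's "prefetch buffer becomes too large" warning path. */
static int pfb_reserve(struct prefetch_buf *b, size_t need)
{
    if (b->used + need <= b->size) {
        return 0;
    }
    size_t newsize = b->size ? b->size : 4096;
    while (b->used + need > newsize) {
        if (newsize * 2 > b->cap) {
            return -1;           /* snapshot no longer fits under the cap */
        }
        newsize *= 2;
    }
    uint8_t *p = realloc(b->data, newsize);
    if (!p) {
        return -1;               /* out of memory */
    }
    b->data = p;
    b->size = newsize;
    return 0;
}
```

Doubling keeps the number of reallocations logarithmic in the snapshot size, while the cap bounds worst-case receiver memory use.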
>>>The second is that, sadly, live migration doesn't always converge, which means
>>>that the backup VM won't have a consistent state to fail over to.
>>>You need to detect such a case and throttle down the guest to force convergence.
>
>> Yes, that's a problem. AFAIK, qemu already has an auto-convergence feature.
> How about activating it automatically when you do fault tolerance?
That is feasible. Any comments from others?