* [Qemu-devel] [RFC 0/1] Rolling stats on colo
@ 2015-03-05 13:31 Dr. David Alan Gilbert (git)
2015-03-05 13:31 ` [Qemu-devel] [RFC 1/1] COLO: Add primary side rolling statistics Dr. David Alan Gilbert (git)
2015-03-06 1:48 ` [Qemu-devel] [RFC 0/1] Rolling stats on colo zhanghailiang
0 siblings, 2 replies; 12+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-03-05 13:31 UTC
To: qemu-devel
Cc: zhang.zhanghailiang, yunhong.jiang, eddie.dong, peter.huangpeng,
luis
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Hi,
I'm getting COLO running on a couple of our machines here
and wanted to see what was actually going on, so I merged
in my recent rolling-stats code:
http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
with the following patch, and now, on the primary side,
info migrate shows me:
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
Migration status: colo
total time: 0 milliseconds
colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
which suggests I've got a problem with the packet comparison; but that's
a separate issue I'll look at.
Dave
Dr. David Alan Gilbert (1):
COLO: Add primary side rolling statistics
hmp.c | 12 ++++++++++++
include/migration/migration.h | 3 +++
migration/colo.c | 15 +++++++++++++++
migration/migration.c | 30 ++++++++++++++++++++++++++++++
qapi-schema.json | 11 ++++++++++-
5 files changed, 70 insertions(+), 1 deletion(-)
--
2.1.0
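The rstats_* calls in the patch below come from the rolling-stats series linked above, which is not included in this posting. As a rough orientation, here is a minimal sketch of the API the patch appears to assume - the function names and the (value, timestamp) arguments match the patch, but RSTATS_VALUES, the struct layout and the weighted-mean formula are guesses, not the real implementation:

#include <stdint.h>
#include <stdlib.h>

#define RSTATS_VALUES 10               /* guess: matches rstats_init(10, 0.2) in the patch */

typedef struct RStats {
    double weight;                     /* smoothing factor, e.g. 0.2 */
    double mean, weighted_mean;
    int64_t min, max;
    uint64_t count;
    int64_t values[RSTATS_VALUES];     /* last few samples, printed as ...   */
    int64_t times[RSTATS_VALUES];      /* ... the "value@timestamp-ms" pairs */
} RStats;

static RStats *rstats_init(int nvalues, double weight)
{
    RStats *s = calloc(1, sizeof(*s));
    (void)nvalues;                     /* fixed-size ring in this sketch */
    s->weight = weight;
    return s;
}

static void rstats_add_value(RStats *s, int64_t value, int64_t time_ms)
{
    unsigned slot = s->count % RSTATS_VALUES;

    s->values[slot] = value;
    s->times[slot] = time_ms;
    if (s->count == 0 || value < s->min) {
        s->min = value;
    }
    if (s->count == 0 || value > s->max) {
        s->max = value;
    }
    /* plain running mean over every sample seen so far */
    s->mean += ((double)value - s->mean) / (double)(s->count + 1);
    /* exponentially weighted mean - assumed from the weight argument */
    s->weighted_mean = s->count == 0 ? (double)value :
        s->weight * (double)value + (1.0 - s->weight) * s->weighted_mean;
    s->count++;
}

rstats_reset() and rstats_as_RollingStats(), also used in the patch, are omitted here. With a weighted mean of this shape, a long run of near-zero samples decays the 'Weighted' figure towards zero, which would be consistent with the tiny value in the broken checkpoint numbers above.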
* [Qemu-devel] [RFC 1/1] COLO: Add primary side rolling statistics
2015-03-05 13:31 [Qemu-devel] [RFC 0/1] Rolling stats on colo Dr. David Alan Gilbert (git)
@ 2015-03-05 13:31 ` Dr. David Alan Gilbert (git)
2015-03-06 1:48 ` [Qemu-devel] [RFC 0/1] Rolling stats on colo zhanghailiang
1 sibling, 0 replies; 12+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2015-03-05 13:31 UTC
To: qemu-devel
Cc: zhang.zhanghailiang, yunhong.jiang, eddie.dong, peter.huangpeng,
luis
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Record:
Checkpoint lifetime (ms)
Pause time due to checkpoint (ms)
Checkpoint size (bytes)
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
hmp.c | 12 ++++++++++++
include/migration/migration.h | 3 +++
migration/colo.c | 15 +++++++++++++++
migration/migration.c | 30 ++++++++++++++++++++++++++++++
qapi-schema.json | 11 ++++++++++-
5 files changed, 70 insertions(+), 1 deletion(-)
diff --git a/hmp.c b/hmp.c
index c724efa..2b17ed0 100644
--- a/hmp.c
+++ b/hmp.c
@@ -197,6 +197,18 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
info->setup_time);
}
}
+ if (info->has_colo_checkpoint_stats) {
+ monitor_printf_RollingStats(mon, "colo checkpoint (ms)",
+ info->colo_checkpoint_stats);
+ }
+ if (info->has_colo_paused_stats) {
+ monitor_printf_RollingStats(mon, "colo paused time (ms)",
+ info->colo_paused_stats);
+ }
+ if (info->has_colo_size_stats) {
+ monitor_printf_RollingStats(mon, "colo checkpoint size",
+ info->colo_size_stats);
+ }
if (info->has_ram) {
monitor_printf(mon, "transferred ram: %" PRIu64 " kbytes\n",
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 9893467..564edaa 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -64,6 +64,9 @@ struct MigrationState
int64_t setup_time;
int64_t dirty_sync_count;
RStats *expected_downtime_stats;
+ RStats *colo_checkpoint_stats;
+ RStats *colo_paused_stats;
+ RStats *colo_size_stats;
};
enum {
diff --git a/migration/colo.c b/migration/colo.c
index 042dec8..653ef25 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -15,6 +15,7 @@
#include "sysemu/sysemu.h"
#include "migration/migration-colo.h"
#include "qemu/error-report.h"
+#include "qemu/rolling-stats.h"
#include "migration/migration-failover.h"
#include "net/colo-nic.h"
#include "block/block.h"
@@ -272,6 +273,7 @@ static int do_colo_transaction(MigrationState *s, QEMUFile *control)
int ret;
size_t size;
QEMUFile *trans = NULL;
+ int64_t stop_time, start_time;
ret = colo_ctl_put(s->file, COLO_CHECKPOINT_NEW);
if (ret < 0) {
@@ -295,6 +297,7 @@ static int do_colo_transaction(MigrationState *s, QEMUFile *control)
goto out;
}
/* suspend and save vm state to colo buffer */
+ stop_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
qemu_mutex_lock_iothread();
vm_stop_force_state(RUN_STATE_COLO);
qemu_mutex_unlock_iothread();
@@ -343,6 +346,9 @@ static int do_colo_transaction(MigrationState *s, QEMUFile *control)
if (ret < 0) {
goto out;
}
+
+ rstats_add_value(s->colo_size_stats, size, stop_time);
+
ret = colo_ctl_get(control, COLO_CHECKPOINT_RECEIVED);
if (ret < 0) {
goto out;
@@ -366,6 +372,11 @@ static int do_colo_transaction(MigrationState *s, QEMUFile *control)
qemu_mutex_lock_iothread();
vm_start();
qemu_mutex_unlock_iothread();
+ start_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+ rstats_add_value(s->colo_paused_stats,
+ start_time - stop_time,
+ start_time);
+
DPRINTF("vm resume to run again\n");
out:
@@ -450,6 +461,10 @@ static void *colo_thread(void *opaque)
DPRINTF("Net packets is not consistent!!!\n");
}
+ current_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+ rstats_add_value(s->colo_checkpoint_stats,
+ current_time - start_time,
+ current_time);
/* start a colo checkpoint */
if (do_colo_transaction(s, colo_control)) {
goto out;
diff --git a/migration/migration.c b/migration/migration.c
index 794d94a..7c0517a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -244,6 +244,21 @@ MigrationInfo *qmp_query_migrate(Error **errp)
case MIG_STATE_COLO:
info->has_status = true;
info->status = g_strdup("colo");
+ if (s->colo_checkpoint_stats) {
+ info->colo_checkpoint_stats =
+ rstats_as_RollingStats(s->colo_checkpoint_stats);
+ info->has_colo_checkpoint_stats = true;
+ }
+ if (s->colo_paused_stats) {
+ info->colo_paused_stats =
+ rstats_as_RollingStats(s->colo_paused_stats);
+ info->has_colo_paused_stats = true;
+ }
+ if (s->colo_size_stats) {
+ info->colo_size_stats =
+ rstats_as_RollingStats(s->colo_size_stats);
+ info->has_colo_size_stats = true;
+ }
/* TODO: display COLO specific informations(checkpoint info etc.),*/
break;
case MIG_STATE_COMPLETED:
@@ -433,6 +448,21 @@ static MigrationState *migrate_init(const MigrationParams *params)
} else {
rstats_reset(s->expected_downtime_stats);
}
+ if (!s->colo_checkpoint_stats) {
+ s->colo_checkpoint_stats = rstats_init(10, 0.2);
+ } else {
+ rstats_reset(s->colo_checkpoint_stats);
+ }
+ if (!s->colo_paused_stats) {
+ s->colo_paused_stats = rstats_init(10, 0.2);
+ } else {
+ rstats_reset(s->colo_paused_stats);
+ }
+ if (!s->colo_size_stats) {
+ s->colo_size_stats = rstats_init(10, 0.2);
+ } else {
+ rstats_reset(s->colo_size_stats);
+ }
s->bandwidth_limit = bandwidth_limit;
s->state = MIG_STATE_SETUP;
trace_migrate_set_state(MIG_STATE_SETUP);
diff --git a/qapi-schema.json b/qapi-schema.json
index 2ec35c7..f2a666c 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -486,6 +486,12 @@
# @expected-downtime-stats: #optional more detailed statistics from the
# downtime estimation.
#
+# @colo-checkpoint-stats: #optional The length of COLO checkpoints (ms)
+#
+# @colo-paused-stats: #optional The time paused (ms) as COLO took checkpoints
+#
+# @colo-size-stats: #optional The size of COLO checkpoints (bytes)
+#
# Since: 0.14.0
##
{ 'type': 'MigrationInfo',
@@ -496,7 +502,10 @@
'*expected-downtime': 'int',
'*downtime': 'int',
'*setup-time': 'int',
- '*expected-downtime-stats': 'RollingStats' } }
+ '*expected-downtime-stats': 'RollingStats',
+ '*colo-checkpoint-stats': 'RollingStats',
+ '*colo-paused-stats': 'RollingStats',
+ '*colo-size-stats': 'RollingStats' } }
##
# @query-migrate
--
2.1.0
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-05 13:31 [Qemu-devel] [RFC 0/1] Rolling stats on colo Dr. David Alan Gilbert (git)
2015-03-05 13:31 ` [Qemu-devel] [RFC 1/1] COLO: Add primary side rolling statistics Dr. David Alan Gilbert (git)
@ 2015-03-06 1:48 ` zhanghailiang
2015-03-06 1:52 ` zhanghailiang
2015-03-06 18:30 ` Dr. David Alan Gilbert
1 sibling, 2 replies; 12+ messages in thread
From: zhanghailiang @ 2015-03-06 1:48 UTC
To: Dr. David Alan Gilbert (git), qemu-devel
Cc: hangaohuai, yunhong.jiang, eddie.dong, peter.huangpeng, luis
On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Hi Dave,
>
> Hi,
> I'm getting COLO running on a couple of our machines here
> and wanted to see what was actually going on, so I merged
> in my recent rolling-stats code:
>
> http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
>
> with the following patch, and now I get on the primary side,
> info migrate shows me:
>
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
> Migration status: colo
> total time: 0 milliseconds
> colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
> colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
> colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
>
> which suggests I've got a problem with the packet comparison; but that's
> a separate issue I'll look at.
>
There is an obvious mistake we have made in the proxy: the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
so please fix it before doing the following tests. Sorry for this silly mistake; we should have tested it fully before posting it. ;)
To be honest, the proxy part on github is not fully integrated; we have cut it down just to make it easy to review and understand, so there may be some mistakes.
Thanks,
zhanghailiang
> Dave
>
> Dr. David Alan Gilbert (1):
> COLO: Add primary side rolling statistics
>
> hmp.c | 12 ++++++++++++
> include/migration/migration.h | 3 +++
> migration/colo.c | 15 +++++++++++++++
> migration/migration.c | 30 ++++++++++++++++++++++++++++++
> qapi-schema.json | 11 ++++++++++-
> 5 files changed, 70 insertions(+), 1 deletion(-)
>
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-06 1:48 ` [Qemu-devel] [RFC 0/1] Rolling stats on colo zhanghailiang
@ 2015-03-06 1:52 ` zhanghailiang
2015-03-06 18:30 ` Dr. David Alan Gilbert
1 sibling, 0 replies; 12+ messages in thread
From: zhanghailiang @ 2015-03-06 1:52 UTC
To: Dr. David Alan Gilbert (git), qemu-devel
Cc: hangaohuai, yunhong.jiang, eddie.dong, peter.huangpeng, luis
On 2015/3/6 9:48, zhanghailiang wrote:
> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Hi Dave,
>
>>
>> Hi,
>> I'm getting COLO running on a couple of our machines here
>> and wanted to see what was actually going on, so I merged
>> in my recent rolling-stats code:
>>
>> http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
>>
>> with the following patch, and now I get on the primary side,
>> info migrate shows me:
>>
>> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
>> Migration status: colo
>> total time: 0 milliseconds
>> colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
>> colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
>> colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
>>
>> which suggests I've got a problem with the packet comparison; but that's
>> a separate issue I'll look at.
>>
>
> There is an obvious mistake we have made in proxy, the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
s/IPS_UNTRACKED_BIT/IPS_COLO_TEMPLATE_BIT
> so please fix it before do the follow test. Sorry for this low-grade mistake, we should do full test before issue it. ;)
>
> To be honest, the proxy part in github is not integrated, we have cut it just for easy review and understand, so there may be some mistakes.
>
> Thanks,
> zhanghailiang
>
>
>> Dave
>>
>> Dr. David Alan Gilbert (1):
>> COLO: Add primary side rolling statistics
>>
>> hmp.c | 12 ++++++++++++
>> include/migration/migration.h | 3 +++
>> migration/colo.c | 15 +++++++++++++++
>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
>> qapi-schema.json | 11 ++++++++++-
>> 5 files changed, 70 insertions(+), 1 deletion(-)
>>
>
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-06 1:48 ` [Qemu-devel] [RFC 0/1] Rolling stats on colo zhanghailiang
2015-03-06 1:52 ` zhanghailiang
@ 2015-03-06 18:30 ` Dr. David Alan Gilbert
2015-03-09 2:37 ` Wen Congyang
2015-03-11 3:11 ` zhanghailiang
1 sibling, 2 replies; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-06 18:30 UTC
To: zhanghailiang
Cc: hangaohuai, yunhong.jiang, eddie.dong, qemu-devel,
peter.huangpeng, luis
* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
> >From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Hi Dave,
>
> >
> >Hi,
> > I'm getting COLO running on a couple of our machines here
> >and wanted to see what was actually going on, so I merged
> >in my recent rolling-stats code:
> >
> >http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
> >
> >with the following patch, and now I get on the primary side,
> >info migrate shows me:
> >
> >capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
> >Migration status: colo
> >total time: 0 milliseconds
> >colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
> >colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
> >colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
> >
> >which suggests I've got a problem with the packet comparison; but that's
> >a separate issue I'll look at.
> >
>
> There is an obvious mistake we have made in proxy, the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
> so please fix it before do the follow test. Sorry for this low-grade mistake, we should do full test before issue it. ;)
No, that's OK; we all make them.
However, that didn't cure my problem, but after a bit of experimentation I now have
COLO working pretty well; thanks for the help!
1) I had to disable IPv6 in the guest; it doesn't look like the
conntrack is coping with ICMPv6, and on our test network
we're getting a few tens of those each second, so it's constant
miscompares (they seem to be neighbour broadcasts and multicast
stuff).
2) It looks like virtio-net is sending ARPs - possibly every time
that a snapshot is loaded; it's not the 'qemu' announce-self code
(I added some debug there and it's not being called); and ARPs
cause a miscompare - so you get a continuous stream of miscompares,
because a miscompare triggers a new snapshot, which sends more ARPs.
I solved this by switching to e1000.
3) The other problem with virtio is that it's occasionally triggering a
'virtio: error trying to map MMIO memory' from qemu; I'm not sure
why, since the state COLO sends over should always be consistent.
4) With the e1000 setup, connections are generally fairly responsive,
but sshing into the guest takes *ages* (tens of seconds). I'm not sure
why, because a curl to a web server seems OK (less than a second)
and once the ssh is open it's pretty responsive.
5) I've seen one instance of:
'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
on the primary side.
Stats for a mostly idle guest are now showing:
colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) Count: 227 Values: 1650@1425666160229, 1661@1425666161998, 1662@1425666163736, 1687@1425666165524, 811@1425666166438, 788@1425666167298, 1619@1425666168992, 1699@1425666170793, 2711@1425666173602, 1633@1425666175315
colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) Count: 227 Values: 107@1425666160337, 75@1425666162074, 100@1425666163837, 102@1425666165627, 71@1425666166510, 74@1425666167373, 101@1425666169094, 97@1425666170891, 79@1425666173682, 97@1425666175413
colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 (Weighted: 4826386.5) Count: 227 Values: 5998892@1425666160230, 4660988@1425666161999, 6002996@1425666163737, 5945540@1425666165525, 4833356@1425666166439, 5510606@1425666167299, 5793692@1425666168993, 5584388@1425666170794, 7016684@1425666173603, 4349084@1425666175316
So, one checkpoint every ~1.5 seconds; that's just with an
ssh connected and a script doing a 'curl' to its http server
repeatedly. Running 'top' on the ssh with a fast refresh
brings the checkpoints much faster; I guess that's because
the output of top is quite random.
> To be honest, the proxy part in github is not integrated, we have cut it just for easy review and understand, so there may be some mistakes.
Yes, that's OK; and I've had a few kernel crashes; normally
when the qemu crashes, the kernel doesn't really like it;
but that's OK, I'm sure it will get better.
I added the following to make my debugging easier, which is how
I found the IPv6 problem.
diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
index 9e50b62..13c0b48 100644
--- a/xt_PMYCOLO.c
+++ b/xt_PMYCOLO.c
@@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
if (h == NULL) {
- pr_dbg("can't find master's ct for slaver packet\n");
+ pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
return NULL;
}
@@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
/* rcu_read_lock()ed by nf_hook_slow */
l3proto = __nf_ct_l3proto_find(pf);
if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
- pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
+ pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
NF_CT_STAT_INC_ATOMIC(&init_net, error);
NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
goto out;
>
> Thanks,
> zhanghailiang
Thanks,
Dave
>
>
> >Dave
> >
> >Dr. David Alan Gilbert (1):
> > COLO: Add primary side rolling statistics
> >
> > hmp.c | 12 ++++++++++++
> > include/migration/migration.h | 3 +++
> > migration/colo.c | 15 +++++++++++++++
> > migration/migration.c | 30 ++++++++++++++++++++++++++++++
> > qapi-schema.json | 11 ++++++++++-
> > 5 files changed, 70 insertions(+), 1 deletion(-)
> >
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-06 18:30 ` Dr. David Alan Gilbert
@ 2015-03-09 2:37 ` Wen Congyang
2015-03-09 8:55 ` Dr. David Alan Gilbert
2015-03-11 3:11 ` zhanghailiang
1 sibling, 1 reply; 12+ messages in thread
From: Wen Congyang @ 2015-03-09 2:37 UTC
To: Dr. David Alan Gilbert, zhanghailiang
Cc: hangaohuai, yunhong.jiang, eddie.dong, qemu-devel,
peter.huangpeng, luis
On 03/07/2015 02:30 AM, Dr. David Alan Gilbert wrote:
> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>
>> Hi Dave,
>>
>>>
>>> Hi,
>>> I'm getting COLO running on a couple of our machines here
>>> and wanted to see what was actually going on, so I merged
>>> in my recent rolling-stats code:
>>>
>>> http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
>>>
>>> with the following patch, and now I get on the primary side,
>>> info migrate shows me:
>>>
>>> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
>>> Migration status: colo
>>> total time: 0 milliseconds
>>> colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
>>> colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
>>> colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
>>>
>>> which suggests I've got a problem with the packet comparison; but that's
>>> a separate issue I'll look at.
>>>
>>
>> There is an obvious mistake we have made in proxy, the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
>> so please fix it before do the follow test. Sorry for this low-grade mistake, we should do full test before issue it. ;)
>
> No, that's OK; we all make them.
>
> However, that didn't cure my problem; but after a bit of experimentation I now have
> COLO working pretty well; thanks for the help!
>
> 1) I had to disable IPv6 in the guest; it doesn't look like the
> conntrack is coping with IPv6 ICMPV6, and on our test network
> we're getting a few 10s of those each second, so it's constant
> miscompares (they seem to be neighbour broadcasts and multicast
> stuff).
>
> 2) It looks like virtio-net is sending ARPs - possibly every time
> that a snapshot is loaded; it's not the 'qemu' announce-self code,
> (I added some debug there and it's not being called); and ARPs
> cause a miscompare - so you get a continuous streem of miscompares
> because a miscompare triggers a new snapshot, that sends more ARPs.
> I solved this by switching to e1000.
>
> 3) The other problem with virtio is it's occasionally triggering a
> 'virtio: error trying to map MMIO memory' from qemu; I'm not sure
> why, the state COLO sends over should always be consistent.
I haven't hit this problem. Can you provide your command line?
Does the primary or the secondary qemu report this error message?
>
> 4) With the e1000 setup; connections are generally fairly responsive,
> but sshing into the guest takes *ages* (10s of seconds). I'm not sure
> why, because a curl to a web server seems OK (less than a second)
> and once the ssh is open it's pretty responsive.
>
> 5) I've seen one instance of;
> 'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
> on the primary side.
It is a known bug in quorum. You can try this patch:
http://lists.nongnu.org/archive/html/qemu-devel/2015-01/msg04507.html
Thanks
Wen Congyang
>
> Stats for a mostly idle guest are now showing:
>
> colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) Count: 227 Values: 1650@1425666160229, 1661@1425666161998, 1662@1425666163736, 1687@1425666165524, 811@1425666166438, 788@1425666167298, 1619@1425666168992, 1699@1425666170793, 2711@1425666173602, 1633@1425666175315
> colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) Count: 227 Values: 107@1425666160337, 75@1425666162074, 100@1425666163837, 102@1425666165627, 71@1425666166510, 74@1425666167373, 101@1425666169094, 97@1425666170891, 79@1425666173682, 97@1425666175413
> colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 (Weighted: 4826386.5) Count: 227 Values: 5998892@1425666160230, 4660988@1425666161999, 6002996@1425666163737, 5945540@1425666165525, 4833356@1425666166439, 5510606@1425666167299, 5793692@1425666168993, 5584388@1425666170794, 7016684@1425666173603, 4349084@1425666175316
>
> So, one checkpoint every ~1.5 seconds; that's just with an
> ssh connected and a script doing a 'curl' to it's http
> repeatedly. Running 'top' on the ssh with a fast refresh
> brings the checkpoints much faster; I guess that's because
> the output of top is quite random.
>
>> To be honest, the proxy part in github is not integrated, we have cut it just for easy review and understand, so there may be some mistakes.
>
> Yes, that's OK; and I've had a few kernel crashes; normally
> when the qemu crashes, the kernel doesn't really like it;
> but that's OK, I'm sure it will get better.
>
> I added the following to make my debug easier; which is how
> I found the IPv6 problem.
>
> diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
> index 9e50b62..13c0b48 100644
> --- a/xt_PMYCOLO.c
> +++ b/xt_PMYCOLO.c
> @@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
> h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
>
> if (h == NULL) {
> - pr_dbg("can't find master's ct for slaver packet\n");
> + pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
> return NULL;
> }
>
> @@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
> /* rcu_read_lock()ed by nf_hook_slow */
> l3proto = __nf_ct_l3proto_find(pf);
> if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
> - pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
> + pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
> NF_CT_STAT_INC_ATOMIC(&init_net, error);
> NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
> goto out;
>
>>
>> Thanks,
>> zhanghailiang
>
> Thanks,
>
> Dave
>>
>>
>>> Dave
>>>
>>> Dr. David Alan Gilbert (1):
>>> COLO: Add primary side rolling statistics
>>>
>>> hmp.c | 12 ++++++++++++
>>> include/migration/migration.h | 3 +++
>>> migration/colo.c | 15 +++++++++++++++
>>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
>>> qapi-schema.json | 11 ++++++++++-
>>> 5 files changed, 70 insertions(+), 1 deletion(-)
>>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
>
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-09 2:37 ` Wen Congyang
@ 2015-03-09 8:55 ` Dr. David Alan Gilbert
2015-03-09 9:01 ` Wen Congyang
0 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-09 8:55 UTC
To: Wen Congyang
Cc: hangaohuai, zhanghailiang, yunhong.jiang, eddie.dong,
peter.huangpeng, qemu-devel, luis
* Wen Congyang (wency@cn.fujitsu.com) wrote:
> On 03/07/2015 02:30 AM, Dr. David Alan Gilbert wrote:
> > * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> >> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
> >>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >>
> >> Hi Dave,
> >>
> >>>
> >>> Hi,
> >>> I'm getting COLO running on a couple of our machines here
> >>> and wanted to see what was actually going on, so I merged
> >>> in my recent rolling-stats code:
> >>>
> >>> http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
> >>>
> >>> with the following patch, and now I get on the primary side,
> >>> info migrate shows me:
> >>>
> >>> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
> >>> Migration status: colo
> >>> total time: 0 milliseconds
> >>> colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
> >>> colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
> >>> colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
> >>>
> >>> which suggests I've got a problem with the packet comparison; but that's
> >>> a separate issue I'll look at.
> >>>
> >>
> >> There is an obvious mistake we have made in proxy, the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
> >> so please fix it before do the follow test. Sorry for this low-grade mistake, we should do full test before issue it. ;)
> >
> > No, that's OK; we all make them.
> >
> > However, that didn't cure my problem; but after a bit of experimentation I now have
> > COLO working pretty well; thanks for the help!
> >
> > 1) I had to disable IPv6 in the guest; it doesn't look like the
> > conntrack is coping with IPv6 ICMPV6, and on our test network
> > we're getting a few 10s of those each second, so it's constant
> > miscompares (they seem to be neighbour broadcasts and multicast
> > stuff).
> >
> > 2) It looks like virtio-net is sending ARPs - possibly every time
> > that a snapshot is loaded; it's not the 'qemu' announce-self code,
> > (I added some debug there and it's not being called); and ARPs
> > cause a miscompare - so you get a continuous streem of miscompares
> > because a miscompare triggers a new snapshot, that sends more ARPs.
> > I solved this by switching to e1000.
> >
> > 3) The other problem with virtio is it's occasionally triggering a
> > 'virtio: error trying to map MMIO memory' from qemu; I'm not sure
> > why, the state COLO sends over should always be consistent.
>
> I don't meet this problem. Can you provide your command line?
> Primary or secondary qemu reports this error message?
It's the secondary:
./try/bin/qemu-system-x86_64 -enable-kvm -nographic \
-boot c -m 2048 -smp 2 -S \
-netdev tap,id=hn0,script=$PWD/ifup-slave,\
downscript=no,colo_script=$PWD/colo-proxy/colo-proxy-script.sh,colo_nicname=em4 \
-device virtio-net-pci,mac=52:54:64:61:05:31,id=net-pci0,netdev=hn0 \
-drive driver=blkcolo,export=colo1,backing.file.filename=./Fedora-x86_64-20-20140407-sda.raw,backing.driver=raw,if=virtio\
-incoming tcp:0:8888
> > 4) With the e1000 setup; connections are generally fairly responsive,
> > but sshing into the guest takes *ages* (10s of seconds). I'm not sure
> > why, because a curl to a web server seems OK (less than a second)
> > and once the ssh is open it's pretty responsive.
> >
> > 5) I've seen one instance of;
> > 'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
> > on the primary side.
>
> It is a known bug in quorum. You can try this patch:
> http://lists.nongnu.org/archive/html/qemu-devel/2015-01/msg04507.html
OK, I'll try it, although I've only hit that bug once.
>
> Thanks
> Wen Congyang
Thanks for the reply,
Dave
>
> >
> > Stats for a mostly idle guest are now showing:
> >
> > colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) Count: 227 Values: 1650@1425666160229, 1661@1425666161998, 1662@1425666163736, 1687@1425666165524, 811@1425666166438, 788@1425666167298, 1619@1425666168992, 1699@1425666170793, 2711@1425666173602, 1633@1425666175315
> > colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) Count: 227 Values: 107@1425666160337, 75@1425666162074, 100@1425666163837, 102@1425666165627, 71@1425666166510, 74@1425666167373, 101@1425666169094, 97@1425666170891, 79@1425666173682, 97@1425666175413
> > colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 (Weighted: 4826386.5) Count: 227 Values: 5998892@1425666160230, 4660988@1425666161999, 6002996@1425666163737, 5945540@1425666165525, 4833356@1425666166439, 5510606@1425666167299, 5793692@1425666168993, 5584388@1425666170794, 7016684@1425666173603, 4349084@1425666175316
> >
> > So, one checkpoint every ~1.5 seconds; that's just with an
> > ssh connected and a script doing a 'curl' to it's http
> > repeatedly. Running 'top' on the ssh with a fast refresh
> > brings the checkpoints much faster; I guess that's because
> > the output of top is quite random.
> >
> >> To be honest, the proxy part in github is not integrated, we have cut it just for easy review and understand, so there may be some mistakes.
> >
> > Yes, that's OK; and I've had a few kernel crashes; normally
> > when the qemu crashes, the kernel doesn't really like it;
> > but that's OK, I'm sure it will get better.
> >
> > I added the following to make my debug easier; which is how
> > I found the IPv6 problem.
> >
> > diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
> > index 9e50b62..13c0b48 100644
> > --- a/xt_PMYCOLO.c
> > +++ b/xt_PMYCOLO.c
> > @@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
> > h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
> >
> > if (h == NULL) {
> > - pr_dbg("can't find master's ct for slaver packet\n");
> > + pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
> > return NULL;
> > }
> >
> > @@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
> > /* rcu_read_lock()ed by nf_hook_slow */
> > l3proto = __nf_ct_l3proto_find(pf);
> > if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
> > - pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
> > + pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
> > NF_CT_STAT_INC_ATOMIC(&init_net, error);
> > NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
> > goto out;
> >
> >>
> >> Thanks,
> >> zhanghailiang
> >
> > Thanks,
> >
> > Dave
> >>
> >>
> >>> Dave
> >>>
> >>> Dr. David Alan Gilbert (1):
> >>> COLO: Add primary side rolling statistics
> >>>
> >>> hmp.c | 12 ++++++++++++
> >>> include/migration/migration.h | 3 +++
> >>> migration/colo.c | 15 +++++++++++++++
> >>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
> >>> qapi-schema.json | 11 ++++++++++-
> >>> 5 files changed, 70 insertions(+), 1 deletion(-)
> >>>
> >>
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > .
> >
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-09 8:55 ` Dr. David Alan Gilbert
@ 2015-03-09 9:01 ` Wen Congyang
0 siblings, 0 replies; 12+ messages in thread
From: Wen Congyang @ 2015-03-09 9:01 UTC
To: Dr. David Alan Gilbert
Cc: hangaohuai, zhanghailiang, yunhong.jiang, eddie.dong,
peter.huangpeng, qemu-devel, luis
On 03/09/2015 04:55 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (wency@cn.fujitsu.com) wrote:
>> On 03/07/2015 02:30 AM, Dr. David Alan Gilbert wrote:
>>> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>>>> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
>>>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>>
>>>> Hi Dave,
>>>>
>>>>>
>>>>> Hi,
>>>>> I'm getting COLO running on a couple of our machines here
>>>>> and wanted to see what was actually going on, so I merged
>>>>> in my recent rolling-stats code:
>>>>>
>>>>> http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
>>>>>
>>>>> with the following patch, and now I get on the primary side,
>>>>> info migrate shows me:
>>>>>
>>>>> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
>>>>> Migration status: colo
>>>>> total time: 0 milliseconds
>>>>> colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
>>>>> colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
>>>>> colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
>>>>>
>>>>> which suggests I've got a problem with the packet comparison; but that's
>>>>> a separate issue I'll look at.
>>>>>
>>>>
>>>> There is an obvious mistake we have made in proxy, the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
>>>> so please fix it before do the follow test. Sorry for this low-grade mistake, we should do full test before issue it. ;)
>>>
>>> No, that's OK; we all make them.
>>>
>>> However, that didn't cure my problem; but after a bit of experimentation I now have
>>> COLO working pretty well; thanks for the help!
>>>
>>> 1) I had to disable IPv6 in the guest; it doesn't look like the
>>> conntrack is coping with IPv6 ICMPV6, and on our test network
>>> we're getting a few 10s of those each second, so it's constant
>>> miscompares (they seem to be neighbour broadcasts and multicast
>>> stuff).
>>>
>>> 2) It looks like virtio-net is sending ARPs - possibly every time
>>> that a snapshot is loaded; it's not the 'qemu' announce-self code,
>>> (I added some debug there and it's not being called); and ARPs
>>> cause a miscompare - so you get a continuous streem of miscompares
>>> because a miscompare triggers a new snapshot, that sends more ARPs.
>>> I solved this by switching to e1000.
>>>
>>> 3) The other problem with virtio is it's occasionally triggering a
>>> 'virtio: error trying to map MMIO memory' from qemu; I'm not sure
>>> why, the state COLO sends over should always be consistent.
>>
>> I don't meet this problem. Can you provide your command line?
>> Primary or secondary qemu reports this error message?
>
> It's the secondary;
>
> ./try/bin/qemu-system-x86_64 -enable-kvm -nographic \
> -boot c -m 2048 -smp 2 -S \
> -netdev tap,id=hn0,script=$PWD/ifup-slave,\
> downscript=no,colo_script=$PWD/colo-proxy/colo-proxy-script.sh,colo_nicname=em4 \
> -device virtio-net-pci,mac=52:54:64:61:05:31,id=net-pci0,netdev=hn0 \
> -drive driver=blkcolo,export=colo1,backing.file.filename=./Fedora-x86_64-20-20140407-sda.raw,backing.driver=raw,if=virtio\
> -incoming tcp:0:8888
>
>>> 4) With the e1000 setup; connections are generally fairly responsive,
>>> but sshing into the guest takes *ages* (10s of seconds). I'm not sure
>>> why, because a curl to a web server seems OK (less than a second)
>>> and once the ssh is open it's pretty responsive.
>>>
>>> 5) I've seen one instance of;
>>> 'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
>>> on the primary side.
>>
>> It is a known bug in quorum. You can try this patch:
>> http://lists.nongnu.org/archive/html/qemu-devel/2015-01/msg04507.html
>
> OK, I'll try it; although I've only hit that bug once.
You can also use qcow2 to avoid this problem.
Thanks
Wen Congyang
>
>>
>> Thanks
>> Wen Congyang
>
> Thanks for the reply,
>
> Dave
>>
>>>
>>> Stats for a mostly idle guest are now showing:
>>>
>>> colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) Count: 227 Values: 1650@1425666160229, 1661@1425666161998, 1662@1425666163736, 1687@1425666165524, 811@1425666166438, 788@1425666167298, 1619@1425666168992, 1699@1425666170793, 2711@1425666173602, 1633@1425666175315
>>> colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) Count: 227 Values: 107@1425666160337, 75@1425666162074, 100@1425666163837, 102@1425666165627, 71@1425666166510, 74@1425666167373, 101@1425666169094, 97@1425666170891, 79@1425666173682, 97@1425666175413
>>> colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 (Weighted: 4826386.5) Count: 227 Values: 5998892@1425666160230, 4660988@1425666161999, 6002996@1425666163737, 5945540@1425666165525, 4833356@1425666166439, 5510606@1425666167299, 5793692@1425666168993, 5584388@1425666170794, 7016684@1425666173603, 4349084@1425666175316
>>>
>>> So, one checkpoint every ~1.5 seconds; that's just with an
>>> ssh connected and a script doing a 'curl' to it's http
>>> repeatedly. Running 'top' on the ssh with a fast refresh
>>> brings the checkpoints much faster; I guess that's because
>>> the output of top is quite random.
>>>
>>>> To be honest, the proxy part in github is not integrated, we have cut it just for easy review and understand, so there may be some mistakes.
>>>
>>> Yes, that's OK; and I've had a few kernel crashes; normally
>>> when the qemu crashes, the kernel doesn't really like it;
>>> but that's OK, I'm sure it will get better.
>>>
>>> I added the following to make my debug easier; which is how
>>> I found the IPv6 problem.
>>>
>>> diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
>>> index 9e50b62..13c0b48 100644
>>> --- a/xt_PMYCOLO.c
>>> +++ b/xt_PMYCOLO.c
>>> @@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
>>> h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
>>>
>>> if (h == NULL) {
>>> - pr_dbg("can't find master's ct for slaver packet\n");
>>> + pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
>>> return NULL;
>>> }
>>>
>>> @@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
>>> /* rcu_read_lock()ed by nf_hook_slow */
>>> l3proto = __nf_ct_l3proto_find(pf);
>>> if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
>>> - pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
>>> + pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
>>> NF_CT_STAT_INC_ATOMIC(&init_net, error);
>>> NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
>>> goto out;
>>>
>>>>
>>>> Thanks,
>>>> zhanghailiang
>>>
>>> Thanks,
>>>
>>> Dave
>>>>
>>>>
>>>>> Dave
>>>>>
>>>>> Dr. David Alan Gilbert (1):
>>>>> COLO: Add primary side rolling statistics
>>>>>
>>>>> hmp.c | 12 ++++++++++++
>>>>> include/migration/migration.h | 3 +++
>>>>> migration/colo.c | 15 +++++++++++++++
>>>>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
>>>>> qapi-schema.json | 11 ++++++++++-
>>>>> 5 files changed, 70 insertions(+), 1 deletion(-)
>>>>>
>>>>
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>> .
>>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> .
>
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-06 18:30 ` Dr. David Alan Gilbert
2015-03-09 2:37 ` Wen Congyang
@ 2015-03-11 3:11 ` zhanghailiang
2015-03-11 9:06 ` Dr. David Alan Gilbert
1 sibling, 1 reply; 12+ messages in thread
From: zhanghailiang @ 2015-03-11 3:11 UTC
To: Dr. David Alan Gilbert
Cc: hangaohuai, Li Zhijian, yunhong.jiang, eddie.dong,
peter.huangpeng, qemu-devel, Gonglei (Arei), luis
Hi Dave,
Sorry for the late reply :)
On 2015/3/7 2:30, Dr. David Alan Gilbert wrote:
> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>
>> Hi Dave,
>>
>>>
>>> Hi,
>>> I'm getting COLO running on a couple of our machines here
>>> and wanted to see what was actually going on, so I merged
>>> in my recent rolling-stats code:
>>>
>>> http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
>>>
>>> with the following patch, and now I get on the primary side,
>>> info migrate shows me:
>>>
>>> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
>>> Migration status: colo
>>> total time: 0 milliseconds
>>> colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
>>> colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
>>> colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
>>>
>>> which suggests I've got a problem with the packet comparison; but that's
>>> a separate issue I'll look at.
>>>
>>
>> There is an obvious mistake we have made in proxy, the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
>> so please fix it before do the follow test. Sorry for this low-grade mistake, we should do full test before issue it. ;)
>
> No, that's OK; we all make them.
>
> However, that didn't cure my problem; but after a bit of experimentation I now have
> COLO working pretty well; thanks for the help!
>
> 1) I had to disable IPv6 in the guest; it doesn't look like the
> conntrack is coping with IPv6 ICMPV6, and on our test network
> we're getting a few 10s of those each second, so it's constant
> miscompares (they seem to be neighbour broadcasts and multicast
> stuff).
>
Hmm, yes, the proxy code on github does not support comparing ICMPv6 packets.
We will add this in the future.
> 2) It looks like virtio-net is sending ARPs - possibly every time
> that a snapshot is loaded; it's not the 'qemu' announce-self code,
> (I added some debug there and it's not being called); and ARPs
> cause a miscompare - so you get a continuous streem of miscompares
> because a miscompare triggers a new snapshot, that sends more ARPs.
> I solved this by switching to e1000.
>
I didn't hit this problem; I used tcpdump to capture the network packets and
did not find any ARPs after the VM was loaded on the slave.
Maybe I missed something. Are there any network-related servers/commands running in the VM?
And what's your tcpdump command line?
> 3) The other problem with virtio is it's occasionally triggering a
> 'virtio: error trying to map MMIO memory' from qemu; I'm not sure
> why, the state COLO sends over should always be consistent.
>
> 4) With the e1000 setup; connections are generally fairly responsive,
> but sshing into the guest takes *ages* (10s of seconds). I'm not sure
> why, because a curl to a web server seems OK (less than a second)
> and once the ssh is open it's pretty responsive.
>
Er, have you tried to ssh into the guest when it is not in COLO mode? Does that also take a long time?
I have encountered a similar situation where the slave VM appeared to be running ('info status' said 'running')
but could not respond to keyboard input from VNC. Maybe there is something wrong with the device state; I
will look into it.
> 5) I've seen one instance of;
> 'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
> on the primary side.
>
> Stats for a mostly idle guest are now showing:
>
> colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) Count: 227 Values: 1650@1425666160229, 1661@1425666161998, 1662@1425666163736, 1687@1425666165524, 811@1425666166438, 788@1425666167298, 1619@1425666168992, 1699@1425666170793, 2711@1425666173602, 1633@1425666175315
> colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) Count: 227 Values: 107@1425666160337, 75@1425666162074, 100@1425666163837, 102@1425666165627, 71@1425666166510, 74@1425666167373, 101@1425666169094, 97@1425666170891, 79@1425666173682, 97@1425666175413
> colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 (Weighted: 4826386.5) Count: 227 Values: 5998892@1425666160230, 4660988@1425666161999, 6002996@1425666163737, 5945540@1425666165525, 4833356@1425666166439, 5510606@1425666167299, 5793692@1425666168993, 5584388@1425666170794, 7016684@1425666173603, 4349084@1425666175316
>
> So, one checkpoint every ~1.5 seconds; that's just with an
> ssh connected and a script doing a 'curl' to it's http
> repeatedly. Running 'top' on the ssh with a fast refresh
> brings the checkpoints much faster; I guess that's because
> the output of top is quite random.
>
Yes, it is a known problem; it is not only the 'top' command, actually: any command with
random output may result in continuous miscompares.
Besides, the data transferred through SSH is encrypted, which makes things worse.
One way to solve this problem might be:
if we detect a continuous stream of miscompares, we fall back to Microcheckpointing mode (periodic checkpoints).
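A very rough sketch of what that fallback could look like on the primary side (purely illustrative: MISCOMPARE_RUN_LIMIT, miscompare_run, periodic_mode and colo_note_checkpoint_reason() are invented for this sketch and are not part of the posted code):

#include <stdbool.h>

#define MISCOMPARE_RUN_LIMIT 20    /* invented threshold */

static unsigned miscompare_run;    /* consecutive miscompare-triggered checkpoints */
static bool periodic_mode;         /* true: checkpoint on a timer, skip comparing */

/* Call once per checkpoint with the reason it was taken. */
static void colo_note_checkpoint_reason(bool caused_by_miscompare)
{
    if (caused_by_miscompare) {
        if (++miscompare_run >= MISCOMPARE_RUN_LIMIT) {
            /* comparison clearly isn't converging (ssh, top, ...);
             * behave like MC-style periodic checkpointing for a while */
            periodic_mode = true;
        }
    } else {
        /* a timer-driven checkpoint with no miscompare: the traffic has
         * gone quiet again, so re-arm normal COLO comparison */
        miscompare_run = 0;
        periodic_mode = false;
    }
}

The real policy would presumably live next to the existing checkpoint trigger in colo_thread().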
>> To be honest, the proxy part in github is not integrated, we have cut it just for easy review and understand, so there may be some mistakes.
>
> Yes, that's OK; and I've had a few kernel crashes; normally
> when the qemu crashes, the kernel doesn't really like it;
> but that's OK, I'm sure it will get better.
>
Hmm, thanks very much for your feedback; we are doing our best to improve it... ;)
> I added the following to make my debug easier; which is how
> I found the IPv6 problem.
>
> diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
> index 9e50b62..13c0b48 100644
> --- a/xt_PMYCOLO.c
> +++ b/xt_PMYCOLO.c
> @@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
> h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
>
> if (h == NULL) {
> - pr_dbg("can't find master's ct for slaver packet\n");
> + pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
> return NULL;
> }
>
> @@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
> /* rcu_read_lock()ed by nf_hook_slow */
> l3proto = __nf_ct_l3proto_find(pf);
> if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
> - pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
> + pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
> NF_CT_STAT_INC_ATOMIC(&init_net, error);
> NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
> goto out;
>
>>
>> Thanks,
>> zhanghailiang
>
> Thanks,
>
> Dave
>>
>>
>>> Dave
>>>
>>> Dr. David Alan Gilbert (1):
>>> COLO: Add primary side rolling statistics
>>>
>>> hmp.c | 12 ++++++++++++
>>> include/migration/migration.h | 3 +++
>>> migration/colo.c | 15 +++++++++++++++
>>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
>>> qapi-schema.json | 11 ++++++++++-
>>> 5 files changed, 70 insertions(+), 1 deletion(-)
>>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
> .
>
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-11 3:11 ` zhanghailiang
@ 2015-03-11 9:06 ` Dr. David Alan Gilbert
2015-03-11 9:31 ` zhanghailiang
0 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-11 9:06 UTC
To: zhanghailiang
Cc: Li Zhijian, yunhong.jiang, eddie.dong, peter.huangpeng,
qemu-devel, Gonglei (Arei), luis
* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> Hi Dave,
>
> Sorry for the late reply :)
No problem.
> On 2015/3/7 2:30, Dr. David Alan Gilbert wrote:
> >* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> >>On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
> >>>From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >>
> >>Hi Dave,
> >>
> >>>
> >>>Hi,
> >>> I'm getting COLO running on a couple of our machines here
> >>>and wanted to see what was actually going on, so I merged
> >>>in my recent rolling-stats code:
> >>>
> >>>http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
> >>>
> >>>with the following patch, and now I get on the primary side,
> >>>info migrate shows me:
> >>>
> >>>capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
> >>>Migration status: colo
> >>>total time: 0 milliseconds
> >>>colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
> >>>colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
> >>>colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
> >>>
> >>>which suggests I've got a problem with the packet comparison; but that's
> >>>a separate issue I'll look at.
> >>>
> >>
> >>There is an obvious mistake we have made in proxy, the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
> >>so please fix it before do the follow test. Sorry for this low-grade mistake, we should do full test before issue it. ;)
> >
> >No, that's OK; we all make them.
> >
> >However, that didn't cure my problem; but after a bit of experimentation I now have
> >COLO working pretty well; thanks for the help!
> >
> > 1) I had to disable IPv6 in the guest; it doesn't look like the
> > conntrack is coping with IPv6 ICMPV6, and on our test network
> > we're getting a few 10s of those each second, so it's constant
> > miscompares (they seem to be neighbour broadcasts and multicast
> > stuff).
> >
>
> Hmm, yes, the proxy code in github does not support ICMPV6 packet comparing.
> We will add this in the future.
>
> > 2) It looks like virtio-net is sending ARPs - possibly every time
> > that a snapshot is loaded; it's not the 'qemu' announce-self code,
> > (I added some debug there and it's not being called); and ARPs
> > cause a miscompare - so you get a continuous stream of miscompares
> > because a miscompare triggers a new snapshot, that sends more ARPs.
> > I solved this by switching to e1000.
> >
>
> I didn't meet this problem; I used tcpdump to capture the net packets and
> did not find any ARPs after the VM was loaded on the slave.
Interesting.
> Maybe I missed something. Are there any network-related services or commands running in the VM?
I don't think so, and even if they were, I don't think they would go away
by switching to an e1000; I see there is a 'VIRTIO_NET_S_ANNOUNCE' feature
in virtio-net, and I suspect it's that which is doing it, but maybe it
depends on the guest/host kernels having it enabled?
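Roughly the mechanism I have in mind, as a self-contained sketch: the
constants follow the virtio-net spec, but maybe_announce() and its helpers
are invented stand-ins, not real driver or QEMU code.

/* When the device sets VIRTIO_NET_S_ANNOUNCE in its status field (and the
 * guest negotiated VIRTIO_NET_F_GUEST_ANNOUNCE), the guest driver emits
 * gratuitous ARPs and acks over the control queue. */
#include <stdint.h>
#include <stdio.h>

#define VIRTIO_NET_F_GUEST_ANNOUNCE 21  /* feature bit                  */
#define VIRTIO_NET_S_ANNOUNCE       2   /* "please announce" status bit */

static void send_gratuitous_arp(void) { puts("guest: gratuitous ARP"); }
static void ctrl_announce_ack(void)   { puts("guest: ANNOUNCE_ACK");   }

static void maybe_announce(uint64_t features, uint16_t status)
{
    if (!(features & (1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE))) {
        return;                        /* feature not negotiated */
    }
    if (status & VIRTIO_NET_S_ANNOUNCE) {
        send_gratuitous_arp();         /* the frame the COLO proxy would see */
        ctrl_announce_ack();           /* device then clears the status bit  */
    }
}

int main(void)
{
    /* e.g. after a snapshot load the device may set the bit again */
    maybe_announce(1ULL << VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_S_ANNOUNCE);
    return 0;
}

If that is what's happening, either tolerating gratuitous ARPs in the proxy
or not negotiating the feature would seem like the obvious workarounds.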
> And what's your tcpdump command line?
just tcpdump -i em4 -n -w outputfile
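If it helps, something like the libpcap sketch below captures only the
suspicious frames; 'em4' and the filter string are just examples, not
tooling from this thread:

/* Capture only ARP and ICMPv6 frames and print when they arrive; a sketch
 * using libpcap, assuming it is installed. */
#include <pcap/pcap.h>
#include <stdio.h>

static void on_packet(u_char *user, const struct pcap_pkthdr *h,
                      const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("len=%u at %ld.%06ld\n", h->len,
           (long)h->ts.tv_sec, (long)h->ts.tv_usec);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    struct bpf_program fp;
    pcap_t *p = pcap_open_live("em4", 65535, 1, 1000, errbuf);

    if (!p) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }
    if (pcap_compile(p, &fp, "arp or icmp6", 1, PCAP_NETMASK_UNKNOWN) < 0 ||
        pcap_setfilter(p, &fp) < 0) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(p));
        return 1;
    }
    return pcap_loop(p, -1, on_packet, NULL) < 0;
}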
> > 3) The other problem with virtio is it's occasionally triggering a
> > 'virtio: error trying to map MMIO memory' from qemu; I'm not sure
> > why, the state COLO sends over should always be consistent.
> >
> > 4) With the e1000 setup; connections are generally fairly responsive,
> > but sshing into the guest takes *ages* (10s of seconds). I'm not sure
> > why, because a curl to a web server seems OK (less than a second)
> > and once the ssh is open it's pretty responsive.
> >
>
> Er, have you tried to ssh into the guest when it is not in COLO mode? Is it also taking a long time?
Not yet; I'm going to try to add some logging to find out why.
> I have encountered a similar situation where the slave VM has effectively hung: 'info status' says 'running',
> but the VM cannot respond to keyboard input from VNC. Maybe there is something wrong with the device status; I
> will look into it.
>
> > 5) I've seen one instance of;
> > 'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
> > on the primary side.
> >
> >Stats for a mostly idle guest are now showing:
> >
> >colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) Count: 227 Values: 1650@1425666160229, 1661@1425666161998, 1662@1425666163736, 1687@1425666165524, 811@1425666166438, 788@1425666167298, 1619@1425666168992, 1699@1425666170793, 2711@1425666173602, 1633@1425666175315
> >colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) Count: 227 Values: 107@1425666160337, 75@1425666162074, 100@1425666163837, 102@1425666165627, 71@1425666166510, 74@1425666167373, 101@1425666169094, 97@1425666170891, 79@1425666173682, 97@1425666175413
> >colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 (Weighted: 4826386.5) Count: 227 Values: 5998892@1425666160230, 4660988@1425666161999, 6002996@1425666163737, 5945540@1425666165525, 4833356@1425666166439, 5510606@1425666167299, 5793692@1425666168993, 5584388@1425666170794, 7016684@1425666173603, 4349084@1425666175316
> >
> >So, one checkpoint every ~1.5 seconds; that's just with an
> >ssh connected and a script doing a 'curl' to its http server
> >repeatedly. Running 'top' on the ssh with a fast refresh
> >brings the checkpoints much faster; I guess that's because
> >the output of top is quite random.
> >
>
> Yes, it is a known problem; actually, it is not only the 'top' command - any command with
> random output may result in continuous miscompares.
> Besides, the data transferred through SSH is encrypted, which makes things worse.
>
> One way to solve this problem might be:
> if we detect a continuous stream of miscompares, we fall back to Microcheckpointing mode (periodic checkpoints).
Yes, I was going to try and implement that fallback - I've got some ideas
to try for it.
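Roughly, a simple rate detector; the sketch below is only an illustration
with invented names and thresholds, not actual COLO code:

/* Count checkpoints that were forced by a miscompare; if too many land
 * inside a short window, stop comparing and checkpoint periodically. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WINDOW_MS       5000
#define MAX_MISCOMPARES 20

typedef struct {
    uint64_t window_start_ms;
    unsigned miscompares;
    bool     periodic_mode;
} ColoFallback;

static void colo_note_miscompare(ColoFallback *f, uint64_t now_ms)
{
    if (now_ms - f->window_start_ms > WINDOW_MS) {
        f->window_start_ms = now_ms;   /* start a fresh window           */
        f->miscompares = 0;
        f->periodic_mode = false;      /* give comparison another chance */
    }
    if (++f->miscompares > MAX_MISCOMPARES) {
        f->periodic_mode = true;       /* checkpoint on a timer instead  */
    }
}

int main(void)
{
    ColoFallback f = { 0, 0, false };
    for (uint64_t t = 0; t < 3000; t += 100) {  /* 30 miscompares in 3s */
        colo_note_miscompare(&f, t);
    }
    printf("periodic_mode = %d\n", f.periodic_mode);  /* prints 1 */
    return 0;
}

The real thing would also want some hysteresis before switching back to
comparison mode.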
> >>To be honest, the proxy part on github is not integrated; we have cut it down just to make it easy to review and understand, so there may be some mistakes.
> >
> >Yes, that's OK; and I've had a few kernel crashes; normally
> >when the qemu crashes, the kernel doesn't really like it;
> >but that's OK, I'm sure it will get better.
> >
>
> Hmm, thanks very much for your feedback; we are doing our best to improve it... ;)
Thanks,
Dave
>
> >I added the following to make my debug easier; which is how
> >I found the IPv6 problem.
> >
> >diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
> >index 9e50b62..13c0b48 100644
> >--- a/xt_PMYCOLO.c
> >+++ b/xt_PMYCOLO.c
> >@@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
> > h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
> >
> > if (h == NULL) {
> >- pr_dbg("can't find master's ct for slaver packet\n");
> >+ pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
> > return NULL;
> > }
> >
> >@@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
> > /* rcu_read_lock()ed by nf_hook_slow */
> > l3proto = __nf_ct_l3proto_find(pf);
> > if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
> >- pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
> >+ pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
> > NF_CT_STAT_INC_ATOMIC(&init_net, error);
> > NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
> > goto out;
> >
> >>
> >>Thanks,
> >>zhanghailiang
> >
> >Thanks,
> >
> >Dave
> >>
> >>
> >>>Dave
> >>>
> >>>Dr. David Alan Gilbert (1):
> >>> COLO: Add primary side rolling statistics
> >>>
> >>> hmp.c | 12 ++++++++++++
> >>> include/migration/migration.h | 3 +++
> >>> migration/colo.c | 15 +++++++++++++++
> >>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
> >>> qapi-schema.json | 11 ++++++++++-
> >>> 5 files changed, 70 insertions(+), 1 deletion(-)
> >>>
> >>
> >>
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >.
> >
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-11 9:06 ` Dr. David Alan Gilbert
@ 2015-03-11 9:31 ` zhanghailiang
2015-03-11 10:07 ` Dr. David Alan Gilbert
0 siblings, 1 reply; 12+ messages in thread
From: zhanghailiang @ 2015-03-11 9:31 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: hangaohuai, Li Zhijian, yunhong.jiang, eddie.dong,
peter.huangpeng, qemu-devel, Gonglei (Arei), luis
On 2015/3/11 17:06, Dr. David Alan Gilbert wrote:
> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>> Hi Dave,
>>
>> Sorry for the late reply :)
>
> No problem.
>
>> On 2015/3/7 2:30, Dr. David Alan Gilbert wrote:
>>> * zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
>>>> On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
>>>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>>
>>>> Hi Dave,
>>>>
>>>>>
>>>>> Hi,
>>>>> I'm getting COLO running on a couple of our machines here
>>>>> and wanted to see what was actually going on, so I merged
>>>>> in my recent rolling-stats code:
>>>>>
>>>>> http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
>>>>>
>>>>> with the following patch, and now I get on the primary side,
>>>>> info migrate shows me:
>>>>>
>>>>> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
>>>>> Migration status: colo
>>>>> total time: 0 milliseconds
>>>>> colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
>>>>> colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
>>>>> colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
>>>>>
>>>>> which suggests I've got a problem with the packet comparison; but that's
>>>>> a separate issue I'll look at.
>>>>>
>>>>
>>>> There is an obvious mistake we have made in the proxy: the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
>>>> so please fix it before doing the following test. Sorry for this low-grade mistake; we should have done a full test before issuing it. ;)
>>>
>>> No, that's OK; we all make them.
>>>
>>> However, that didn't cure my problem; but after a bit of experimentation I now have
>>> COLO working pretty well; thanks for the help!
>>>
>>> 1) I had to disable IPv6 in the guest; it doesn't look like the
>>> conntrack is coping with ICMPv6, and on our test network
>>> we're getting a few tens of those packets each second, so there are
>>> constant miscompares (they seem to be neighbour broadcasts and
>>> multicast stuff).
>>>
>>
>> Hmm, yes, the proxy code on github does not support ICMPv6 packet comparison.
>> We will add this in the future.
>>
>>> 2) It looks like virtio-net is sending ARPs - possibly every time
>>> that a snapshot is loaded; it's not the 'qemu' announce-self code,
>>> (I added some debug there and it's not being called); and ARPs
>>> cause a miscompare - so you get a continuous stream of miscompares
>>> because a miscompare triggers a new snapshot, that sends more ARPs.
>>> I solved this by switching to e1000.
>>>
>>
>> I didn't meet this problem; I used tcpdump to capture the net packets and
>> did not find any ARPs after the VM was loaded on the slave.
>
> Interesting.
>
>> Maybe I missed something. Are there any network-related services or commands running in the VM?
>
> I don't think so, and even if they were, I don't think they would go away
> by switching to an e1000; I see there is a 'VIRTIO_NET_S_ANNOUNCE' feature
> in virtio-net, and I suspect it's that which is doing it, but maybe it
> depends on the guest/host kernels having it enabled?
>
Er, quite possible. My host kernel is 3.14.0, and the guest is SUSE 11 SP3...
>> And what's your tcpdump command line?
>
> just tcpdump -i em4 -n -w outputfile
>
>>> 3) The other problem with virtio is it's occasionally triggering a
>>> 'virtio: error trying to map MMIO memory' from qemu; I'm not sure
>>> why, the state COLO sends over should always be consistent.
>>>
>>> 4) With the e1000 setup; connections are generally fairly responsive,
>>> but sshing into the guest takes *ages* (10s of seconds). I'm not sure
>>> why, because a curl to a web server seems OK (less than a second)
>>> and once the ssh is open it's pretty responsive.
>>>
>>
>> Er, have you tried to ssh into the guest when it is not in COLO mode? Is it also taking a long time?
>
> Not yet; I'm going to try to add some logging to find out why.
>
>> I have encountered a similar situation where the slave VM has effectively hung: 'info status' says 'running',
>> but the VM cannot respond to keyboard input from VNC. Maybe there is something wrong with the device status; I
>> will look into it.
>>
>>> 5) I've seen one instance of;
>>> 'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
>>> on the primary side.
>>>
>>> Stats for a mostly idle guest are now showing:
>>>
>>> colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) Count: 227 Values: 1650@1425666160229, 1661@1425666161998, 1662@1425666163736, 1687@1425666165524, 811@1425666166438, 788@1425666167298, 1619@1425666168992, 1699@1425666170793, 2711@1425666173602, 1633@1425666175315
>>> colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) Count: 227 Values: 107@1425666160337, 75@1425666162074, 100@1425666163837, 102@1425666165627, 71@1425666166510, 74@1425666167373, 101@1425666169094, 97@1425666170891, 79@1425666173682, 97@1425666175413
>>> colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 (Weighted: 4826386.5) Count: 227 Values: 5998892@1425666160230, 4660988@1425666161999, 6002996@1425666163737, 5945540@1425666165525, 4833356@1425666166439, 5510606@1425666167299, 5793692@1425666168993, 5584388@1425666170794, 7016684@1425666173603, 4349084@1425666175316
>>>
>>> So, one checkpoint every ~1.5 seconds; that's just with an
>>> ssh connected and a script doing a 'curl' to its http server
>>> repeatedly. Running 'top' on the ssh with a fast refresh
>>> brings the checkpoints much faster; I guess that's because
>>> the output of top is quite random.
>>>
>>
>> Yes, it is a known problem; actually, it is not only the 'top' command - any command with
>> random output may result in continuous miscompares.
>> Besides, the data transferred through SSH is encrypted, which makes things worse.
>>
>> One way to solve this problem might be:
>> if we detect a continuous stream of miscompares, we fall back to Microcheckpointing mode (periodic checkpoints).
>
> Yes, I was going to try and implement that fallback - I've got some ideas
> to try for it.
>
>>>> To be honest, the proxy part on github is not integrated; we have cut it down just to make it easy to review and understand, so there may be some mistakes.
>>>
>>> Yes, that's OK; and I've had a few kernel crashes; normally
>>> when the qemu crashes, the kernel doesn't really like it;
>>> but that's OK, I'm sure it will get better.
>>>
>>
>> Hmm, thanks very much for your feedback; we are doing our best to improve it... ;)
>
> Thanks,
>
> Dave
>
>>
>>> I added the following to make my debug easier; which is how
>>> I found the IPv6 problem.
>>>
>>> diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
>>> index 9e50b62..13c0b48 100644
>>> --- a/xt_PMYCOLO.c
>>> +++ b/xt_PMYCOLO.c
>>> @@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
>>> h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
>>>
>>> if (h == NULL) {
>>> - pr_dbg("can't find master's ct for slaver packet\n");
>>> + pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
>>> return NULL;
>>> }
>>>
>>> @@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
>>> /* rcu_read_lock()ed by nf_hook_slow */
>>> l3proto = __nf_ct_l3proto_find(pf);
>>> if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
>>> - pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
>>> + pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
>>> NF_CT_STAT_INC_ATOMIC(&init_net, error);
>>> NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
>>> goto out;
>>>
>>>>
>>>> Thanks,
>>>> zhanghailiang
>>>
>>> Thanks,
>>>
>>> Dave
>>>>
>>>>
>>>>> Dave
>>>>>
>>>>> Dr. David Alan Gilbert (1):
>>>>> COLO: Add primary side rolling statistics
>>>>>
>>>>> hmp.c | 12 ++++++++++++
>>>>> include/migration/migration.h | 3 +++
>>>>> migration/colo.c | 15 +++++++++++++++
>>>>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
>>>>> qapi-schema.json | 11 ++++++++++-
>>>>> 5 files changed, 70 insertions(+), 1 deletion(-)
>>>>>
>>>>
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>>> .
>>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
> .
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo
2015-03-11 9:31 ` zhanghailiang
@ 2015-03-11 10:07 ` Dr. David Alan Gilbert
0 siblings, 0 replies; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2015-03-11 10:07 UTC (permalink / raw)
To: zhanghailiang
Cc: hangaohuai, Li Zhijian, yunhong.jiang, eddie.dong,
peter.huangpeng, qemu-devel, Gonglei (Arei), luis
* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> On 2015/3/11 17:06, Dr. David Alan Gilbert wrote:
> >* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> >>Hi Dave,
> >>
> >>Sorry for the late reply :)
> >
> >No problem.
> >
> >>On 2015/3/7 2:30, Dr. David Alan Gilbert wrote:
> >>>* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> >>>>On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:
> >>>>>From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >>>>
> >>>>Hi Dave,
> >>>>
> >>>>>
> >>>>>Hi,
> >>>>> I'm getting COLO running on a couple of our machines here
> >>>>>and wanted to see what was actually going on, so I merged
> >>>>>in my recent rolling-stats code:
> >>>>>
> >>>>>http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html
> >>>>>
> >>>>>with the following patch, and now I get on the primary side,
> >>>>>info migrate shows me:
> >>>>>
> >>>>>capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off colo: on
> >>>>>Migration status: colo
> >>>>>total time: 0 milliseconds
> >>>>>colo checkpoint (ms): Min/Max: 0, 10000 Mean: -1.1415868e-13 (Weighted: 4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
> >>>>>colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 62@1425561742681, 61@1425561742743, 80@1425561742824
> >>>>>colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 55294@1425561742744, 145582@1425561742825
> >>>>>
> >>>>>which suggests I've got a problem with the packet comparison; but that's
> >>>>>a separate issue I'll look at.
> >>>>>
> >>>>
> >>>>There is an obvious mistake we have made in the proxy: the macro 'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
> >>>>so please fix it before doing the following test. Sorry for this low-grade mistake; we should have done a full test before issuing it. ;)
> >>>
> >>>No, that's OK; we all make them.
> >>>
> >>>However, that didn't cure my problem; but after a bit of experimentation I now have
> >>>COLO working pretty well; thanks for the help!
> >>>
> >>> 1) I had to disable IPv6 in the guest; it doesn't look like the
> >>> conntrack is coping with ICMPv6, and on our test network
> >>> we're getting a few tens of those packets each second, so there are
> >>> constant miscompares (they seem to be neighbour broadcasts and
> >>> multicast stuff).
> >>>
> >>
> >>Hmm, yes, the proxy code on github does not support ICMPv6 packet comparison.
> >>We will add this in the future.
> >>
> >>> 2) It looks like virtio-net is sending ARPs - possibly every time
> >>> that a snapshot is loaded; it's not the 'qemu' announce-self code,
> >>> (I added some debug there and it's not being called); and ARPs
> >>> cause a miscompare - so you get a continuous stream of miscompares
> >>> because a miscompare triggers a new snapshot, that sends more ARPs.
> >>> I solved this by switching to e1000.
> >>>
> >>
> >>I didn't meet this problem; I used tcpdump to capture the net packets and
> >>did not find any ARPs after the VM was loaded on the slave.
> >
> >Interesting.
> >
> >>Maybe I missed something. Are there any network-related services or commands running in the VM?
> >
> >I don't think so, and even if they were, I don't think they would go away
> >by switching to an e1000; I see there is a 'VIRTIO_NET_S_ANNOUNCE' feature
> >in virtio-net, and I suspect it's that which is doing it, but maybe it
> >depends on the guest/host kernels having it enabled?
> >
>
> Er, quite possible. My host kernel is 3.14.0, and the guest is SUSE 11 SP3...
I'm running 3.18 on both host and guest (Fedora 20 guest, RHEL7 host but
with a custom kernel).
Dave
>
> >>And what's your tcpdump command line?
> >
> >just tcpdump -i em4 -n -w outputfile
> >
> >>> 3) The other problem with virtio is it's occasionally triggering a
> >>> 'virtio: error trying to map MMIO memory' from qemu; I'm not sure
> >>> why, the state COLO sends over should always be consistent.
> >>>
> >>> 4) With the e1000 setup; connections are generally fairly responsive,
> >>> but sshing into the guest takes *ages* (10s of seconds). I'm not sure
> >>> why, because a curl to a web server seems OK (less than a second)
> >>> and once the ssh is open it's pretty responsive.
> >>>
> >>
> >>Er, have you tried to ssh into the guest when it is not in COLO mode? Is it also taking a long time?
> >
> >Not yet; I'm going to try to add some logging to find out why.
> >
> >>I have encountered a similar situation where the slave VM has effectively hung: 'info status' says 'running',
> >>but the VM cannot respond to keyboard input from VNC. Maybe there is something wrong with the device status; I
> >>will look into it.
> >>
> >>> 5) I've seen one instance of;
> >>> 'qemu-system-x86_64: block/raw-posix.c:836: handle_aiocb_rw: Assertion `p - buf == aiocb->aio_nbytes' failed.'
> >>> on the primary side.
> >>>
> >>>Stats for a mostly idle guest are now showing:
> >>>
> >>>colo checkpoint (ms): Min/Max: 0, 10004 Mean: 1592.1 (Weighted: 1806.214) Count: 227 Values: 1650@1425666160229, 1661@1425666161998, 1662@1425666163736, 1687@1425666165524, 811@1425666166438, 788@1425666167298, 1619@1425666168992, 1699@1425666170793, 2711@1425666173602, 1633@1425666175315
> >>>colo paused time (ms): Min/Max: 58, 2975 Mean: 90.3 (Weighted: 94.109752) Count: 227 Values: 107@1425666160337, 75@1425666162074, 100@1425666163837, 102@1425666165627, 71@1425666166510, 74@1425666167373, 101@1425666169094, 97@1425666170891, 79@1425666173682, 97@1425666175413
> >>>colo checkpoint size: Min/Max: 212252, 1.9241972e+08 Mean: 5569622.6 (Weighted: 4826386.5) Count: 227 Values: 5998892@1425666160230, 4660988@1425666161999, 6002996@1425666163737, 5945540@1425666165525, 4833356@1425666166439, 5510606@1425666167299, 5793692@1425666168993, 5584388@1425666170794, 7016684@1425666173603, 4349084@1425666175316
> >>>
> >>>So, one checkpoint every ~1.5 seconds; that's just with an
> >>>ssh connected and a script doing a 'curl' to its http server
> >>>repeatedly. Running 'top' on the ssh with a fast refresh
> >>>brings the checkpoints much faster; I guess that's because
> >>>the output of top is quite random.
> >>>
> >>
> >>Yes, it is a known problem; actually, it is not only the 'top' command - any command with
> >>random output may result in continuous miscompares.
> >>Besides, the data transferred through SSH is encrypted, which makes things worse.
> >>
> >>One way to solve this problem might be:
> >>if we detect a continuous stream of miscompares, we fall back to Microcheckpointing mode (periodic checkpoints).
> >
> >Yes, I was going to try and implement that fallback - I've got some ideas
> >to try for it.
> >
> >>>>To be honest, the proxy part on github is not integrated; we have cut it down just to make it easy to review and understand, so there may be some mistakes.
> >>>
> >>>Yes, that's OK; and I've had a few kernel crashes; normally
> >>>when the qemu crashes, the kernel doesn't really like it;
> >>>but that's OK, I'm sure it will get better.
> >>>
> >>
> >>Hmm, thanks very much for your feedback; we are doing our best to improve it... ;)
> >
> >Thanks,
> >
> >Dave
> >
> >>
> >>>I added the following to make my debug easier; which is how
> >>>I found the IPv6 problem.
> >>>
> >>>diff --git a/xt_PMYCOLO.c b/xt_PMYCOLO.c
> >>>index 9e50b62..13c0b48 100644
> >>>--- a/xt_PMYCOLO.c
> >>>+++ b/xt_PMYCOLO.c
> >>>@@ -1072,7 +1072,7 @@ resolve_master_ct(struct sk_buff *skb, unsigned int dataoff,
> >>> h = nf_conntrack_find_get(&init_net, NF_CT_DEFAULT_ZONE, &tuple);
> >>>
> >>> if (h == NULL) {
> >>>- pr_dbg("can't find master's ct for slaver packet\n");
> >>>+ pr_dbg("can't find master's ct for slaver packet (pf/l3num=%d protonum=%d)\n", l3num, protonum);
> >>> return NULL;
> >>> }
> >>>
> >>>@@ -1092,7 +1092,7 @@ nf_conntrack_slaver_in(u_int8_t pf, unsigned int hooknum,
> >>> /* rcu_read_lock()ed by nf_hook_slow */
> >>> l3proto = __nf_ct_l3proto_find(pf);
> >>> if (l3proto->get_l4proto(skb, skb_network_offset(skb), &dataoff, &protonum) <= 0) {
> >>>- pr_dbg("slaver: l3proto not prepared to track yet or error occurred\n");
> >>>+ pr_dbg("slaver: l3proto not prepared to track yet or error occurred (pf=%d)\n", pf);
> >>> NF_CT_STAT_INC_ATOMIC(&init_net, error);
> >>> NF_CT_STAT_INC_ATOMIC(&init_net, invalid);
> >>> goto out;
> >>>
> >>>>
> >>>>Thanks,
> >>>>zhanghailiang
> >>>
> >>>Thanks,
> >>>
> >>>Dave
> >>>>
> >>>>
> >>>>>Dave
> >>>>>
> >>>>>Dr. David Alan Gilbert (1):
> >>>>> COLO: Add primary side rolling statistics
> >>>>>
> >>>>> hmp.c | 12 ++++++++++++
> >>>>> include/migration/migration.h | 3 +++
> >>>>> migration/colo.c | 15 +++++++++++++++
> >>>>> migration/migration.c | 30 ++++++++++++++++++++++++++++++
> >>>>> qapi-schema.json | 11 ++++++++++-
> >>>>> 5 files changed, 70 insertions(+), 1 deletion(-)
> >>>>>
> >>>>
> >>>>
> >>>--
> >>>Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>
> >>>.
> >>>
> >>
> >>
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >.
> >
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2015-03-11 10:07 UTC | newest]
Thread overview: 12+ messages:
2015-03-05 13:31 [Qemu-devel] [RFC 0/1] Rolling stats on colo Dr. David Alan Gilbert (git)
2015-03-05 13:31 ` [Qemu-devel] [RFC 1/1] COLO: Add primary side rolling statistics Dr. David Alan Gilbert (git)
2015-03-06 1:48 ` [Qemu-devel] [RFC 0/1] Rolling stats on colo zhanghailiang
2015-03-06 1:52 ` zhanghailiang
2015-03-06 18:30 ` Dr. David Alan Gilbert
2015-03-09 2:37 ` Wen Congyang
2015-03-09 8:55 ` Dr. David Alan Gilbert
2015-03-09 9:01 ` Wen Congyang
2015-03-11 3:11 ` zhanghailiang
2015-03-11 9:06 ` Dr. David Alan Gilbert
2015-03-11 9:31 ` zhanghailiang
2015-03-11 10:07 ` Dr. David Alan Gilbert