Date: Wed, 7 Oct 2015 11:56:38 +0100
From: "Dr. David Alan Gilbert"
Message-ID: <20151007105637.GB2710@work-vm>
In-Reply-To: <560D4430.3090203@virtuozzo.com>
Subject: Re: [Qemu-devel] [PATCH 1/1] migration: fix expected_downtime
To: Igor Redko
Cc: Amit Shah, "Denis V. Lunev", qemu-devel@nongnu.org, Juan Quintela

* Igor Redko (redkoi@virtuozzo.com) wrote:
> On 28.09.2015 22:22, Dr. David Alan Gilbert wrote:
> >* Denis V. Lunev (den@openvz.org) wrote:
> >>From: Igor Redko
> >>
> >>To get this estimate we must divide pending_size by bandwidth,
> >>according to the description of expected-downtime ("qmp-commands.hx:3246"):
> >>    "expected-downtime": only present while migration is active
> >>        total amount in ms for downtime that was calculated on
> >>        the last bitmap round (json-int)
> >>
> >>The previous version was simply wrong because dirty_bytes_rate and bandwidth
> >>are both measured in bytes/ms, so dividing the first by the second gives a
> >>dimensionless quantity.
> >>As the description above says, this value is shown during the active
> >>migration phase and is recalculated only after all memory has been
> >>transferred, and only if that process took more than 1 sec. So maybe
> >>nobody has noticed the bug.
> >
> >While I agree the existing code looks wrong, I don't see how this is
> >any more correct.
>
> This patch is aimed at fixing the units of expected_downtime. It is
> reasonable that expected_downtime should be measured in milliseconds,
> while the existing implementation has no units at all.

I agree your units are correct where the old ones aren't, and I agree it
needs fixing.  However, I'm worried about whether the value in your fix
is correct.

> >I think 'pending_size' is an estimate of the number of bytes left
> >to transfer, the intention being that most of those are transferred
> >prior to pausing the machine; if those are transferred before pausing
> >then they aren't part of the downtime.
> >
> Yes, 'pending_size' is an estimate of the number of bytes left to transfer,
> indeed.
> But the condition:
>     if (s->dirty_bytes_rate && transferred_bytes > 10000) {
> slightly modifies the meaning of pending_size.
> dirty_bytes_rate is set in migration_sync(), which is called when
> pending_size < max_downtime * bandwidth. This estimate is higher than
> max_downtime by design.

I don't think that check really modifies the meaning of pending_size;
it's just a sanity check that means we don't start trying to predict
downtime when we've not transmitted much yet.
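(A minimal, self-contained sketch of the unit argument being made here; the
variable names follow the quoted code, but the values are invented and this
is not the actual migration_thread() code:)

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Both rates are in bytes per millisecond, as in migration.c. */
        double bandwidth        = 100000.0;  /* invented transfer rate, bytes/ms */
        double dirty_bytes_rate =  40000.0;  /* invented dirtying rate, bytes/ms */
        uint64_t pending_size   = 50000000;  /* invented bytes left to send */

        /* Old code: (bytes/ms) / (bytes/ms) - the units cancel, so this is
         * a dimensionless ratio, not a time. */
        double old_estimate = dirty_bytes_rate / bandwidth;

        /* Patched code: bytes / (bytes/ms) - this does come out in ms. */
        double new_estimate = pending_size / bandwidth;

        printf("old (unitless): %f\n", old_estimate);
        printf("new (ms):       %f\n", new_estimate);
        return 0;
    }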
> >It feels that:
> >  * If the guest wasn't dirtying pages, then you wouldn't have to
> >    pause the guest; if it was just dirtying them a little then you
> >    wouldn't have much to transfer after the pages you'd already
> >    sent; so if the guest dirties pages fast then the estimate should be
> >    larger; so 'dirty_bytes_rate' being on top of the fraction feels right.
> >
> >  * If the bandwidth is higher then the estimate should be smaller; so
> >    'bandwidth' being on the bottom of the fraction feels right.
> >
> >Dave
> >
> The 'expected_downtime' in the existing code takes two types of values:
>     * positive - dirty_bytes_rate is higher than bandwidth. In this
>       case migration doesn't complete.
>     * zero - bandwidth is higher than dirty_bytes_rate. In this case
>       migration is possible, but we don't have the downtime value.

OK, I don't think that should give zero; my argument being that, given
that each pass through RAM takes some time, you're always going to have
some proportion of RAM that's dirty.

> This patch has some imperfections. But if we look back into history,
> it seems that this patch just restores the broken logic.
> The existing code was introduced by commit
> https://github.com/qemu/qemu/commit/90f8ae724a575861f093fbdbfd49a925bcfec327
> which claims that it just restores the mistakenly deleted estimation
> (commit
> https://github.com/qemu/qemu/commit/e4ed1541ac9413eac494a03532e34beaf8a7d1c5).
> Meanwhile, the estimation changed during this restore operation. The
> estimation before the removal (before
> e4ed1541ac9413eac494a03532e34beaf8a7d1c5) was just like the one in my patch.

Yes; remember I believe that the current code is wrong - I'm just not
sure your suggested fix is right.

> So maybe we should think about improving this estimation.
> I suggest using something like:
>     expected_downtime = migrate_max_downtime * dirty_bytes_rate / bandwidth
>
> In my opinion this is more correct than the existing approach, since the
> last step of the migration process (before the pause) is transferring
> max_size bytes (max_size = bandwidth * migrate_max_downtime() / 1000000).
> So the bytes that were dirtied during this step will be transferred during
> downtime. The transferred byte count is dirty_bytes_rate * max_size / bandwidth,
> or migrate_max_downtime * dirty_bytes_rate, and dividing by the bandwidth
> gives the formula:
>     expected_downtime = migrate_max_downtime * dirty_bytes_rate / bandwidth

Are you sure?  I thought migrate_max_downtime governs the time *after*
the pause; and it doesn't make sense to me for the expectation of the
system to be related to the expectation of the user.

Dave
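(Similarly, a small sketch of the arithmetic behind the proposed estimate;
the names echo the quoted formula, the values are invented, and the maximum
downtime is taken directly in milliseconds here rather than via the
nanosecond conversion implied by the quoted max_size expression:)

    #include <stdio.h>

    int main(void)
    {
        double bandwidth        = 100000.0; /* invented, bytes/ms */
        double dirty_bytes_rate =  40000.0; /* invented, bytes/ms */
        double max_downtime_ms  =    300.0; /* invented user limit, ms */

        /* Bytes left for the final pre-pause pass: everything that can be
         * pushed within the permitted downtime. */
        double max_size = bandwidth * max_downtime_ms;

        /* Bytes dirtied while that final pre-pause pass is being sent. */
        double dirtied = dirty_bytes_rate * (max_size / bandwidth);

        /* Time to push those dirtied bytes after the pause; this is the
         * proposed estimate and simplifies to
         * max_downtime_ms * dirty_bytes_rate / bandwidth. */
        double expected_downtime_ms = dirtied / bandwidth;

        printf("expected downtime: %.1f ms\n", expected_downtime_ms);
        return 0;
    }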
> Igor
>
> >>Signed-off-by: Igor Redko
> >>Reviewed-by: Anna Melekhova
> >>Signed-off-by: Denis V. Lunev
> >>CC: Juan Quintela
> >>CC: Amit Shah
> >>---
> >> migration/migration.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >>diff --git a/migration/migration.c b/migration/migration.c
> >>index 662e77e..d55d545 100644
> >>--- a/migration/migration.c
> >>+++ b/migration/migration.c
> >>@@ -994,7 +994,7 @@ static void *migration_thread(void *opaque)
> >>             /* if we haven't sent anything, we don't want to recalculate
> >>                10000 is a small enough number for our purposes */
> >>             if (s->dirty_bytes_rate && transferred_bytes > 10000) {
> >>-                s->expected_downtime = s->dirty_bytes_rate / bandwidth;
> >>+                s->expected_downtime = pending_size / bandwidth;
> >>             }
> >>
> >>             qemu_file_reset_rate_limit(s->file);
> >>--
> >>2.1.4
> >>
> >>
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK