From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from out-219.mta1.migadu.com (out-219.mta1.migadu.com [95.215.58.219])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3102A1A72C
	for <kvmarm@lists.linux.dev>; Wed, 13 Sep 2023 17:30:10 +0000 (UTC)
Date: Wed, 13 Sep 2023 17:30:03 +0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1694626209;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=2jxXuDsdxHjvkm94f22PEQvZk32OEtTjBlaxJsBMLbI=;
	b=c6QvLNV30cJqCkCX7oLBhgkG42dkPQdepaLwLCGf4Nn1+mCXbT8eTbHs5H6L5w5rjL5rJQ
	TxE2q8GVA8+z3twT2/ZJ1iM6czH3S4wEm9kNEbu0wfzZ6aaxafm5AYIh4iMRo4oyQMkGAy
	bNyxYeZyI163al5tNrlTnsuXODecGvk=
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Oliver Upton <oliver.upton@linux.dev>
To: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, maz@kernel.org,
	will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com, zhukeqian1@huawei.com,
	jonathan.cameron@huawei.com, linuxarm@huawei.com
Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log
Message-ID: <ZQHxm+L890yTpY91@linux.dev>
References: <20230825093528.1637-1-shameerali.kolothum.thodi@huawei.com>
Precedence: bulk
X-Mailing-List: kvmarm@lists.linux.dev
List-Id: <kvmarm.lists.linux.dev>
List-Subscribe: <mailto:kvmarm+subscribe@lists.linux.dev>
List-Unsubscribe: <mailto:kvmarm+unsubscribe@lists.linux.dev>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20230825093528.1637-1-shameerali.kolothum.thodi@huawei.com>
X-Migadu-Flow: FLOW_OUT

Hi Shameer,

On Fri, Aug 25, 2023 at 10:35:20AM +0100, Shameer Kolothum wrote:
> Hi,
> 
> This is to revive the RFC series[1], which makes use of hardware dirty
> bit modifier(DBM) feature(FEAT_HAFDBS) for dirty page tracking, sent
> out by Zhu Keqian sometime back.
> 
> One of the main drawbacks in using the hardware DBM feature for dirty
> page tracking is the additional overhead in scanning the PTEs for dirty
> pages[2]. Also there are no vCPU page faults when we set the DBM bit,
> which may result in higher convergence time during guest migration. 
> 
> This series tries to reduce these overheads by not setting the
> DBM for all the writeable pages during migration and instead uses a
> combined software(current page fault mechanism) and hardware approach
> (set DBM) for dirty page tracking.
> 
> As noted in RFC v1[1],
> "The core idea is that we do not enable hardware dirty at start (do not
> add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
> for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
> DBM bit for nearby 64PTs). Then when sync dirty log, we have known all
> PTs with hardware dirty enabled, so we do not need to scan all PTs."

I'm unconvinced of the value of such a change.

What you're proposing here is complicated and I fear not easily
maintainable. Keeping *two* sources of dirty state in sync seems likely
to fail (eventually) with some very unfortunate consequences.

The optimization of enabling DBM on neighboring PTEs presumes locality
in the guest access pattern and could incur unnecessary scans of the
stage-2 page table when guest accesses are sufficiently sparse.
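
To make the sparse case concrete, here is a rough toy model, not code
from the series. It assumes a write to page p enables DBM on the aligned
64-page block containing p (the "nearby 64PTs" of the quoted text; the
series may pick neighbors differently), and counts how many PTEs a sync
would have to visit per page actually written. The function names are
made up for illustration.

```python
# Toy model, not kernel code: count how many PTEs a dirty-log sync must
# visit when each written page also enables DBM on a window of neighbors.
WINDOW = 64  # "nearby 64PTs" from the quoted RFC v1 text


def ptes_to_scan(written_pages, window=WINDOW):
    """Return the set of page indices that end up with DBM enabled.

    Assumption: a write to page p enables DBM on the aligned block of
    `window` pages containing p.
    """
    scanned = set()
    for page in written_pages:
        base = (page // window) * window
        scanned.update(range(base, base + window))
    return scanned


def scan_overhead(written_pages, window=WINDOW):
    """PTEs scanned per page the guest actually wrote."""
    written = set(written_pages)
    return len(ptes_to_scan(written, window)) / len(written)


# Dense pattern: 64 consecutive pages, all inside one window.
dense = range(0, 64)
# Sparse pattern: 64 pages, one every 1024 pages.
sparse = range(0, 64 * 1024, 1024)

print(scan_overhead(dense))   # 1.0
print(scan_overhead(sparse))  # 64.0
```

In this model the dense pattern scans only pages that were written,
while the sparse pattern scans 64 PTEs for every dirty page.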

> Tests with dirty_log_perf_test with anonymous THP pages shows significant
> improvement in "dirty memory time" as expected but with a hit on
> "get dirty time" .
> 
> ./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp
> 
> +---------------------------+----------------+------------------+
> |                           |   6.5-rc5      | 6.5-rc5 + series |
> |                           |     (s)        |       (s)        |
> +---------------------------+----------------+------------------+
> |    dirty memory time      |    4.22        |          0.41    |
> |    get dirty log time     |    0.00047     |          3.25    |
> |    clear dirty log time   |    0.48        |          0.98    |
> +---------------------------------------------------------------+

The vCPU:memory ratio you're testing doesn't seem representative of what
a typical cloud provider would be configuring, and the dirty log
collection is going to scale linearly with the size of guest memory.
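
As a back-of-the-envelope sketch of the linear scaling, assuming 4 KiB
base pages and one dirty bit per page (the guest sizes below are
arbitrary examples, not measurements):

```python
# Back-of-the-envelope sketch: one dirty bit per guest page.
PAGE_SIZE = 4096  # assumption: 4 KiB base pages
GIB = 1 << 30


def guest_pages(guest_mem_bytes, page_size=PAGE_SIZE):
    """Number of base pages, i.e. bitmap bits, needed to cover the guest."""
    return guest_mem_bytes // page_size


def dirty_bitmap_bytes(guest_mem_bytes, page_size=PAGE_SIZE):
    """Bytes needed for a bitmap with one bit per guest page."""
    return (guest_pages(guest_mem_bytes, page_size) + 7) // 8


for mem_gib in (1, 64, 512):  # arbitrary example sizes
    mem = mem_gib * GIB
    print(mem_gib, guest_pages(mem), dirty_bitmap_bytes(mem))
```

Going from 1 GiB to 512 GiB multiplies both the page count and the
bitmap size by 512, so any per-page work in the collection path grows by
the same factor.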

Slow dirty log collection is going to matter a lot for VM blackout,
which from experience tends to be the most sensitive period of live
migration for guest workloads.
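
For reference, the arithmetic on the quoted dirty_log_perf_test figures
can be worked through as below. The numbers are copied from the table
above; summing the three phases is only a crude view, and the helper
name is made up for illustration.

```python
# Figures copied from the quoted dirty_log_perf_test table, in seconds.
baseline = {"dirty memory": 4.22, "get dirty log": 0.00047, "clear dirty log": 0.48}
series = {"dirty memory": 0.41, "get dirty log": 3.25, "clear dirty log": 0.98}


def slowdown(phase):
    """Ratio of the time with the series applied to the baseline time."""
    return series[phase] / baseline[phase]


print(round(slowdown("get dirty log")))   # 6915
print(round(sum(baseline.values()), 2))   # 4.7
print(round(sum(series.values()), 2))     # 4.64
```

The sum of the three phases barely changes (about 4.70 s vs 4.64 s);
the cost mostly moves out of the vCPU-side "dirty memory" phase and into
log collection.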

At least in our testing, the split GET/CLEAR dirty log ioctls
dramatically improved the performance of a write-protection based dirty
tracking scheme, as the false positive rate for dirtied pages is
significantly reduced. FWIW, this is what we use for doing LM on arm64 as
opposed to the D-bit implementation that we use on x86.
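
A toy model of where those false positives come from, under simplifying
assumptions: userspace fetches the log at t=0 and then sends one dirty
page per time unit. With the combined ioctl, every reported page is
re-protected when the log is fetched; with split GET/CLEAR, a page is
re-protected just before it is sent. The real KVM_CLEAR_DIRTY_LOG works
on 64-page-aligned ranges rather than single pages, and the function
below is hypothetical, not the KVM implementation.

```python
# Toy model of dirty-log false positives; not the KVM implementation.
# One round: userspace fetches the log at t=0, then sends dirty[i] at t=i+1.


def next_round_dirty(dirty, writes, split_clear):
    """Return the pages reported dirty in the following round.

    dirty:       pages in this round's log, in the order they are sent
    writes:      (time, page) guest writes that happen during the round
    split_clear: True  -> each page is re-protected just before it is sent
                 False -> every page is re-protected at t=0, when the log
                          is fetched
    """
    send_time = {page: i + 1 for i, page in enumerate(dirty)}
    redirtied = set()
    for t, page in writes:
        # Pages outside this round's log are clean and stay write-protected,
        # so any write to them is logged in both modes.
        protected_since = send_time.get(page, 0) if split_clear else 0
        if t > protected_since:
            redirtied.add(page)
    return redirtied


dirty = [10, 11, 12, 13]
writes = [(0.5, 13), (1.5, 12), (2.5, 10), (3.5, 99)]

combined = next_round_dirty(dirty, writes, split_clear=False)
split = next_round_dirty(dirty, writes, split_clear=True)

# Pages 12 and 13 were written before they were sent, so the copy already
# carries the new contents; reporting them again is a false positive.
print(sorted(combined))          # [10, 12, 13, 99]
print(sorted(split))             # [10, 99]
print(sorted(combined - split))  # [12, 13]
```

The longer the gap between fetching the log and sending a page, the more
writes fall into that window, which is why delaying re-protection until
just before the send helps.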
       
> In order to get some idea on actual live migration performance,
> I created a VM (96vCPUs, 1GB), ran a redis-benchmark test and
> while the test was in progress initiated live migration(local).
> 
> redis-benchmark -t set -c 900 -n 5000000 --threads 96
> 
> Average of 5 runs shows that benchmark finishes ~10% faster with
> a ~8% increase in "total time" for migration.
> 
> +---------------------------+----------------+------------------+
> |                           |   6.5-rc5      | 6.5-rc5 + series |
> |                           |     (s)        |    (s)           |
> +---------------------------+----------------+------------------+
> | [redis]5000000 requests in|    79.428      |      71.49       |
> | [info migrate]total time  |    8438        |      9097        |
> +---------------------------------------------------------------+

Faster pre-copy performance would help the benchmark complete faster,
but the goal for a live migration should be to minimize the lost
computation for the entire operation. You'd need to test with a
continuous workload rather than one with a finite amount of work.
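
As a quick check of the quoted percentages, plus a sketch of what a
"lost computation" measure could look like for a continuous workload.
The `lost_work` helper and its sample numbers are made up purely for
illustration; only the redis and migration figures come from the table
above.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the quoted redis-benchmark / migration table.
print(round(pct_change(79.428, 71.49), 1))  # -10.0
print(round(pct_change(8438, 9097), 1))     # 7.8


def lost_work(baseline_rate, samples):
    """Work lost relative to an unmigrated guest.

    samples: (duration, observed_rate) pairs covering the whole
    migration, blackout included. Hypothetical helper, not an existing
    tool.
    """
    return sum(max(0, baseline_rate - rate) * dur for dur, rate in samples)


# Made-up example: 10 s at 90% throughput, a 2 s blackout, then recovery.
print(lost_work(100, [(10, 90), (2, 0), (5, 100)]))  # 300
```

A fixed-size benchmark only reports when the work finished, which hides
how much throughput was lost during blackout; integrating the shortfall
over the whole migration captures it.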

Also, do you know what live migration scheme you're using here?

-- 
Thanks,
Oliver
