From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 13 Sep 2023 17:30:03 +0000
From: Oliver Upton
To: Shameer Kolothum
Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, maz@kernel.org, will@kernel.org, catalin.marinas@arm.com, james.morse@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com, zhukeqian1@huawei.com, jonathan.cameron@huawei.com, linuxarm@huawei.com
Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log
References: <20230825093528.1637-1-shameerali.kolothum.thodi@huawei.com>
In-Reply-To: <20230825093528.1637-1-shameerali.kolothum.thodi@huawei.com>
X-Mailing-List: kvmarm@lists.linux.dev

Hi Shameer,

On Fri, Aug 25, 2023 at 10:35:20AM +0100, Shameer Kolothum wrote:
> Hi,
>
> This is to revive the RFC series[1], which makes use of hardware dirty
> bit modifier(DBM) feature(FEAT_HAFDBS) for dirty page tracking, sent
> out by Zhu Keqian sometime back.
>
> One of the main drawbacks in using the hardware DBM feature for dirty
> page tracking is the additional overhead in scanning the PTEs for dirty
> pages[2]. Also there are no vCPU page faults when we set the DBM bit,
> which may result in higher convergence time during guest migration.
>
> This series tries to reduce these overheads by not setting the
> DBM for all the writeable pages during migration and instead uses a
> combined software(current page fault mechanism) and hardware approach
> (set DBM) for dirty page tracking.
>
> As noted in RFC v1[1],
> "The core idea is that we do not enable hardware dirty at start (do not
> add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
> for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
> DBM bit for nearby 64PTs).
> Then when sync dirty log, we have known all
> PTs with hardware dirty enabled, so we do not need to scan all PTs."

I'm unconvinced of the value of such a change.

What you're proposing here is complicated and I fear not easily
maintainable. Keeping the *two* sources of dirty state seems likely to
fail (eventually) with some very unfortunate consequences.

The optimization of enabling DBM on neighboring PTEs is presumptive of
the guest access pattern and could incur unnecessary scans of the
stage-2 page table w/ a sufficiently sparse guest access pattern.

> Tests with dirty_log_perf_test with anonymous THP pages shows significant
> improvement in "dirty memory time" as expected but with a hit on
> "get dirty time" .
>
> ./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp
>
> +---------------------------+----------------+------------------+
> |                           |   6.5-rc5      | 6.5-rc5 + series |
> |                           |     (s)        |       (s)        |
> +---------------------------+----------------+------------------+
> |    dirty memory time      |    4.22        |          0.41    |
> |    get dirty log time     |    0.00047     |          3.25    |
> |    clear dirty log time   |    0.48        |          0.98    |
> +---------------------------------------------------------------+

The vCPU:memory ratio you're testing doesn't seem representative of what
a typical cloud provider would be configuring, and the dirty log
collection is going to scale linearly with the size of guest memory.

Slow dirty log collection is going to matter a lot for VM blackout,
which from experience tends to be the most sensitive period of live
migration for guest workloads.

At least in our testing, the split GET/CLEAR dirty log ioctls
dramatically improved the performance of a write-protection based dirty
tracking scheme, as the false positive rate for dirtied pages is
significantly reduced.
FWIW, this is what we use for doing LM on arm64 as opposed to the D-bit
implementation that we use on x86.

> In order to get some idea on actual live migration performance,
> I created a VM (96vCPUs, 1GB), ran a redis-benchmark test and
> while the test was in progress initiated live migration(local).
>
> redis-benchmark -t set -c 900 -n 5000000 --threads 96
>
> Average of 5 runs shows that benchmark finishes ~10% faster with
> a ~8% increase in "total time" for migration.
>
> +---------------------------+----------------+------------------+
> |                           |   6.5-rc5      | 6.5-rc5 + series |
> |                           |     (s)        |    (s)           |
> +---------------------------+----------------+------------------+
> | [redis]5000000 requests in|    79.428      |      71.49       |
> | [info migrate]total time  |    8438        |      9097        |
> +---------------------------------------------------------------+

Faster pre-copy performance would help the benchmark complete faster,
but the goal for a live migration should be to minimize the lost
computation for the entire operation. You'd need to test with a
continuous workload rather than one with a finite amount of work.

Also, do you know what live migration scheme you're using here?
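To make the finite-vs-continuous workload point concrete, here is a rough
sketch of the accounting. Everything in it is an assumption made for
illustration: the `Phase` model, the helper names, and all of the numbers
are hypothetical and are not measurements from this thread. It models a
migration as phases during which the guest runs at some fraction of its
normal throughput, then compares how long a finite job takes against how
much computation is lost over the whole operation.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    duration_s: float     # wall-clock length of the phase
    relative_rate: float  # guest throughput relative to running unmigrated (0.0 to 1.0)


def lost_computation(phases):
    """Seconds of full-speed guest work lost over the whole migration."""
    return sum(p.duration_s * (1.0 - p.relative_rate) for p in phases)


def time_to_finish(phases, work_s):
    """Wall-clock time for a finite job needing work_s seconds of full-speed
    work. Once the listed phases end, the guest is assumed to run at full speed."""
    elapsed = 0.0
    remaining = work_s
    for p in phases:
        capacity = p.duration_s * p.relative_rate
        if p.relative_rate > 0 and capacity >= remaining:
            # The job completes inside this phase and never sees later phases.
            return elapsed + remaining / p.relative_rate
        remaining -= capacity
        elapsed += p.duration_s
    return elapsed + remaining


# Hypothetical numbers, not measurements from this thread.
# Scheme A: slower guest during pre-copy, short blackout.
# Scheme B: faster guest during pre-copy, longer blackout.
scheme_a = [Phase("pre-copy", 8.0, 0.85), Phase("blackout", 0.2, 0.0)]
scheme_b = [Phase("pre-copy", 8.5, 0.95), Phase("blackout", 1.5, 0.0)]

for label, phases in (("A", scheme_a), ("B", scheme_b)):
    print(label,
          "finite job:", round(time_to_finish(phases, 5.0), 3), "s,",
          "lost computation:", round(lost_computation(phases), 3), "s")
```

With these made-up inputs the finite job finishes sooner under scheme B
because it completes during pre-copy and never experiences the blackout,
even though scheme B loses more computation over the entire migration. A
continuous workload would pay for the blackout as well.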
-- 
Thanks,
Oliver