Date: Tue, 11 Jan 2022 11:54:55 +0000
From: Marc Zyngier <maz@kernel.org>
To: Jing Zhang <jingzhangos@google.com>
Cc: KVM <kvm@vger.kernel.org>, KVMARM <kvmarm@lists.cs.columbia.edu>,
	Will Deacon <will@kernel.org>, Paolo Bonzini <pbonzini@redhat.com>,
	David Matlack <dmatlack@google.com>, Oliver Upton <oupton@google.com>,
	Reiji Watanabe <reijiw@google.com>
Subject: Re: [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty logging
Message-ID: <877db6trlc.wl-maz@kernel.org>
In-Reply-To: <20220110210441.2074798-1-jingzhangos@google.com>
References: <20220110210441.2074798-1-jingzhangos@google.com>

On Mon, 10 Jan 2022 21:04:38 +0000,
Jing Zhang <jingzhangos@google.com> wrote:
> 
> This series reduces the performance degradation of guest workloads during
> dirty logging on ARM64. A fast path is added to handle permission relaxation
> during dirty logging, and the MMU lock is replaced with a rwlock, so that all
> permission relaxations on leaf PTEs can be performed under the read lock. This
> greatly reduces MMU lock contention during dirty logging. With this
> solution, the source guest workload performance degradation can be improved
> by more than 60%.
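
To make sure I understand the locking scheme: the mechanical part of the
change is turning mmu_lock into a rwlock, along the lines of the sketch
below (my own sketch with made-up helper names, not the code from the
series; x86 already has a similar arrangement for its TDP MMU behind
KVM_HAVE_MMU_RWLOCK):

	/*
	 * Anything that can change the *structure* of the stage-2
	 * page tables keeps exclusive access by taking the lock for
	 * writing, exactly as the old spinlock did.
	 */
	static void stage2_structural_update(struct kvm *kvm)
	{
		write_lock(&kvm->mmu_lock);	/* was: spin_lock() */
		/* ... unmap/split/create stage-2 mappings ... */
		write_unlock(&kvm->mmu_lock);	/* was: spin_unlock() */
	}

	/*
	 * The fast path only flips permission bits on an existing
	 * leaf PTE, so concurrent vCPUs can share the lock.
	 */
	static void stage2_relax_leaf_perms(struct kvm *kvm)
	{
		read_lock(&kvm->mmu_lock);
		/* ... make the faulting leaf PTE writable again ... */
		read_unlock(&kvm->mmu_lock);
	}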
> 
> Problem:
> * A Google internal live migration test shows that the source guest workload
>   performance has >99% degradation for about 105 seconds, >50% degradation
>   for about 112 seconds, and >10% degradation for about 112 seconds on ARM64.
>   This shows that most of the time, the guest workload degradation is above
>   99%, which obviously needs some improvement compared to the test result
>   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> * VM spec: #vCPU: 48, #Mem/vCPU: 4GB

What are the host and guest page sizes?

> 
> Analysis:
> * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
>   get the number of contentions on the MMU lock and the "dirty memory time"
>   for various VM specs, using the test command:
>   ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]

How is this test representative of the internal live migration test
you mention above? '-m 2' indicates a mode that varies depending on
the HW and revision of the test (I just added a bunch of supported
modes). Which one is it?

>   Below are the results:
>   +-------+------------------------+-----------------------+
>   | #vCPU | dirty memory time (ms) | number of contentions |
>   +-------+------------------------+-----------------------+
>   |     1 |                    926 |                     0 |
>   +-------+------------------------+-----------------------+
>   |     2 |                   1189 |               4732558 |
>   +-------+------------------------+-----------------------+
>   |     4 |                   2503 |              11527185 |
>   +-------+------------------------+-----------------------+
>   |     8 |                   5069 |              24881677 |
>   +-------+------------------------+-----------------------+
>   |    16 |                  10340 |              50347956 |
>   +-------+------------------------+-----------------------+
>   |    32 |                  20351 |             100605720 |
>   +-------+------------------------+-----------------------+
>   |    64 |                  40994 |             201442478 |
>   +-------+------------------------+-----------------------+
> 
> * From the test results above, the "dirty memory time" and the number of
>   MMU lock contentions scale with the number of vCPUs, which means that all
>   the dirty memory operations from all vCPU threads have been serialized by
>   the MMU lock. Further analysis also shows that the permission relaxation
>   during dirty logging is where the vCPU threads get serialized.
> 
> Solution:
> * On ARM64, there is no mechanism such as PML (Page Modification Logging),
>   and the dirty-bit solution for dirty logging is much more complicated than
>   the write-protection solution. The most straightforward way to reduce the
>   guest performance degradation is to improve the concurrency of the
>   permission fault path during dirty logging.
> * In this series, only leaf PTE permission relaxation for dirty logging is
>   done under the read lock; everything else is done under the write lock.
>   Below are the results with this solution applied:
>   +-------+------------------------+
>   | #vCPU | dirty memory time (ms) |
>   +-------+------------------------+
>   |     1 |                    803 |
>   +-------+------------------------+
>   |     2 |                    843 |
>   +-------+------------------------+
>   |     4 |                    942 |
>   +-------+------------------------+
>   |     8 |                   1458 |
>   +-------+------------------------+
>   |    16 |                   2853 |
>   +-------+------------------------+
>   |    32 |                   5886 |
>   +-------+------------------------+
>   |    64 |                  12190 |
>   +-------+------------------------+
>   The "dirty memory time" is reduced by more than 60% as the number of
>   vCPUs grows.
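
For reference, my mental model of the fast path is something like the
sketch below (the flow and the names around it are my guesses, not
lifted from the patches; kvm_pgtable_stage2_relax_perms() is the
existing stage-2 pgtable helper):

	/*
	 * Called once the fault handler has established that this is
	 * a stage-2 write permission fault taken while dirty logging
	 * is active, and that only the leaf PTE permissions need to
	 * change (no block splitting, no new mapping).
	 */
	static int fast_relax_perms(struct kvm *kvm, struct kvm_pgtable *pgt,
				    phys_addr_t fault_ipa)
	{
		enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R |
					     KVM_PGTABLE_PROT_W;
		int ret;

		/*
		 * Readers only exclude structural updates, so vCPUs
		 * taking permission faults on different pages no
		 * longer serialize on the MMU lock.
		 */
		read_lock(&kvm->mmu_lock);
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
		read_unlock(&kvm->mmu_lock);

		return ret;
	}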
How does that translate to the original problem statement with your
live migration test?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.