From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 10 Nov 2020 13:26:06 -0800
From: Neo Jia <cjia@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Message-ID: <20201110212606.GA22394@nvidia.com>
References: <160494787833.1473.10514376876696596117.stgit@gimli.home>
 <20201109194417.GR3024@work-vm>
 <20201109132950.6c2dfe02@w520.home>
 <20201110091037.GA3108@work-vm>
 <898ba98f-9967-f3b3-737c-2d18b0281b51@nvidia.com>
 <20201110082050.10ca74d4@w520.home>
In-Reply-To: <20201110082050.10ca74d4@w520.home>
Cc: Malcolm Crossley, Juan Quintela, Cornelia Huck,
 "Dr. David Alan Gilbert", qemu-devel@nongnu.org, Kirti Wankhede,
 Dheeraj Nigam, Philippe Mathieu-Daudé

On Tue, Nov 10, 2020 at 08:20:50AM -0700, Alex Williamson wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, 10 Nov 2020 19:46:20 +0530
> Kirti Wankhede wrote:
> > On 11/10/2020 2:40 PM, Dr. David Alan Gilbert wrote:
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > >> On Mon, 9 Nov 2020 19:44:17 +0000
> > >> "Dr. David Alan Gilbert" wrote:
> > >>
> > >>> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > >>>> Per the proposed documentation for vfio device migration:
> > >>>>
> > >>>> Dirty pages are tracked when device is in stop-and-copy phase
> > >>>> because if pages are marked dirty during pre-copy phase and
> > >>>> content is transfered from source to destination, there is no
> > >>>> way to know newly dirtied pages from the point they were copied
> > >>>> earlier until device stops. To avoid repeated copy of same
> > >>>> content, pinned pages are marked dirty only during
> > >>>> stop-and-copy phase.
> > >>>>
> > >>>> Essentially, since we don't have hardware dirty page tracking for
> > >>>> assigned devices at this point, we consider any page that is pinned
> > >>>> by an mdev vendor driver or pinned and mapped through the IOMMU to
> > >>>> be perpetually dirty. In the worst case, this may result in all of
> > >>>> guest memory being considered dirty during every iteration of live
> > >>>> migration. The current vfio implementation of migration has chosen
> > >>>> to mask device dirtied pages until the final stages of migration in
> > >>>> order to avoid this worst case scenario.
> > >>>>
> > >>>> Allowing the device to implement a policy decision to prioritize
> > >>>> reduced migration data like this jeopardizes QEMU's overall ability
> > >>>> to implement any degree of service level guarantees during migration.
> > >>>> For example, any estimates towards achieving acceptable downtime
> > >>>> margins cannot be trusted when such a device is present. The vfio
> > >>>> device should participate in dirty page tracking to the best of its
> > >>>> ability throughout migration, even if that means the dirty footprint
> > >>>> of the device impedes migration progress, allowing both QEMU and
> > >>>> higher level management tools to decide whether to continue the
> > >>>> migration or abort due to failure to achieve the desired behavior.
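
As a purely illustrative aside (this is not QEMU code; every name and
number below is made up), the effect of the policy quoted above can be
shown with a small self-contained C sketch: if the pinned footprint is
only reported once the device reaches stop-and-copy, the pre-copy dirty
estimate, and any downtime prediction derived from it, never sees those
pages:

  /* Sketch only: models when a perpetually-dirty pinned footprint is
   * reported, not how QEMU or the kernel actually track it. */
  #include <stdbool.h>
  #include <stdio.h>

  #define PAGE_SIZE       4096ULL
  #define PINNED_PAGES    (1ULL << 20)  /* 4 GiB pinned by the device */
  #define CPU_DIRTY_PAGES (64ULL << 10) /* 256 MiB dirtied by vCPUs per round */

  /* Pages reported dirty in one sync round under a given policy. */
  static unsigned long long dirty_pages(bool stop_and_copy, bool mask_pinned)
  {
      unsigned long long dirty = CPU_DIRTY_PAGES;

      /* Pinned pages are perpetually dirty; the policy only decides
       * when QEMU gets to see that. */
      if (stop_and_copy || !mask_pinned) {
          dirty += PINNED_PAGES;
      }
      return dirty;
  }

  int main(void)
  {
      for (int mask = 0; mask <= 1; mask++) {
          unsigned long long precopy = dirty_pages(false, mask);
          unsigned long long final = dirty_pages(true, mask);

          printf("mask_pinned=%d: pre-copy round reports %llu MiB dirty, "
                 "stop-and-copy still has %llu MiB to send\n",
                 mask, precopy * PAGE_SIZE >> 20, final * PAGE_SIZE >> 20);
      }
      return 0;
  }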
> > >>>
> > >>> I don't feel particularly badly about the decision to squash it in
> > >>> during the stop-and-copy phase; for devices where the pinned memory
> > >>> is large, I don't think doing it during the main phase makes much sense;
> > >>> especially if you then have to deal with tracking changes in pinning.
> > >>
> > >>
> > >> AFAIK the kernel support for tracking changes in page pinning already
> > >> exists, this is largely the vfio device in QEMU that decides when to
> > >> start exposing the device dirty footprint to QEMU. I'm a bit surprised
> > >> by this answer though, we don't really know what the device memory
> > >> footprint is. It might be large, it might be nothing, but by not
> > >> participating in dirty page tracking until the VM is stopped, we can't
> > >> know what the footprint is and how it will affect downtime. Is it
> > >> really the place of a QEMU device driver to impose this sort of policy?
> > >
> > > If it could actually track changes then I'd agree we shouldn't impose
> > > any policy; but if it's just marking the whole area as dirty we're going
> > > to need a bodge somewhere; this bodge doesn't look any worse than the
> > > others to me.
> > >
> > >>
> > >>> Having said that, I agree with marking it as experimental, because
> > >>> I'm dubious how useful it will be for the same reason, I worry
> > >>> about whether the downtime will be so large to make it pointless.
> > >>
> >
> > Not all device state is large, for example NIC might only report
> > currently mapped RX buffers which usually not more than a 1GB and could
> > be as low as 10's of MB. GPU might or might not have large data, that
> > depends on its use cases.
>
> Right, it's only if we have a vendor driver that doesn't pin any memory
> when dirty tracking is enabled and we're running without a viommu that
> we would expect all of guest memory to be continuously dirty.
>
> > >> TBH I think that's the wrong reason to mark it experimental. There's
> > >> clearly demand for vfio device migration and even if the practical use
> > >> cases are initially small, they will expand over time and hardware will
> > >> get better. My objection is that the current behavior masks the
> > >> hardware and device limitations, leading to unrealistic expectations.
> > >> If the user expects minimal downtime, configures convergence to account
> > >> for that, QEMU thinks it can achieve it, and then the device marks
> > >> everything dirty, that's not supportable.
> > >
> > > Yes, agreed.
> >
> > Yes, there is demand for vfio device migration and many devices owners
> > started scoping and development for migration support.
> > Instead of making whole migration support as experimental, we can have
> > opt-in option to decide to mark sys mem pages dirty during iterative
> > phase (pre-copy phase) of migration.
>
> Per my previous suggestion, I'd think an opt-out would be more
> appropriate, ie. implementing pre-copy dirty page tracking by default.

I think this will be a better approach without marking this feature as
experimental (a sketch of what such an opt-out knob could look like is
below, after the quoted text).

Thanks,
Neo

> >
> > >> OTOH if the vfio device
> > >> participates in dirty tracking through pre-copy, then the practical use
> > >> cases will find themselves as migrations will either be aborted because
> > >> downtime tolerances cannot be achieved or downtimes will be configured
> > >> to match reality. Thanks,
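
To make the opt-out suggestion concrete: the knob could simply be a
per-device boolean that defaults to on, along the lines of (the property
name and PCI address below are placeholders, not existing QEMU options):

  -device vfio-pci,host=0000:65:00.0,x-pre-copy-dirty-page-tracking=off

i.e. pre-copy dirty page tracking would be enabled unless the user
explicitly turns it off for a device they know pins large amounts of
guest memory.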
> > >
> > > Without a way to prioritise the unpinned memory during that period,
> > > we're going to be repeatedly sending the pinned memory which is going to
> > > lead to a much larger bandwidth usage that required; so that's going in
> > > completely the wrong direction and also wrong from the point of view of
> > > the user.
> >
> Who decides which is the wrong direction for the user? If the user
> wants minimal bandwidth regardless of downtime, can't they create a
> procedure to pause the VM, do the migration, then resume? Are there
> already migration tunables to effectively achieve this? If a user
> attempts to do a "live" migration, isn't our priority then shifted to
> managing the downtime constraints over the bandwidth? IOW the policy
> decision is implied by the user actions and configuration of the
> migration, I don't think that at the device level we should be guessing
> which feature to prioritize, just like a vCPU doesn't to stop marking
> dirty pages during pre-copy because it's touching too much memory.
> Higher level policies and configurations should determine inflection
> points... imo. Thanks,
>
> Alex
>
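
For reference, the "pause, migrate, resume" procedure and the downtime /
bandwidth tunables mentioned above are already reachable from the monitor;
roughly, via QMP (host name, port and the numeric values are placeholders,
and exact parameter availability depends on the QEMU version):

  (on the source)
  { "execute": "migrate-set-parameters",
    "arguments": { "downtime-limit": 300, "max-bandwidth": 1073741824 } }
  { "execute": "stop" }
  { "execute": "migrate", "arguments": { "uri": "tcp:dst-host:4444" } }

  (on the destination, once migration completes)
  { "execute": "cont" }

Here downtime-limit is in milliseconds and max-bandwidth in bytes per
second; stopping the VM before migrating turns the whole transfer into a
single stop-and-copy pass, trading downtime for minimal repeated page
transfers.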