Date: Thu, 3 Dec 2020 06:21:47 -0500
From: "Michael S. Tsirkin"
To: Daniel P. Berrangé
Subject: Re: [PATCH v2 01/27] migration: Network Failover can't work with a paused guest
Message-ID: <20201203061907-mutt-send-email-mst@kernel.org>
References: <20201202050918-mutt-send-email-mst@kernel.org>
 <20201202102718.GA2360260@redhat.com>
 <20201202053111-mutt-send-email-mst@kernel.org>
 <20201202053219-mutt-send-email-mst@kernel.org>
 <87mtywlbvq.fsf@secure.mitica>
 <20201202105515.GD2360260@redhat.com>
 <20201202061641-mutt-send-email-mst@kernel.org>
 <20201202112639.GE2360260@redhat.com>
 <20201202063656-mutt-send-email-mst@kernel.org>
 <20201202120121.GF2360260@redhat.com>
In-Reply-To: <20201202120121.GF2360260@redhat.com>
Cc: Eduardo Habkost, Juan Quintela, Jason Wang, "Dr.
 David Alan Gilbert", qemu-devel@nongnu.org, Paolo Bonzini

On Wed, Dec 02, 2020 at 12:01:21PM +0000, Daniel P. Berrangé wrote:
> On Wed, Dec 02, 2020 at 06:37:46AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Dec 02, 2020 at 11:26:39AM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Dec 02, 2020 at 06:19:29AM -0500, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 02, 2020 at 10:55:15AM +0000, Daniel P. Berrangé wrote:
> > > > > On Wed, Dec 02, 2020 at 11:51:05AM +0100, Juan Quintela wrote:
> > > > > > "Michael S. Tsirkin" wrote:
> > > > > > > On Wed, Dec 02, 2020 at 05:31:53AM -0500, Michael S. Tsirkin wrote:
> > > > > > >> On Wed, Dec 02, 2020 at 10:27:18AM +0000, Daniel P. Berrangé wrote:
> > > > > > >> > On Wed, Dec 02, 2020 at 05:13:18AM -0500, Michael S. Tsirkin wrote:
> > > > > > >> > > On Wed, Nov 18, 2020 at 09:37:22AM +0100, Juan Quintela wrote:
> > > > > > >> > > > If we have a paused guest, it can't unplug the network VF device, so
> > > > > > >> > > > we wait there forever. Just change the code to give one error on that
> > > > > > >> > > > case.
> > > > > > >> > > >
> > > > > > >> > > > Signed-off-by: Juan Quintela
> > > > > > >> > >
> > > > > > >> > > It's certainly possible but it's management that created
> > > > > > >> > > this situation after all - why do we bother to enforce
> > > > > > >> > > a policy? It is possible that management will unpause immediately
> > > > > > >> > > afterwards and everything will proceed smoothly.
> > > > > > >> > >
> > > > > > >> > > Yes migration will not happen until guest is
> > > > > > >> > > unpaused but the same is true of e.g. a guest that is stuck
> > > > > > >> > > because of a bug.
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > That's pretty different behaviour from how migration normally handles
> > > > > > >> > a paused guest, which is that it is guaranteed to complete the migration
> > > > > > >> > in as short a time as network bandwidth allows.
> > > > > > >> >
> > > > > > >> > Just ignoring the situation I think will lead to surprise apps / admins,
> > > > > > >> > because the person/entity invoking the migration is not likely to have
> > > > > > >> > checked whether this particular guest uses net failover or not before
> > > > > > >> > invoking - they'll just be expecting a paused migration to run fast and
> > > > > > >> > be guaranteed to complete.
> > > > > > >> >
> > > > > > >> > Regards,
> > > > > > >> > Daniel
> > > > > > >>
> > > > > > >> Okay I guess. But then shouldn't we handle the reverse situation too:
> > > > > > >> pausing guest after migration started but before device was
> > > > > > >> unplugged?
> > > > > > >>
> > > > > > >
> > > > > > > Thinking of which, I have no idea how we'd handle it - fail
> > > > > > > pausing guest until migration is cancelled?
> > > > > > >
> > > > > > > All this seems heavy handed to me ...
> > > > > >
> > > > > > This is the minimal fix that I can think of.
> > > > > >
> > > > > > Further solution would be:
> > > > > > - Add a new migration parameter: migrate-paused
> > > > > > - change libvirt to use the new parameter if it exists
> > > > > > - in qemu, when we do start migration (but after we wait for the unplug
> > > > > >   device) pause the guest before starting migration and resume it after
> > > > > >   migration finishes.
> > > > >
> > > > > It would also have to handle issuing of paused after migration has
> > > > > been started - delay the pause request until the unplug is complete
> > > > > is one answer.
> > > >
> > > > Hmm my worry would be that pausing is one way to give cpu
> > > > resources back to host. It's problematic if guest can delay
> > > > that indefinitely.
> > >
> > > hmm, yes, that is awkward. Perhaps we should just report an explicit
> > > error then.
> >
> > Report an error in response to which command? Do you mean
> > fail migration?
>
> If mgmt attempts to pause an existing migration that hasn't finished
> the PCI unplug stage, then fail the pause request.

Pause guest not migration ...
Might be tricky ...

Let me ask this, why not just produce a warning
that migration won't finish until guest actually runs?
User will then know and unpause the guest when
he wants migration to succeed ...

For example, user can restrict the amount of cpu
using cgroups to a level where almost no progress is made.
QEMU can't detect this ....

> >
> > > In normal cases this won't happen, as unplug will have
> > > easily completed before the mgmt app pauses the running migration.
> > > In broken/malicious guest cases, this at least gives mgmt a heads up
> > > that something is wrong and they might then decide to cancel the
> > > migration.
>
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
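
For illustration only, the two behaviours debated in this thread (report
an error versus just warn) can be compared with a small toy model. This
is a sketch under stated assumptions, not QEMU code: Guest, Policy,
MigrationError, start_migration and request_pause are all hypothetical
names invented for this example, and the real unplug handshake is
reduced to a single flag.

```python
"""Toy model of the error-vs-warning policies discussed in the thread.

Nothing here is QEMU code; every name below is hypothetical.
"""
from dataclasses import dataclass, field
from enum import Enum


class Policy(Enum):
    ERROR = "error"  # refuse the request that would stall the migration
    WARN = "warn"    # accept it, but say why no progress will be made


class MigrationError(Exception):
    """Raised when a request is refused under Policy.ERROR."""


@dataclass
class Guest:
    running: bool = True
    failover_vf: bool = True      # assumption: guest uses a net failover VF
    unplug_pending: bool = False  # migration is waiting for the VF unplug
    warnings: list = field(default_factory=list)


def start_migration(guest: Guest, policy: Policy) -> None:
    """Start migrating; a failover VF must first be unplugged by the guest."""
    if not guest.failover_vf:
        return
    if not guest.running:
        msg = "guest is paused: VF unplug cannot complete until it runs"
        if policy is Policy.ERROR:
            raise MigrationError(msg)
        guest.warnings.append(msg)
    guest.unplug_pending = True


def request_pause(guest: Guest, policy: Policy) -> None:
    """Pause the guest, possibly while the VF unplug is still pending."""
    if guest.unplug_pending:
        msg = "pause requested before VF unplug finished"
        if policy is Policy.ERROR:
            raise MigrationError(msg)
        guest.warnings.append(msg)
    guest.running = False
```

Under Policy.WARN the pause always takes effect, so the host gets its
cpu back and the user is told the migration is stalled; under
Policy.ERROR either the migration or the pause is refused outright.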