From: Markus Armbruster
To: David Gibson
Subject: Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?
References: <87tukvaejt.fsf@dusky.pond.sub.org> <87lf62ydow.fsf@dusky.pond.sub.org>
Date: Mon, 19 Jul 2021 12:41:09 +0200
In-Reply-To: (David Gibson's message of "Mon, 19 Jul 2021 17:20:58 +1000")
Message-ID: <875yx6oabe.fsf@dusky.pond.sub.org>
Cc: Aravinda Prasad, Ganesh Goudar, qemu-devel@nongnu.org

David Gibson writes:

> On Mon, Jul 19, 2021 at 09:18:07AM +0200, Markus Armbruster wrote:
>> David Gibson writes:
>>
>> > On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
>> >> Commit 2500fb423a "migration: Include migration support for machine
>> >> check handling" adds this:
>> >>
>> >>     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);
>> >>     if (ret == -EBUSY) {
>> >>         /*
>> >>          * We don't want to abort so we let the migration to continue.
>> >>          * In a rare case, the machine check handler will run on the target.
>> >>          * Though this is not preferable, it is better than aborting
>> >>          * the migration or killing the VM.
>> >>          */
>> >>         warn_report("Received a fwnmi while migration was in progress");
>> >>     }
>> >>
>> >> migrate_add_blocker() can fail in two ways:
>> >>
>> >> 1. -EBUSY: migration is already in progress
>> >>
>> >>    Ignoring this one is clearly intentional. The comment explains why.
>> >>    I'm taking it at face value (I'm a spapr ignoramus).
>> >
>> > Right. The argument isn't really about papr particularly, except
>> > insofar as understanding what fwnmi is. fwnmi (FirmWare assisted NMI)
>> > is a reporting mechanism for certain low-level hardware failures
>> > (think memory ECC or cpu level faults, IIRC). If we migrate between
>> > detecting and reporting the error, then the particulars we report will
>> > be mostly meaningless since they relate to hardware we're no longer
>> > running on. Hence the migration blocker.
>> >
>> > However, migrating away from a (non-fatal) fwnmi error is a pretty
>> > reasonable response, so we don't want to actually fail a migration if
>> > it's already in progress.
>> >
>> >> Aside: I doubt
>> >> the warning is going to help users.
>> >
>> > You're probably right, but it's not very clear how to do better. It
>> > might possibly help someone in tech support explain why the reported
>> > fwnmi doesn't seem to match the hardware the guest is (now) running
>> > on.
>>
>> Perhaps pointing to the actual problem could help: the FWNMI's
>> information is mostly meaningless.
>
> Sorry, I don't follow what you're suggesting.

We warn

    warning: Received a fwnmi while migration was in progress

when we fail to block migration because it's already in progress. But
what does this mean? Perhaps warn like this:

    warning: FWNMI while migration is in progress
    The guest's report for this may be less than useful.

My phrasing may well be off, but I hope you get the idea.

Note that we keep quiet when we fail to block migration due to
-only-migratable.
I agree with that. The failure makes a difference only when migration
gets triggered in a narrow time window, which should be quite rare.

Would be nice to warn when migration does get triggered in that time
window, though. Not sure it's worth the trouble, in particular if we'd
have to create infrastructure first.

>
>>
>> >> 2. -EACCES: we're running with -only-migratable
>> >>
>> >>    Why may we ignore -only-migratable here?
>> >
>> > Short answer: because I didn't think about that case. Long answer:
>> > I think we probably should ignore it anyway. As above, receiving a
>> > fwnmi doesn't really prevent migration, it just means that if you're
>> > unlucky it can report stale information. Since migrating away from a
>> > possibly-dubious host would be a reasonable response to a non-fatal
>> > fwnmi, I don't think we want to simply prohibit fwnmi entirely with
>> > -only-migratable.
>>
>> I think the comment text and placement could be improved to make clear
>> ignoring this failure is intentional, too. How do you like the
>> following?
>
> That's fair..
>
>>
>> diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
>> index a8f2cc6bdc..54d8e856d3 100644
>> --- a/hw/ppc/spapr_events.c
>> +++ b/hw/ppc/spapr_events.c
>> @@ -911,16 +911,14 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
>>          }
>>      }
>>
>> +    /*
>> +     * Try to block migration while FWNMI is being handled, so the
>> +     * machine check handler runs where the information passed to it
>> +     * actually makes sense. This won't actually block migration,
>> +     * only delay it slightly. If the attempt fails, carry on.
>> +     */
>>      ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, NULL);
>>      if (ret == -EBUSY) {
>> -        /*
>> -         * We don't want to abort so we let the migration to continue.
>> -         * In a rare case, the machine check handler will run on the target.
>> -         * Though this is not preferable, it is better than aborting
>> -         * the migration or killing the VM. It is okay to call
>> -         * migrate_del_blocker on a blocker that was not added (which the
>> -         * nmi-interlock handler would do when it's called after this).
>> -         */
>>          warn_report("Received a fwnmi while migration was in progress");
>>      }
>
> LGTM.

Thanks, I'll post this.
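
For illustration only, here is a minimal, self-contained C sketch of the
behaviour discussed above: warn with the reworded message when blocking
fails with -EBUSY, and stay quiet when it fails with -EACCES. This is not
QEMU code. mock_migrate_add_blocker(), enum mock_migration_state and
fwnmi_blocker_warning() are hypothetical names made up for this sketch.
Only the return convention (0, -EBUSY, -EACCES) comes from this thread,
and the message text is the tentative phrasing suggested above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the migration state; not a QEMU type. */
enum mock_migration_state {
    MOCK_MIGRATION_IDLE,
    MOCK_MIGRATION_IN_PROGRESS,
    MOCK_ONLY_MIGRATABLE,
};

/*
 * Hypothetical mock that mimics the return convention described in the
 * thread: 0 on success, -EBUSY if migration is already in progress,
 * -EACCES when running with -only-migratable.
 */
static int mock_migrate_add_blocker(enum mock_migration_state state)
{
    switch (state) {
    case MOCK_MIGRATION_IN_PROGRESS:
        return -EBUSY;
    case MOCK_ONLY_MIGRATABLE:
        return -EACCES;
    default:
        return 0;
    }
}

/*
 * Map the result of the blocker attempt to a warning text.
 * Returns NULL when nothing should be printed: on success, and on
 * -EACCES, where the thread agrees that keeping quiet is fine.
 */
static const char *fwnmi_blocker_warning(int ret)
{
    /* Contract: ret is 0 or a negative errno value. */
    assert(ret <= 0);

    if (ret == -EBUSY) {
        return "FWNMI while migration is in progress\n"
               "The guest's report for this may be less than useful.";
    }
    return NULL;
}
```

A caller would pass the result of the blocker attempt to
fwnmi_blocker_warning() and print the returned string, if any. In QEMU
that would presumably go through warn_report(), as in the quoted code.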