Date: Wed, 6 Jul 2022 09:40:21 -0600
From: Alex Williamson <alex.williamson@redhat.com>
To: Abhishek Sahu <abhsahu@nvidia.com>
Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
 Kevin Tian, "Rafael J. Wysocki", Max Gurtovoy, Bjorn Helgaas
Subject: Re: [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver
Message-ID: <20220706094021.39b0ef73.alex.williamson@redhat.com>
In-Reply-To: <20220701110814.7310-6-abhsahu@nvidia.com>
References: <20220701110814.7310-1-abhsahu@nvidia.com>
 <20220701110814.7310-6-abhsahu@nvidia.com>
Organization: Red Hat
X-Mailing-List: linux-pci@vger.kernel.org

On Fri, 1 Jul 2022 16:38:13 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Some devices (like NVIDIA VGA or 3D controllers) require driver
> involvement each time before going into D3cold. In the regular flow,
> the guest driver does all the required steps inside the guest OS, and
> then the hypervisor calls the IOCTL for D3cold entry. Now, if there is
> any activity on the host side (for example, the user runs lspci or
> dumps the config space through sysfs), then the runtime PM framework
> will resume the device first, perform the operation, and then suspend
> the device again. This second suspend happens without guest driver
> involvement. This patch adds support to prevent this second runtime
> suspend after such a wake-up. The prevention is based either on a
> predefined vendor/class ID list or on a flag
> (VFIO_PM_LOW_POWER_REENTERY_DISABLE) that the user can specify at
> entry time.
>
> The 'pm_runtime_reentry_allowed' flag tracks whether this re-entry is
> allowed. It is set at entry time.
>
> The 'pm_runtime_resumed' flag tracks whether there was any wake-up
> before the guest performs its own wake-up. If re-entry is not allowed,
> then during runtime resume the runtime PM usage count is incremented
> and this flag is set. The flag is checked during guest D3cold exit,
> and the runtime PM related handling is skipped if it is set.
>
> During guest low power exit, all vdev power-related flags are accessed
> under 'memory_lock' and the usage count is incremented. The resume is
> triggered after releasing the lock, since the runtime resume callback
> itself requires the lock. pm_runtime_get_noresume()/
> pm_runtime_resume() are used instead of pm_runtime_resume_and_get()
> to handle the following race:
>
> a. The guest triggers the low power exit.
> b. The guest thread takes the lock, clears the vdev-related flags,
>    and releases the lock.
> c. Before pm_runtime_resume_and_get() runs, the host lspci thread
>    gets scheduled and triggers the runtime resume.
> d. All the vdev-related flags are already cleared, so there is no
>    extra handling inside the runtime resume.
> e. Runtime PM puts the device back into the suspended state.
> f. The guest-triggered pm_runtime_resume_and_get() finally runs.
>
> So, at step (e), the suspend happens without guest driver
> involvement. By calling pm_runtime_get_noresume() before releasing
> 'memory_lock', the runtime PM framework cannot suspend the device,
> due to the incremented usage count.

Nak, this policy should be implemented in userspace.
Thanks,
Alex

> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 87 ++++++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h    |  2 +
>  2 files changed, 84 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 8c17ca41d156..1ddaaa6ccef5 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -191,6 +191,20 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
>  	return false;
>  }
>
> +static bool vfio_pci_low_power_reentry_allowed(struct pci_dev *pdev)
> +{
> +	/*
> +	 * The NVIDIA display class requires driver involvement for every
> +	 * D3cold entry. The audio and other classes can go into D3cold
> +	 * without driver involvement.
> +	 */
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> +		return false;
> +
> +	return true;
> +}
> +
>  static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> @@ -295,6 +309,27 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>  	if (vdev->pm_intx_masked)
>  		vfio_pci_intx_unmask(vdev);
>
> +	down_write(&vdev->memory_lock);
> +
> +	/*
> +	 * The runtime resume callback will be called for one of the following
> +	 * two cases:
> +	 *
> +	 * - If the user has explicitly called the IOCTL to move the device
> +	 *   out of the low power state or closed the device.
> +	 * - If there is device access on the host side.
> +	 *
> +	 * For the second case, check if re-entry to the low power state is
> +	 * allowed. If not, then increment the usage count so that the runtime
> +	 * PM framework won't suspend the device, and set the
> +	 * 'pm_runtime_resumed' flag.
> +	 */
> +	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
> +		pm_runtime_get_noresume(dev);
> +		vdev->pm_runtime_resumed = true;
> +	}
> +	up_write(&vdev->memory_lock);
> +
>  	return 0;
>  }
>  #endif /* CONFIG_PM */
> @@ -415,9 +450,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	 */
>  	down_write(&vdev->memory_lock);
>  	if (vdev->pm_runtime_engaged) {
> +		if (!vdev->pm_runtime_resumed) {
> +			pm_runtime_get_noresume(&pdev->dev);
> +			do_resume = true;
> +		}
> +		vdev->pm_runtime_resumed = false;
>  		vdev->pm_runtime_engaged = false;
> -		pm_runtime_get_noresume(&pdev->dev);
> -		do_resume = true;
>  	}
>  	up_write(&vdev->memory_lock);
>
> @@ -1227,12 +1265,17 @@ static int vfio_pci_pm_validate_flags(u32 flags)
>  	if (!flags)
>  		return -EINVAL;
>  	/* Only valid flags should be set */
> -	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT |
> +	    VFIO_PM_LOW_POWER_REENTERY_DISABLE))
>  		return -EINVAL;
>  	/* Both enter and exit should not be set */
>  	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
>  	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
>  		return -EINVAL;
> +	/* re-entry disable can only be set with enter */
> +	if ((flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE) &&
> +	    !(flags & VFIO_PM_LOW_POWER_ENTER))
> +		return -EINVAL;
>
>  	return 0;
>  }
> @@ -1255,10 +1298,17 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>
>  	if (flags & VFIO_DEVICE_FEATURE_GET) {
>  		down_read(&vdev->memory_lock);
> -		if (vdev->pm_runtime_engaged)
> +		if (vdev->pm_runtime_engaged) {
>  			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
> -		else
> +			if (!vdev->pm_runtime_reentry_allowed)
> +				vfio_pm.flags |=
> +					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
> +		} else {
>  			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
> +			if (!vfio_pci_low_power_reentry_allowed(pdev))
> +				vfio_pm.flags |=
> +					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
> +		}
>  		up_read(&vdev->memory_lock);
>
>  		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> @@ -1286,6 +1336,19 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>  		}
>
>  		vdev->pm_runtime_engaged = true;
> +		vdev->pm_runtime_resumed = false;
> +
> +		/*
> +		 * If there is any access while the device is in the runtime
> +		 * suspended state, then the device will be resumed first
> +		 * before the access and then suspended again. Check if this
> +		 * second suspend is allowed and track it in the
> +		 * 'pm_runtime_reentry_allowed' flag.
> +		 */
> +		vdev->pm_runtime_reentry_allowed =
> +			vfio_pci_low_power_reentry_allowed(pdev) &&
> +			!(vfio_pm.flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE);
> +
>  		up_write(&vdev->memory_lock);
>  		pm_runtime_put(&pdev->dev);
>  	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
> @@ -1296,6 +1359,20 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>  		}
>
>  		vdev->pm_runtime_engaged = false;
> +		if (vdev->pm_runtime_resumed) {
> +			vdev->pm_runtime_resumed = false;
> +			up_write(&vdev->memory_lock);
> +			return 0;
> +		}
> +
> +		/*
> +		 * The 'memory_lock' will be acquired again inside the runtime
> +		 * resume callback. So, increment the usage count inside the
> +		 * lock and call pm_runtime_resume() after releasing the lock.
> +		 * If there is any race between a wake-up generated at the
> +		 * host and the current path, the incremented usage count will
> +		 * prevent the device from going into the suspended state.
> +		 */
>  		pm_runtime_get_noresume(&pdev->dev);
>  		up_write(&vdev->memory_lock);
>  		ret = pm_runtime_resume(&pdev->dev);
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index bf4823b008f9..18cc83b767b8 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -126,6 +126,8 @@ struct vfio_pci_core_device {
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
>  	bool			pm_runtime_engaged;
> +	bool			pm_runtime_resumed;
> +	bool			pm_runtime_reentry_allowed;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;
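
P.S. For anyone following along, the ordering the commit message relies
on in the exit path boils down to the sketch below; this is condensed
from the patch for discussion, not the literal driver code:

#include <linux/pm_runtime.h>
#include <linux/vfio_pci_core.h>

static int low_power_exit_sketch(struct vfio_pci_core_device *vdev)
{
	struct device *dev = &vdev->pdev->dev;

	down_write(&vdev->memory_lock);
	vdev->pm_runtime_engaged = false;

	/*
	 * Take a usage count reference before dropping the lock: if a
	 * host-side wake-up races with this path, the elevated count
	 * keeps runtime PM from suspending the device again behind the
	 * guest's back (the step (e) problem above).
	 */
	pm_runtime_get_noresume(dev);
	up_write(&vdev->memory_lock);

	/*
	 * Resume outside 'memory_lock' because the runtime resume
	 * callback takes the same lock itself.
	 */
	return pm_runtime_resume(dev);
}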