All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Jake Oshins <jakeo@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>,
	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	"wei.liu@kernel.org" <wei.liu@kernel.org>,
	KY Srinivasan <kys@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Stephen Hemminger <sthemmin@microsoft.com>,
	"linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Michael Kelley (LINUX)" <mikelley@microsoft.com>,
	"robh@kernel.org" <robh@kernel.org>,
	"kw@linux.com" <kw@linux.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time
Date: Thu, 28 Apr 2022 14:12:13 -0500	[thread overview]
Message-ID: <20220428191213.GA36573@bhelgaas> (raw)
In-Reply-To: <SN4PR2101MB0878E466880C047D3A0D0C92ABFB9@SN4PR2101MB0878.namprd21.prod.outlook.com>

On Tue, Apr 26, 2022 at 07:25:43PM +0000, Jake Oshins wrote:
> > -----Original Message-----
> > From: Dexuan Cui <decui@microsoft.com>
> > Sent: Tuesday, April 26, 2022 11:32 AM
> > To: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > Cc: Jake Oshins <jakeo@microsoft.com>; Bjorn Helgaas <helgaas@kernel.org>;
> > bhelgaas@google.com; Alex Williamson <alex.williamson@redhat.com>;
> > wei.liu@kernel.org; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> > <haiyangz@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>;
> > linux-hyperv@vger.kernel.org; linux-pci@vger.kernel.org; linux-
> > kernel@vger.kernel.org; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> > robh@kernel.org; kw@linux.com; kvm@vger.kernel.org
> > Subject: RE: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce
> > VM boot time
> > 
> > > From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> > > Sent: Tuesday, April 26, 2022 9:45 AM
> > > > ...
> > > > Sorry I don't quite follow. pci-hyperv allocates MMIO for the bridge
> > > > window in hv_pci_allocate_bridge_windows() and registers the MMIO
> > > > ranges to the core PCI driver via pci_add_resource(), and later the
> > > > core PCI driver probes the bus/device(s), validates the BAR sizes
> > > > and the pre-initialized BAR values, and uses the BAR configuration.
> > > > IMO the whole process doesn't require the bit PCI_COMMAND_MEMORY to
> > > > be pre-set, and there should be no issue to delay setting the bit to
> > > > a PCI device device's .probe() -> pci_enable_device().
> > >
> > > IIUC you want to bootstrap devices with PCI_COMMAND_MEMORY clear
> > > (otherwise PCI core would toggle it on and off for eg BAR sizing).
> > >
> > > Is that correct ?
> > 
> > Yes, that's the exact purpose of this patch.
> > 
> > Do you see any potential architectural issue with the patch?
> > From my reading of the core PCI code, it looks like this should be safe.

I don't know much about Hyper-V, but in general I don't think the PCI
core should turn on PCI_COMMAND_MEMORY at all unless a driver requests
it.  I assume that if a guest OS depends on PCI_COMMAND_MEMORY being
set, guest firmware would take care of setting it.

> > Jake has some concerns that I don't quite follow.
> > @Jake, could you please explain the concerns with more details?
> 
> First, let me say that I really don't know whether this is an issue.
> I know it's an issue with other operating system kernels.  I'm
> curious whether the Linux kernel / Linux PCI driver would behave in
> a way that has an issue here.
> 
> The VM has a window of address space into which it chooses to put
> PCI device's BARs.  The guest OS will generally pick the value that
> is within the BAR, by default, but it can theoretically place the
> device in any free address space.  The subset of the VM's memory
> address space which can be populated by devices' BARs is finite, and
> generally not particularly large.
> 
> Imagine a VM that is configured with 25 NVMe controllers, each of
> which requires 64KiB of address space.  (This is just an example.)
> At first boot, all of these NVMe controllers are packed into address
> space, one after the other.
> 
> While that VM is running, one of the 25 NVMe controllers fails and
> is replaced with an NVMe controller from a separate manufacturer,
> but this one requires 128KiB of memory, for some reason.  Perhaps it
> implements the "controller buffer" feature of NVMe.  It doesn't fit
> in the hole that was vacated by the failed NVMe controller, so it
> needs to be placed somewhere else in address space.  This process
> continues over months, with several more failures and replacements.
> Eventually, the address space is very fragmented.
> 
> At some point, there is an attempt to place an NVMe controller into
> the VM but there is no contiguous block of address space free which
> would allow that NVMe controller to operate.  There is, however,
> enough total address space if the other, currently functioning, NVMe
> controllers are moved from the address space that they are using to
> other ranges, consolidating their usage and reducing fragmentation.
> Let's call this a rebalancing of memory resources.
> 
> When the NVMe controllers are moved, a new value is written into
> their BAR.  In general, the PCI spec would require that you clear
> the memory enable bit in the command register (PCI_COMMAND_MEMORY)
> during this move operation, both so that there's never a moment when
> two devices are occupying the same address space and because writing
> a 64-bit BAR atomically isn't possible.  This is the reason that I
> originally wrote the code in this driver to unmap the device from
> the VM's address space when the memory enable bit is cleared.
> 
> What I don't know is whether this sequence of operations can ever
> happen in Linux, or perhaps in a VM running Linux.  Will it
> rebalance resources in order to consolidate address space?  If it
> will, will this involve clearing the memory enable bit to ensure
> that two devices never overlap?

This sequence definitely can occur in Linux, but it hasn't yet become
a real priority.  But we do already have issues with assigning space
for hot-added devices in general, especially if firmware hasn't
assigned large windows to things like Thunderbolt controllers.  I
suspect that we have or will soon have issues where resource
assignment starts failing after a few hotplugs, e.g., dock/undock
events.

There have been patches posted to rebalance resources (quiesce
drivers, reassign, restart drivers), but they haven't gone anywhere
yet for lack of interest and momentum.  I do feel like we're the
tracks and the train is coming, though ;)

Bjorn

  parent reply	other threads:[~2022-04-28 19:12 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-21 18:26 [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time Dexuan Cui
2022-04-26 16:44 ` Lorenzo Pieralisi
2022-04-26 18:31   ` Dexuan Cui
2022-04-26 19:25     ` Jake Oshins
2022-04-27 22:41       ` Alex Williamson
2022-04-28 19:12       ` Bjorn Helgaas [this message]
2022-04-28 19:21         ` [EXTERNAL] " Jake Oshins
2022-04-29  1:11           ` Dexuan Cui
2022-04-29  9:43             ` Lorenzo Pieralisi
  -- strict thread matches above, loose matches on Subject: below --
2022-04-19 22:00 Dexuan Cui
2022-04-20 14:47 ` Michael Kelley (LINUX)
2022-04-29 10:15 ` Lorenzo Pieralisi
2022-04-29 15:47   ` Dexuan Cui

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220428191213.GA36573@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=alex.williamson@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=decui@microsoft.com \
    --cc=haiyangz@microsoft.com \
    --cc=jakeo@microsoft.com \
    --cc=kvm@vger.kernel.org \
    --cc=kw@linux.com \
    --cc=kys@microsoft.com \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lorenzo.pieralisi@arm.com \
    --cc=mikelley@microsoft.com \
    --cc=robh@kernel.org \
    --cc=sthemmin@microsoft.com \
    --cc=wei.liu@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.