From: Alex Williamson <alex.williamson@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Cornelia Huck <cohuck@redhat.com>,
Yishai Hadas <yishaih@nvidia.com>,
bhelgaas@google.com, saeedm@nvidia.com,
linux-pci@vger.kernel.org, kvm@vger.kernel.org,
netdev@vger.kernel.org, kuba@kernel.org, leonro@nvidia.com,
kwankhede@nvidia.com, mgurtovoy@nvidia.com, maorg@nvidia.com,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
Date: Thu, 28 Oct 2021 09:30:35 -0600 [thread overview]
Message-ID: <20211028093035.17ecbc5d.alex.williamson@redhat.com> (raw)
In-Reply-To: <20211027192345.GJ2744544@nvidia.com>
On Wed, 27 Oct 2021 16:23:45 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Wed, Oct 27, 2021 at 01:05:20PM -0600, Alex Williamson wrote:
>
> > > As far as the actual issue, if you hadn't just discovered it now
> > > nobody would have known we have this gap - much like how the very
> > > similar reset issue was present in VFIO for so many years until you
> > > plugged it.
> >
> > But the fact that we did discover it is hugely important. We've
> > identified that the potential use case is significantly limited and
> > that userspace doesn't have a good mechanism to determine when to
> > expose that limitation to the user.
>
> Huh?
>
> We've identified that, depending on device behavior, the kernel may
> need to revoke MMIO access to protect itself from hostile userspace
> triggering TLP Errors or something.
>
> Well behaved userspace must already stop touching the MMIO on the
> device when !RUNNING - I see no compelling argument against that
> position.
Not touching MMIO is not specified in our uAPI protocol, nor is it an
obvious assumption to me, nor is it sufficient to assume well behaved
userspace in the implementation of a kernel interface.
> We've been investigating how the mlx5 HW will behave in corner cases,
> and currently it looks like mlx5 vfio will not generate error TLPs, or
> corrupt the device itself due to MMIO operations when !RUNNING. So the
> driver itself, as written, probably does not currently have a bug
> here, or need changes.
This is a system level observation or is it actually looking at the
bus? An Unsupported Request on MMIO write won't even generate an AER
on some systems, but others can trigger a fatal error on others.
That sounds like potentially good news, but either way we're still also
discussing a fundamental gap in the uAPI for quiescing multiple devices
in a coordinated way and how we actually define !_RUNNING.
> > We're tossing around solutions that involve extensions, if not
> > changes to the uAPI. It's Wednesday of rc7.
>
> The P2P issue is seperate, and as I keep saying, unless you want to
> block support for any HW that does not have freeze&queice userspace
> must be aware of this ability and it is logical to design it as an
> extension from where we are now.
Is this essentially suggesting that the uAPI be clarified to state
that the base implementation is only applicable to userspace contexts
with a single migratable vfio device instance? Does that need to
preemptively include /dev/iommu generically, ie. anything that could
potentially have an IOMMU mapping to the device?
I agree that it would be easier to add a capability to expose
multi-device compatibility than to try to retrofit one to expose a
restriction.
> > I feel like we've already been burned by making one of these
> > "reasonable quanta of progress" to accept and mark experimental
> > decisions with where we stand between defining the uAPI in the kernel
> > and accepting an experimental implementation in QEMU.
>
> I won't argue there..
>
> > Now we have multiple closed driver implementations (none of which
> > are contributing to this discussion), but thankfully we're not
> > committed to supporting them because we have no open
> > implementations. I think we could get away with ripping up the uAPI
> > if we really needed to.
>
> Do we need to?
I'd prefer not.
> > > > Deciding at some point in the future to forcefully block device MMIO
> > > > access from userspace when the device stops running is clearly a user
> > > > visible change and therefore subject to the don't-break-userspace
> > > > clause.
> > >
> > > I don't think so, this was done for reset retroactively after
> > > all. Well behaved qmeu should have silenced all MMIO touches as part
> > > of the ABI contract anyhow.
> >
> > That's not obvious to me and I think it conflates access to the device
> > and execution of the device. If it's QEMU's responsibility to quiesce
> > access to the device anyway, why does the kernel need to impose this
> > restriction. I'd think we'd generally only impose such a restriction
> > if the alternative allows the user to invoke bad behavior outside the
> > scope of their use of the device or consistency of the migration data.
> > It appears that any such behavior would be implementation specific here.
>
> I think if an implementation has a problem, like error TLPs, then yes,
> it must fence. The conservative specification of the uAPI is that
> userspace should not allow MMIO when !RUNNING.
>
> If we ever get any implementation that needs this to fence then we
> should do it for all implementations just out of consistency.
Like I've indicated, this is not an obvious corollary of the !_RUNNING
state to me. I'd tend more towards letting userspace do what they want
and only restrict as necessary to protect the host. For example the
state of the device when !_RUNNING may be changed by external stimuli,
including MMIO and DMA accesses, but the device does not independently
advance state.
Also, I think we necessarily require config space read-access to
support migration, which begs the question specifically which regions,
if any, are restricted when !_RUNNING? Could we get away with zapping
mmaps (sigbus on fault) but allowing r/w access?
> > > The "don't-break-userspace" is not an absolute prohibition, Linus has
> > > been very clear this limitation is about direct, ideally demonstrable,
> > > breakage to actually deployed software.
> >
> > And if we introduce an open driver that unblocks QEMU support to become
> > non-experimental, I think that's where we stand.
>
> Yes, if qemu becomes deployed, but our testing shows qemu support
> needs a lot of work before it is deployable, so that doesn't seem to
> be an immediate risk.
Good news... I guess... but do we know what other uAPI changes might
be lurking without completing that effort?
> > > > That might also indicate that "freeze" is only an implementation
> > > > specific requirement. Thanks,
> > >
> > > It doesn't matter if a theoretical device can exist that doesn't need
> > > "freeze" - this device does, and so it is the lowest common
> > > denominator for the uAPI contract and userspace must abide by the
> > > restriction.
> >
> > Sorry, "to the victor go the spoils" is not really how I strictly want
> > to define a uAPI contract with userspace.
>
> This is not the "victor go the spoils" this is meeting the least
> common denominator of HW we have today.
>
> If some fictional HW can be more advanced and can snapshot not freeze,
> that is great, but it doesn't change one bit that mlx5 cannot and will
> not work that way. Since mlx5 must be supported, there is no choice
> but to define the uAPI around its limitations.
But it seems like you've found that mlx5 is resilient to these things
that you're also deeming necessary to restrict.
> snapshot devices are strictly a superset of freeze devices, they can
> emulate freeze by doing snapshot at the freeze operation.
True.
> In all cases userspace should not touch the device when !RUNNING to
> preserve generality to all implementations.
Not and obvious conclusion to me.
> > If we're claiming that userspace is responsible for quiescing
> > devices and we're providing a means for that to occur, and userspace
> > is already responsible for managing MMIO access, then the only
> > reason the kernel would forcefully impose such a restriction itself
> > would be to protect the host and the implementation of that would
> > depend on whether this is expected to be a universal or device
> > specific limitation.
>
> I think the best way forward is to allow for revoke to happen if we
> ever need it (by specification), and not implement it right now.
>
> So, I am not left with a clear idea what is still open that you see as
> blocking. Can you summarize?
It seems we have numerous uAPI questions floating around, including
whether the base specification is limited to a single physical device
within the user's IOMMU context, what the !_RUNNING state actually
implies about the device state, expectations around userspace access
to device regions while in this state, and who is responsible for
limiting such access, and uncertainty what other uAPI changes are
necessary as QEMU support is stabilized.
Why should we rush a driver in just before the merge window and
potentially increase our experimental driver debt load rather than
continue to co-develop kernel and userspace drivers and maybe also
get input from the owners of the existing out-of-tree drivers? Thanks,
Alex
next prev parent reply other threads:[~2021-10-28 15:30 UTC|newest]
Thread overview: 100+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-19 10:58 [PATCH V2 mlx5-next 00/14] Add mlx5 live migration driver Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 01/14] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 02/14] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 03/14] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 04/14] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 05/14] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 06/14] vdpa/mlx5: Use mlx5_vf_get_core_dev() to get PF device Yishai Hadas
2021-10-19 11:16 ` Max Gurtovoy
2021-10-20 8:58 ` Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 07/14] vfio: Fix VFIO_DEVICE_STATE_SET_ERROR macro Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 08/14] vfio: Add a macro for VFIO_DEVICE_STATE_ERROR Yishai Hadas
2021-10-19 15:48 ` Alex Williamson
2021-10-19 15:50 ` Alex Williamson
2021-10-20 7:35 ` Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 09/14] vfio/pci_core: Make the region->release() function optional Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 10/14] net/mlx5: Introduce migration bits and structures Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 11/14] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
2021-10-19 18:43 ` Alex Williamson
2021-10-19 19:23 ` Jason Gunthorpe
2021-10-19 20:58 ` Alex Williamson
2021-10-19 23:04 ` Jason Gunthorpe
2021-10-20 8:28 ` Yishai Hadas
2021-10-20 16:52 ` Alex Williamson
2021-10-20 18:59 ` Jason Gunthorpe
2021-10-20 21:07 ` Alex Williamson
2021-10-21 9:34 ` Cornelia Huck
2021-10-21 21:47 ` Alex Williamson
2021-10-25 12:29 ` Jason Gunthorpe
2021-10-25 14:28 ` Alex Williamson
2021-10-25 14:56 ` Jason Gunthorpe
2021-10-26 14:42 ` Alex Williamson
2021-10-26 15:18 ` Jason Gunthorpe
2021-10-26 19:50 ` Alex Williamson
2021-10-26 23:43 ` Jason Gunthorpe
2021-10-27 19:05 ` Alex Williamson
2021-10-27 19:23 ` Jason Gunthorpe
2021-10-28 15:08 ` Cornelia Huck
2021-10-29 0:26 ` Jason Gunthorpe
2021-10-29 7:35 ` Yishai Hadas
2021-10-28 15:30 ` Alex Williamson [this message]
2021-10-28 23:47 ` Jason Gunthorpe
2021-10-29 6:57 ` Cornelia Huck
2021-10-29 7:48 ` Yishai Hadas
2021-10-29 10:32 ` Shameerali Kolothum Thodi
2021-10-29 12:15 ` Jason Gunthorpe
2021-10-29 22:06 ` Alex Williamson
2021-11-01 17:25 ` Jason Gunthorpe
2021-11-02 11:19 ` Shameerali Kolothum Thodi
2021-11-02 14:56 ` Alex Williamson
2021-11-02 15:54 ` Jason Gunthorpe
2021-11-02 16:22 ` Alex Williamson
2021-11-02 16:36 ` Jason Gunthorpe
2021-11-02 20:15 ` Alex Williamson
2021-11-03 12:09 ` Jason Gunthorpe
2021-11-03 15:44 ` Alex Williamson
2021-11-03 16:10 ` Jason Gunthorpe
2021-11-03 18:04 ` Alex Williamson
2021-11-04 11:19 ` Cornelia Huck
2021-11-05 16:53 ` Cornelia Huck
2021-11-16 16:59 ` Cornelia Huck
2021-11-05 13:24 ` Jason Gunthorpe
2021-11-05 15:31 ` Alex Williamson
2021-11-15 23:29 ` Jason Gunthorpe
2021-11-16 17:57 ` Alex Williamson
2021-11-16 19:25 ` Jason Gunthorpe
2021-11-16 21:10 ` Alex Williamson
2021-11-17 1:48 ` Jason Gunthorpe
2021-11-18 18:15 ` Alex Williamson
2021-11-22 19:18 ` Jason Gunthorpe
2021-11-08 8:53 ` Tian, Kevin
2021-11-08 12:35 ` Jason Gunthorpe
2021-11-09 0:58 ` Tian, Kevin
2021-11-09 12:45 ` Jason Gunthorpe
2021-10-25 16:34 ` Dr. David Alan Gilbert
2021-10-25 17:55 ` Alex Williamson
2021-10-25 18:47 ` Dr. David Alan Gilbert
2021-10-25 19:15 ` Jason Gunthorpe
2021-10-26 8:40 ` Dr. David Alan Gilbert
2021-10-26 12:13 ` Jason Gunthorpe
2021-10-26 14:52 ` Alex Williamson
2021-10-26 15:56 ` Jason Gunthorpe
2021-10-26 14:29 ` Alex Williamson
2021-10-26 14:51 ` Dr. David Alan Gilbert
2021-10-26 15:25 ` Jason Gunthorpe
2021-10-20 8:01 ` Yishai Hadas
2021-10-20 16:25 ` Jason Gunthorpe
2021-10-21 10:46 ` Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 13/14] vfio/pci: Expose vfio_pci_aer_err_detected() Yishai Hadas
2021-10-19 10:58 ` [PATCH V2 mlx5-next 14/14] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
2021-10-19 18:55 ` Alex Williamson
2021-10-19 19:10 ` Jason Gunthorpe
2021-10-20 8:46 ` Yishai Hadas
2021-10-20 16:46 ` Jason Gunthorpe
2021-10-20 17:45 ` Alex Williamson
2021-10-20 18:57 ` Jason Gunthorpe
2021-10-20 21:38 ` Alex Williamson
2021-10-21 10:39 ` Yishai Hadas
2021-11-17 16:42 ` vfio migration discussions (was: [PATCH V2 mlx5-next 00/14] Add mlx5 live migration driver) Cornelia Huck
2021-11-17 17:47 ` Jason Gunthorpe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20211028093035.17ecbc5d.alex.williamson@redhat.com \
--to=alex.williamson@redhat.com \
--cc=bhelgaas@google.com \
--cc=cohuck@redhat.com \
--cc=dgilbert@redhat.com \
--cc=jgg@nvidia.com \
--cc=kuba@kernel.org \
--cc=kvm@vger.kernel.org \
--cc=kwankhede@nvidia.com \
--cc=leonro@nvidia.com \
--cc=linux-pci@vger.kernel.org \
--cc=maorg@nvidia.com \
--cc=mgurtovoy@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=saeedm@nvidia.com \
--cc=yishaih@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).