Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Alex Williamson <alex.williamson@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Yishai Hadas <yishaih@nvidia.com>,
	bhelgaas@google.com, saeedm@nvidia.com,
	linux-pci@vger.kernel.org, kvm@vger.kernel.org,
	netdev@vger.kernel.org, kuba@kernel.org, leonro@nvidia.com,
	kwankhede@nvidia.com, mgurtovoy@nvidia.com, maorg@nvidia.com
Subject: Re: [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2
Date: Wed, 2 Feb 2022 16:36:56 -0700	[thread overview]
Message-ID: <20220202163656.4c0cc386.alex.williamson@redhat.com> (raw)
In-Reply-To: <20220202002459.GP1786498@nvidia.com>

On Tue, 1 Feb 2022 20:24:59 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 01, 2022 at 02:49:16PM -0700, Alex Williamson wrote:
> > On Tue, 1 Feb 2022 14:36:20 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Feb 01, 2022 at 10:04:08AM -0700, Alex Williamson wrote:
> > >   
> > > > Ok, let me parrot back to see if I understand.  -ENOTTY will be
> > > > returned if the ioctl doesn't exist, in which case device_state is
> > > > untouched and cannot be trusted.  At the same time, we expect the user
> > > > to use the feature ioctl to make sure the ioctl exists, so it would
> > > > seem that we've reclaimed that errno if we believe the user should
> > > > follow the protocol.    
> > > 
> > > I don't follow - the documentation says what the code does, if you get
> > > ENOTTY returned then you don't get the device_state too. Saying the
> > > user shouldn't have called it in the first place is completely
> > > correct, but doesn't change the device_state output.  
> > 
> > The documentation says "...the device state output is not reliable", and
> > I have to question whether this qualifies as a well specified,
> > interoperable spec with such language.  We're essentially asking users
> > to keep track that certain errnos result in certain fields of the
> > structure _maybe_ being invalid.  
> 
> So you are asking to remove "is not reliable" and just phrase is as:
> 
> "device_state is updated to the current value when -1 is returned,
> except when these XXX errnos are returned?
> 
> (actually userspace can tell directly without checking the errno - as
> if -1 is returned the device_state cannot be the requested target
> state anyhow)

If we decide to keep the existing code, then yes the spec should
indicate the device_state is invalid, not just unreliable for those
errnos, but I'm also of the opinion that returning an error condition
AND providing valid data in the return structure for all but a few
errnos and expecting userspace to get this correct is not a good API.

> > Now you're making me wonder how much I care to invest in semantic
> > arguments over extended errnos :-\  
> 
> Well, I know I don't :) We don't have consistency in the kernel and
> userspace is hard pressed to make any sense of it most of the time,
> IMHO. It just doesn't practically matter..
> 
> > > We don't know the device_state in the core code because it can only be
> > > read under locking that is controlled by the driver. I hope when we
> > > get another driver merged that we can hoist the locking, but right now
> > > I'm not really sure - it is a complicated lock.  
> > 
> > The device cannot self transition to a new state, so if the core were
> > to serialize this ioctl then the device_state provided by the driver is
> > valid, regardless of its internal locking.  
> 
> It is allowed to transition to RUNNING due to reset events it captures
> and since we capture the reset through the PCI hook, not from VFIO,
> the core code doesn't synchronize well. See patch 14

Looking... your .reset_done() function sets a deferred_reset flag and
attempts to grab the state_mutex.  If there's contention on that mutex,
exit since the lock holder will perform the state transition when
dropping that mutex, otherwise reset_done will itself drop the mutex to
do that state change.  The reset_lock assures that we cannot race as the
state_mutex is being released.

So the scenario is that the user MUST be performing a reset coincident
to accessing the device_state and the solution is that the user's
SET_STATE returns success and a new device state that's already bogus
due to the reset.  Why wouldn't the solution here be to return -EAGAIN
to the user or reattempt the SET_STATE since the user is clearly now
disconnected from the actual device_state?

> > Whether this ioctl should be serialized anyway is probably another good
> > topic to breach.  Should a user be able to have concurrent ioctls
> > setting conflicting states?  
> 
> The driver is required to serialize, the core code doesn't touch any
> global state and doesn't need serializing.
> 
> > I'd suggest that ioctl return structure is only valid at all on
> > success and we add a GET interface to return the current device  
> 
> We can do this too, but it is a bunch of code to achieve this and I
> don't have any use case to read back the device_state beyond debugging
> and debugging is fine with this. IMHO

A bunch of code?  If we use a FEATURE ioctl, it just extends the
existing implementation to add GET support.  That looks rather trivial.
That seems like a selling point for using the FEATURE ioctl TBH.

> > It's entirely possible that I'm overly averse to ioctl proliferation,
> > but for every new ioctl we need to take a critical look at the proposed
> > API, use case, applicability, and extensibility.    
> 
> This is all basicly the same no matter where it is put, the feature
> multiplexer is just an ioctl in some semi-standard format, but the
> vfio pattern of argsz/flags is also a standard format that is
> basically the same thing.
> 
> We still need to think about extensibility, alignment, etc..
> 
> The problem I usually see with ioctls is not proliferation, but ending
> up with too many choices and a big ?? when it comes to adding
> something new.
> 
> Clear rules where things should go and why is the best, it matters
> less what the rules actually are IMHO.
> 
> > > I don't want to touch capabilities, but we can try to use feature for
> > > set state. Please confirm this is what you want.  
> > 
> > It's a team sport, but to me it seems like it fits well both in my
> > mental model of interacting with a device feature, without
> > significantly altering the uAPI you're defining anyway.  
> 
> Well, my advice is that ioctls are fine, and a bit easier all around.
> eg strace and syzkaller are a bit easier if everything neatly maps
> into one struct per ioctl - their generator tools are optimized for
> this common case.
> 
> Simple multiplexors are next-best-fine, but there should be a clear
> idea when to use the multiplexer, or not.
> 
> Things like the cap chains enter a whole world of adventure for
> strace/syzkaller :)

vfio's argsz/flags is not only a standard framework, but it's one that
promotes extensions.  We were able to add capability chains with
backwards compatibility because of this design.  IMO, that's avoided
ioctl sprawl; we've been able to maintain a fairly small set of core
ioctls rather than add add a new ioctl every time we want to describe
some new property of a device or region or IOMMU.  I think that
improves the usability of the uAPI.  I certainly wouldn't want to
program to a uAPI with a million ioctls.  A counter argument is that
we're making the interface more complex, but at the same time we're
adding shared infrastructure for dealing with that complexity.

Of course we do continue to add new ioctls as necessary, including this
FEATURE ioctl, and I recognize that with such a generic multiplexer we
run the risk of over using it, ie. everything looks like a nail.  You
initially did not see the fit for setting device state as interacting
with a device feature, but it doesn't seem like you had a strong
objection to my explanation of it in that context.

So I think if the FEATURE ioctl has an ongoing place in our uAPI (using
it to expose migration flags would seem to be a point in that
direction) and it doesn't require too many contortions to think of the
operation we're trying to perform on the device as interacting with a
device FEATURE, and there are no functional or performance implications
of it, I would think we should use it.  To do otherwise would suggest
that we should consider the FEATURE ioctl a failed experiment and not
continue to expand its use.

I'd be interested to hear more input on this from the community.

> > > You'll want the same for the PRE_COPY related information too?  
> > 
> > I hadn't gotten there yet.  It seems like a discontinuity to me that
> > we're handing out new FDs for data transfer sessions, but then we
> > require the user to come back to the device to query about the data its
> > reading through that other FD.    
> 
> An earlier draft of this put it on the data FD, but v6 made it fully
> optional with no functional impact on the data FD. The values decrease
> as the data FD progresses and increases as the VM dirties data - ie it
> is 50/50 data_fd/device behavior.
> 
> It doesn't matter which way, but it feels quite weird to have the main
> state function is a FEATURE and the precopy query is an ioctl.

If the main state function were a FEATURE ioctl on the device and the
data transfer query was an ioctl on the FD returned from that feature
ioctl, I don't see how that's weird at all.  Different FDs, different
interfaces.

To me, the device has provided a separate FD for data transfer, so the
fact that we consume the data via that FD, but monitor our progress in
consuming that data back on the device FD is a bit strange.

> > Should that be an ioctl on the data stream FD itself?    
> 
> I can be. Implementation wise it is about a wash.
> 
> > Is there a use case for also having it on the STOP_COPY FD?  
> 
> I didn't think of one worthwhile enough to mandate implementing it in
> every driver.

Can the user perform an lseek(2) on the migration FD?  Maybe that would
be the difference between what we need for PRE_COPY vs STOP_COPY.  In
the latter case the data should be a fixes size and perhaps we don't
need another interface to know how much data to expect.

One use case would be that we want to be able to detect whether we can
meet service guarantees as quickly as possible with the minimum
resource consumption and downtime.  If we can determine from the device
that we can't possibly transfer its state in the required time, we can
abort immediately without waiting for a downtime exception or flooding
the migration link.  Thanks,

Alex

next prev parent reply	other threads:[~2022-02-02 23:37 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-30 16:08 [PATCH V6 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl Yishai Hadas
2022-01-31 23:41   ` Alex Williamson
2022-02-01  0:11     ` Jason Gunthorpe
2022-02-01 15:47       ` Alex Williamson
2022-02-01 15:49         ` Jason Gunthorpe
2022-01-30 16:08 ` [PATCH V6 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
2022-01-31 23:43   ` Alex Williamson
2022-02-01  0:31     ` Jason Gunthorpe
2022-02-01 17:04       ` Alex Williamson
2022-02-01 18:36         ` Jason Gunthorpe
2022-02-01 21:49           ` Alex Williamson
2022-02-02  0:24             ` Jason Gunthorpe
2022-02-02 23:36               ` Alex Williamson [this message]
2022-02-03 14:17                 ` Jason Gunthorpe
2022-02-04 12:12                 ` Cornelia Huck
2022-02-03 15:51               ` Tarun Gupta (SW-GPU)
2022-02-01 12:06   ` Cornelia Huck
2022-02-01 12:10     ` Jason Gunthorpe
2022-02-01 12:18       ` Cornelia Huck
2022-02-01 12:27         ` Jason Gunthorpe
2022-01-30 16:08 ` [PATCH V6 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
2022-02-01 11:54   ` Cornelia Huck
2022-02-01 12:13     ` Jason Gunthorpe
2022-02-01 18:31   ` Alex Williamson
2022-02-01 18:53     ` Jason Gunthorpe
2022-02-01 19:13       ` Alex Williamson
2022-02-01 19:50         ` Jason Gunthorpe
2022-02-02 23:54           ` Alex Williamson
2022-02-03 14:22             ` Jason Gunthorpe
2022-01-30 16:08 ` [PATCH V6 mlx5-next 10/15] vfio: Remove migration protocol v1 Yishai Hadas
2022-02-01 11:23   ` Cornelia Huck
2022-02-01 12:13     ` Jason Gunthorpe
2022-02-01 12:39       ` Cornelia Huck
2022-02-01 12:54         ` Jason Gunthorpe
2022-02-01 13:26           ` Cornelia Huck
2022-02-01 13:52             ` Jason Gunthorpe
2022-02-01 14:19               ` Cornelia Huck
2022-02-01 14:29                 ` Jason Gunthorpe
2022-02-02 11:34                   ` Cornelia Huck
2022-02-02 12:22                     ` Jason Gunthorpe
2022-02-01 23:01         ` Alex Williamson
2022-02-02  0:28           ` Jason Gunthorpe
2022-02-02 11:38           ` Cornelia Huck
2022-01-30 16:08 ` [PATCH V6 mlx5-next 11/15] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected() Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
2022-01-30 16:08 ` [PATCH V6 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220202163656.4c0cc386.alex.williamson@redhat.com \
    --to=alex.williamson@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=jgg@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=leonro@nvidia.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=maorg@nvidia.com \
    --cc=mgurtovoy@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).