[virtio-comment] [PATCH v1 0/8] Introduce device migration support commands


* [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands
@ 2023-10-08 11:25 Parav Pandit
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration Parav Pandit
                   ` (7 more replies)
  0 siblings, 8 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-08 11:25 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit

This series introduces administration commands for member device migration
for PCI transport; when needed it can be extended for other transports
too.

Use case requirements:
======================
1. A hypervisor system needs to provide a PCI VF as a passthrough
   device to the guest virtual machine and also support live
   migration of this virtual machine.
2. A virtual machine may have one or more such passthrough
   virtio devices.
3. A virtual machine may have other PCI passthrough devices
   which may also interact with the virtio devices.
4. A hypervisor runs a vendor agnostic driver with extensions
   to support device migration.
5. A PCI VF passthrough device needs to support transparent
   device reset and PCI FLR while the device migration is
   ongoing.
6. An owner driver is not involved in mediating device operations
   for the passthrough device at the virtio interface level.
7. The mechanism is generic enough to apply to a large family of
   virtio devices, and it does not involve trapping any virtio
   device interfaces for the passthrough devices.

Overview:
=========
The above use case requirements are met by the PCI PF group owner
driver, which facilitates the member device migration functionality
using administration commands.

There are three major functionalities:

1. Suspend and resume the device operation
2. Read and write the device context containing all the information
   that can be transferred from source to destination to migrate
   a member device
3. Track pages written by the device while device migration is
   ongoing

This series introduces 4 infrastructure pieces covering PCI transport,
peer-to-peer PCI devices, page tracking (aka dirty page tracking) and
a generic device context:

1. Device mode get and set (active, stop, freeze)
2. Device context read and write
3. Device context definition
4. Write reporting to track page addresses

This series enables migration from one virtio PCI SR-IOV member device
to another. If and when needed, it can also be used to migrate to or
from a software-composed PCI virtio device, given software that can
parse the device context and compose such a device.

Example flow:
=============
Source hypervisor:
1. Instruct the device to start tracking the pages it writes
2. Periodically query the addresses of the written pages
3. Suspend the device operation
4. Read the device context and transfer it to the destination hypervisor

Destination hypervisor:
5. Write the device context received from the source
6. Resume the device, which now holds the newly written device context
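
The six steps above can be sketched roughly in C as follows. This is only
an illustration under stated assumptions: struct mock_member_dev and both
helper functions are hypothetical in-memory stand-ins, not interfaces from
this series, and step 2 (periodic query of written pages) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mode values follow the table proposed in patch 1; names are hypothetical. */
enum dev_mode { DEV_MODE_ACTIVE = 0x0, DEV_MODE_STOP = 0x1, DEV_MODE_FREEZE = 0x2 };

/* Hypothetical in-memory stand-in for a member device. */
struct mock_member_dev {
        enum dev_mode mode;
        int write_tracking;     /* non-zero while written pages are tracked */
        uint8_t ctx[64];        /* opaque device context bytes */
        size_t ctx_len;
};

/* Source side, steps 1, 3 and 4. Returns 0 on success, -1 if buf is too small. */
static int migrate_source(struct mock_member_dev *src, uint8_t *buf,
                          size_t cap, size_t *len)
{
        assert(src && buf && len);
        src->write_tracking = 1;                /* step 1: start tracking writes */
        src->mode = DEV_MODE_STOP;              /* step 3: suspend the device */
        src->mode = DEV_MODE_FREEZE;            /* context no longer changes */
        if (src->ctx_len > cap)
                return -1;
        memcpy(buf, src->ctx, src->ctx_len);    /* step 4: read the context */
        *len = src->ctx_len;
        src->write_tracking = 0;
        return 0;
}

/* Destination side, steps 5 and 6. Returns 0 on success, -1 on oversize input. */
static int migrate_destination(struct mock_member_dev *dst,
                               const uint8_t *buf, size_t len)
{
        assert(dst && buf);
        if (len > sizeof(dst->ctx))
                return -1;
        dst->mode = DEV_MODE_FREEZE;            /* write the context while frozen */
        memcpy(dst->ctx, buf, len);             /* step 5: write the context */
        dst->ctx_len = len;
        dst->mode = DEV_MODE_ACTIVE;            /* step 6: resume the device */
        return 0;
}
```

In a real system each assignment to mode would be a device mode set
administration command sent through the owner device, and the memcpy calls
would be device context read and write commands.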

Patch summary:
==============
patch-1: Adds theory of operation for device migration commands
patch-2: Redefines reserved2 as command specific output field
patch-3: Defines short device context for split virtqueues
patch-4: Adds device migration commands
patch-5: Adds requirements for device migration commands
patch-6: Adds theory of operation for write reporting commands
patch-7: Adds write reporting commands
patch-8: Adds requirements for write reporting commands

This series also takes inspiration from a similar idea presented at
KVM Forum [1].

Changelog:
==========
1. Enrich device context to cover device configuration layout, feature bits
2. Fixed alignment of device context fields
3. Added missing Sign-off for the joint work done with Satananda
4. Added link to the github issue

Please review.

[1] https://static.sched.com/hosted_files/kvmforum2022/3a/KVM22-Migratable-Vhost-vDPA.pdf

Parav Pandit (8):
  admin: Add theory of operation for device migration
  admin: Redefine reserved2 as command specific output
  device-context: Define the device context fields for device migration
  admin: Add device migration admin commands
  admin: Add requirements of device migration commands
  admin: Add theory of operation for write recording commands
  admin: Add write recording commands
  admin: Add requirements of write reporting commands

 admin-cmds-device-migration.tex | 574 ++++++++++++++++++++++++++++++++
 admin.tex                       |  38 ++-
 content.tex                     |   1 +
 device-context.tex              | 142 ++++++++
 4 files changed, 748 insertions(+), 7 deletions(-)
 create mode 100644 admin-cmds-device-migration.tex
 create mode 100644 device-context.tex

-- 
2.34.1

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


* [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-08 11:25 [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands Parav Pandit
@ 2023-10-08 11:25 ` Parav Pandit
  2023-10-09  8:49   ` Jason Wang
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 2/8] admin: Redefine reserved2 as command specific output Parav Pandit
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-08 11:25 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit

Passthrough PCI VF devices are widely used by virtual machines through
a generic kernel framework such as vfio [1].

A passthrough PCI VF device is fully owned by the virtual machine
device driver. This passthrough device controls its own device
reset flow, basic functionality such as PCI VF function level reset,
and the rest of the virtio device functionality such as the control vq,
config space access and data path descriptor handling.

Additionally, VM live migration using a precopy method is also widely used.

To support a VM live migration for such passthrough virtio devices,
the owner PCI PF device administers the device migration flow.

This patch introduces the basic theory of operation which describes the flow
and supporting administration commands.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/vfio.h?h=v6.1.47

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
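For reviewers who prefer code, here is a minimal C sketch of the three
modes described in this patch. The enum and helper names are hypothetical
and not part of the proposed specification text; only the values 0x0-0x2
and the per-mode behaviour are taken from the mode table in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from the mode table; the identifiers are hypothetical. */
enum member_dev_mode {
        MEMBER_DEV_MODE_ACTIVE = 0x0,
        MEMBER_DEV_MODE_STOP   = 0x1,
        MEMBER_DEV_MODE_FREEZE = 0x2
        /* 0x03-0xFF are reserved for future use */
};

/* True if the raw u8 value names one of the three defined modes. */
static bool member_dev_mode_is_valid(uint8_t mode)
{
        return mode <= MEMBER_DEV_MODE_FREEZE;
}

/* Only an Active device sends notifications and accesses driver memory. */
static bool member_dev_may_initiate_access(enum member_dev_mode mode)
{
        assert(member_dev_mode_is_valid((uint8_t)mode));
        return mode == MEMBER_DEV_MODE_ACTIVE;
}

/* Driver notifications are still accepted in Stop, but not in Freeze. */
static bool member_dev_accepts_driver_notifications(enum member_dev_mode mode)
{
        assert(member_dev_mode_is_valid((uint8_t)mode));
        return mode != MEMBER_DEV_MODE_FREEZE;
}

/* The device context can still change until the device is frozen. */
static bool member_dev_ctx_may_change(enum member_dev_mode mode)
{
        assert(member_dev_mode_is_valid((uint8_t)mode));
        return mode != MEMBER_DEV_MODE_FREEZE;
}
```

The last helper captures why a final device context read is expected to
happen in Freeze mode: it is the only mode in which the context is stable.
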
 admin-cmds-device-migration.tex | 94 +++++++++++++++++++++++++++++++++
 admin.tex                       |  1 +
 2 files changed, 95 insertions(+)
 create mode 100644 admin-cmds-device-migration.tex

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
new file mode 100644
index 0000000..f839af4
--- /dev/null
+++ b/admin-cmds-device-migration.tex
@@ -0,0 +1,94 @@
+\subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device / Device groups / Group
+administration commands / Device Migration}
+
+In some systems, there is a need to migrate a running virtual machine
+from one system to another. A running virtual machine has one or more
+passthrough virtio member devices attached to it. A passthrough device
+is entirely operated by the guest virtual machine. For example, with
+the SR-IOV group type, a group member (VF) may undergo virtio device
+initialization and reset flow and may also undergo PCI function level
+reset (FLR) flow. Such flows must comply with the PCI standard and the
+virtio specification; at the same time such flows must not obstruct
+the device migration flow. In such a scenario, a group owner device
+can provide the administration command interface to facilitate the device
+migration related operations.
+
+When a virtual machine migrates from one hypervisor to another hypervisor,
+these hypervisors are named the source and destination hypervisor respectively.
+In such a scenario, the source hypervisor administers the
+member device to suspend the device and preserve the device context.
+Subsequently, the destination hypervisor administers the member device to
+set up a device context and resume the member device. The source hypervisor
+reads the member device context and the destination hypervisor writes the member
+device context. The method to transfer the member device context from the source
+to the destination hypervisor is outside the scope of this specification.
+
+The member device can be in any of the three migration modes. The owner driver
+sets the member device in one of the following modes during device migration flow.
+
+\begin{tabularx}{\textwidth}{ |l||l|X| }
+\hline
+Value & Name & Description \\
+\hline \hline
+0x0   & Active &
+  It is the default mode after instantiation of the member device. \\
+\hline
+0x1   & Stop &
+ In this mode, the member device does not send any notifications,
+ and it does not access any driver memory.
+ The member device may receive driver notifications in this mode, and
+ the member device context and device configuration space may change. \\
+\hline
+0x2   & Freeze &
+ In this mode, the member device does not accept any driver notifications,
+ it ignores any device configuration space writes, and
+ the device context does not change. The
+ member device is not accessed in the system through the virtio interface. \\
+\hline
+\hline
+0x03-0xFF   & -    & reserved for future use \\
+\hline
+\end{tabularx}
+
+When the owner driver wants to stop the operation of the
+device, the owner driver sets the device mode to \field{Stop}. Once the
+device is in the \field{Stop} mode, the device does not initiate any notifications
+and does not access any driver memory. Since the member driver may still be
+active and may send further driver notifications to the device, the device
+context may be updated. When the member driver has stopped accessing the
+device, the owner driver sets the device to \field{Freeze} mode, indicating
+to the device that no more driver access occurs. In the \field{Freeze} mode,
+no more changes occur in the device context. At this point, the device ensures
+that there will not be any update to the device context.
+
+The member device has a device context which the owner driver can either
+read or write. The member device context consist of any device specific
+data which is needed by the device to resume its operation when the device mode
+is changed from \field{Stop} to \field{Active} or from \field{Freeze}
+to \field{Active}.
+
+Once the device context is read, it is cleared from the device. Typically, on
+the source hypervisor, the owner driver reads the device context once when
+the device is in \field{Active} or \field{Stop} mode and later once the member
+device is in \field{Freeze} mode.
+
+Typically, the device context is read and written one time on the source and
+the destination hypervisor respectively once the device is in \field{Freeze}
+mode. On the destination hypervisor, after writing the device context,
+when the device mode is set to \field{Active}, the device uses the most recently
+set device context and resumes the device operation.
+
+In an alternative flow, on the source hypervisor the owner driver may choose
+to read the device context a first time while the device is in \field{Active} mode
+and a second time once the device is in \field{Freeze} mode. Similarly, the owner
+driver on the destination hypervisor writes the device context a first time while
+the device is still running in \field{Active} mode on the source hypervisor and
+writes the device context a second time while the device is in \field{Freeze} mode.
+This flow may result in a very short setup time as the device context likely
+has minimal changes from the previously written device context. This flow may
+reduce the device migration time significantly and may have near constant
+device activation time regardless of the number of virtqueues, resources and
+passthrough devices in use by the migrating virtual machine.
+
+The owner driver can discard any partially read or written device context when
+any of the device migration flow should be aborted.
diff --git a/admin.tex b/admin.tex
index 0803c26..6eeef58 100644
--- a/admin.tex
+++ b/admin.tex
@@ -297,6 +297,7 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
 might differ between different group types.
 
 \input{admin-cmds-legacy-interface.tex}
+\input{admin-cmds-device-migration.tex}
 
 \devicenormative{\subsubsection}{Group administration commands}{Basic Facilities of a Virtio Device / Device groups / Group administration commands}
 
-- 
2.34.1



* [virtio-comment] [PATCH v1 2/8] admin: Redefine reserved2 as command specific output
  2023-10-08 11:25 [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands Parav Pandit
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration Parav Pandit
@ 2023-10-08 11:25 ` Parav Pandit
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 3/8] device-context: Define the device context fields for device migration Parav Pandit
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-08 11:25 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit

Currently, when a command wants to return two distinct types of data in
the result, such as one consumed by the driver and another to be zero
copied to some user buffers, the driver needs to prepare an
extra descriptor for the driver consumed field. When such a field is
<= 4 bytes, the extra descriptor is an overhead.

virtio_admin_command already has 4 bytes reserved in the device
writable area. Redefine them as command specific output.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
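A small C sketch of how a driver might consume the redefined field. It
models only the device-writable header shown in the diff; the struct name
and the helper are hypothetical, and it assumes a command that defines a
single le32 inside command_specific_output, which this patch does not
mandate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Device-writable header of the admin command after this patch.
 * Plain fixed-width types stand in for le16/u8; the struct name is
 * hypothetical.
 */
struct admin_cmd_dev_writable_hdr {
        uint16_t status;                        /* le16 */
        uint16_t status_qualifier;              /* le16 */
        uint8_t  command_specific_output[4];    /* was reserved2[4] */
};

/* Decode the 4 output bytes as one little-endian 32-bit value. */
static uint32_t admin_cmd_output_le32(const struct admin_cmd_dev_writable_hdr *h)
{
        assert(h != NULL);
        return (uint32_t)h->command_specific_output[0] |
               ((uint32_t)h->command_specific_output[1] << 8) |
               ((uint32_t)h->command_specific_output[2] << 16) |
               ((uint32_t)h->command_specific_output[3] << 24);
}
```

Because the output lives in the existing 8-byte header, the driver reads
it from the same descriptor as status and needs no extra descriptor.
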
 admin.tex | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/admin.tex b/admin.tex
index 6eeef58..c86813d 100644
--- a/admin.tex
+++ b/admin.tex
@@ -90,8 +90,7 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
         /* Device-writable part */
         le16 status;
         le16 status_qualifier;
-        /* unused, reserved for future extensions */
-        u8 reserved2[4];
+        u8 command_specific_output[4];
         u8 command_specific_result[];
 };
 \end{lstlisting}
@@ -192,11 +191,15 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
 \hline
 \end{tabularx}
 
-Each command uses a different \field{command_specific_data} and
-\field{command_specific_result} structures and the length of
+Each command uses a different \field{command_specific_data},
+\field{command_specific_output} and
+\field{command_specific_result} fields. The length of
 \field{command_specific_data} and \field{command_specific_result}
-depends on these structures and is described separately or is
-implicit in the structure description.
+depends on the command and is described separately or is
+implicit in the structure description. The \field{command_specific_output}
+describes any command specific output of up to 4 bytes in size. The
+\field{command_specific_output} contains one or more command specific
+fields.
 
 Before sending any group administration commands to the device, the driver
 needs to communicate to the device which commands it is going to
-- 
2.34.1



* [virtio-comment] [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-08 11:25 [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands Parav Pandit
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration Parav Pandit
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 2/8] admin: Redefine reserved2 as command specific output Parav Pandit
@ 2023-10-08 11:25 ` Parav Pandit
  2023-10-08 11:41   ` [virtio-comment] " Michael S. Tsirkin
  2023-11-02 14:21   ` Michael S. Tsirkin
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 4/8] admin: Add device migration admin commands Parav Pandit
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-08 11:25 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit

Define the device context and its fields for the purpose of device
migration. The device context is read and written by the owner driver
on the source and destination hypervisor respectively.

Device context fields are expected to grow rapidly after this initial
version to cover many details of the device.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Satananda Burla <sburla@marvell.com>
---
changelog:
v0->v1:
- enrich device context to cover feature bits, device configuration
  fields
- corrected alignment of device context fields
---
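To illustrate the TLV layout defined in this patch, here is a hedged C
sketch that walks a serialized device context and counts the fields of a
given type. It assumes the structures are packed with no padding between
field_count and the first TLV, and that all integers are little-endian;
the patch text does not state the padding rule, so treat that as an
assumption. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of type (le32) + reserved (le32) + length (le64). */
#define DEV_CTX_TLV_HDR_LEN 16u

static uint32_t rd_le32(const uint8_t *p)
{
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t rd_le64(const uint8_t *p)
{
        return (uint64_t)rd_le32(p) | ((uint64_t)rd_le32(p + 4) << 32);
}

/*
 * Count the TLV fields whose type equals 'type'.
 * Returns -1 if the buffer is truncated or a length overruns it.
 */
static long dev_ctx_count_type(const uint8_t *buf, size_t len, uint32_t type)
{
        size_t off = 4;         /* skip le32 field_count */
        long found = 0;
        uint32_t i, count;

        assert(buf != NULL);
        if (len < 4)
                return -1;
        count = rd_le32(buf);
        for (i = 0; i < count; i++) {
                uint64_t vlen;

                if (len - off < DEV_CTX_TLV_HDR_LEN)
                        return -1;      /* header does not fit */
                vlen = rd_le64(buf + off + 8);
                if (vlen > len - off - DEV_CTX_TLV_HDR_LEN)
                        return -1;      /* value does not fit */
                if (rd_le32(buf + off) == type)
                        found++;
                off += DEV_CTX_TLV_HDR_LEN + (size_t)vlen;
        }
        return found;
}
```

Several per-virtqueue fields (types 0x3 to 0x5) may appear more than once,
one per vq_index, which is why the sketch counts matches instead of
stopping at the first one.
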
 content.tex        |   1 +
 device-context.tex | 142 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 143 insertions(+)
 create mode 100644 device-context.tex

diff --git a/content.tex b/content.tex
index 0a62dce..2698931 100644
--- a/content.tex
+++ b/content.tex
@@ -503,6 +503,7 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
 UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
 
 \input{admin.tex}
+\input{device-context.tex}
 
 \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
 
diff --git a/device-context.tex b/device-context.tex
new file mode 100644
index 0000000..5611382
--- /dev/null
+++ b/device-context.tex
@@ -0,0 +1,142 @@
+\section{Device Context}\label{sec:Basic Facilities of a Virtio Device / Device Context}
+
+The device context holds the information that an owner driver can use
+to set up a member device and resume its operation. The device context
+of a member device is read or written by the owner driver using
+administration commands.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_field_tlv {
+        le32 type;
+        le32 reserved;
+        le64 length;
+        u8 value[];
+};
+
+struct virtio_dev_ctx {
+        le32 field_count;
+        struct virtio_dev_ctx_field_tlv fields[];
+};
+
+\end{lstlisting}
+
+The \field{struct virtio_dev_ctx} is the device context of a member device.
+The \field{field_count} indicates how many instances of
+\field{struct virtio_dev_ctx_field_tlv} are present.
+
+The \field{struct virtio_dev_ctx_field_tlv} consists of \field{type} indicating
+what data is contained in the \field{value} of length \field{length}.
+The valid values for \field{type} can be found in the following table:
+
+\begin{tabularx}{\textwidth}{ |l||l|X| }
+\hline
+type & Name & Description \\
+\hline \hline
+0x0 & VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG & Provides common configuration space of device for PCI transport \\
+\hline
+0x1 & VIRTIO_DEV_CTX_DEV_CFG_LAYOUT & Provides device specific configuration layout \\
+\hline
+0x2 & VIRTIO_DEV_CTX_DEV_FEATURES & Provides device features \\
+\hline
+0x3 & VIRTIO_DEV_CTX_PCI_VQ_CFG & Provides Virtqueue configuration for PCI transport \\
+\hline
+0x4 & VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG & Provides Queue run time state \\
+\hline
+0x5 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of virtqueue descriptors owned by device  \\
+\hline
+0x6 - 0xFFFFFFFF & - & Reserved for future types \\
+\hline
+\end{tabularx}
+
+\subsubsection{Device Context Fields}\label{sec:Basic Facilities of a Virtio Device / Device Context / Device Context Fields}
+
+\paragraph{PCI Common Configuration Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Common Configuration Context}
+
+For the field VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG, \field{type} is set to 0x0.
+The \field{value} is in format of \field{struct virtio_pci_common_cfg}.
+The \field{length} is the length of \field{struct virtio_pci_common_cfg}.
+
+\paragraph{Device Configuration Layout Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Device Configuration Layout Context}
+
+For the field VIRTIO_DEV_CTX_DEV_CFG_LAYOUT, \field{type} is set to 0x1.
+The \field{value} is in format of device specific configuration layout listed
+in each of the device's device configuration layout section.
+The \field{length} is the length of the device configuration layout data.
+
+\paragraph{Device Features Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Device Features Context}
+
+For the field VIRTIO_DEV_CTX_DEV_FEATURES, \field{type} is set to 0x2.
+The \field{value} is in format of device feature bits listed in
+\ref{sec:Basic Facilities of a Virtio Device / Feature Bits} in the format of \field{struct virtio_dev_ctx_features}.
+The \field{length} is the length of the device features.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_features {
+        le64 feature_bits[];
+};
+\end{lstlisting}
+
+\paragraph{PCI Virtqueue Configuration Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Virtqueue Configuration Context}
+
+For the field VIRTIO_DEV_CTX_PCI_VQ_CFG, \field{type} is set to 0x3.
+The \field{value} is in format of \field{struct virtio_dev_ctx_pci_vq_cfg}.
+The \field{length} is the length of \field{struct virtio_dev_ctx_pci_vq_cfg}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_pci_vq_cfg {
+        le16 vq_index;
+        le16 queue_size;
+        le16 queue_msix_vector;
+        le64 queue_desc;
+        le64 queue_driver;
+        le64 queue_device;
+};
+\end{lstlisting}
+
+One or multiple entries of PCI Virtqueue Configuration Context may exist, each such
+entry corresponds to a unique virtqueue identified by the \field{vq_index}.
+
+\paragraph{Virtqueue Split Mode Runtime Context}
+\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Virtqueue Split Mode Runtime Context}
+
+For the field VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG, \field{type} is set to 0x4.
+The \field{value} is in format of \field{struct virtio_dev_ctx_vq_split_runtime}.
+The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_runtime}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_vq_split_runtime {
+        le16 vq_index;
+        le16 dev_avail_idx;
+        u8 enabled;
+};
+\end{lstlisting}
+
+The \field{dev_avail_idx} indicates the next available index of the virtqueue from which
+the device must start processing the available ring.
+
+One or multiple entries of Virtqueue Split Mode Runtime Context may exist, each such
+entry corresponds to a unique virtqueue identified by the \field{vq_index}.
+
+\paragraph{Virtqueue Split Mode Device owned Descriptors Context}
+
+For the field VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC, \field{type} is set to 0x5.
+The \field{value} is in format of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
+The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
+
+\begin{lstlisting}
+struct virtio_dev_ctx_vq_split_dev_descs {
+        le16 vq_index;
+        le16 desc_count;
+        le16 desc_idx[];
+};
+\end{lstlisting}
+
+The \field{desc_idx} contains \field{desc_count} indices of the descriptors of the
+virtqueue identified by \field{vq_index} which are owned by the device.
+
+One or multiple entries of Virtqueue Split Mode Device owned Descriptors Context may exist, each such
+entry corresponds to a unique virtqueue identified by the \field{vq_index}.
-- 
2.34.1



* [virtio-comment] [PATCH v1 4/8] admin: Add device migration admin commands
  2023-10-08 11:25 [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands Parav Pandit
                   ` (2 preceding siblings ...)
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 3/8] device-context: Define the device context fields for device migration Parav Pandit
@ 2023-10-08 11:25 ` Parav Pandit
  2023-10-18  6:46   ` [virtio-comment] " Michael S. Tsirkin
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 5/8] admin: Add requirements of device migration commands Parav Pandit
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-08 11:25 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit

A passthrough device is mapped to the guest VM. A passthrough device
accessed by the driver can undergo its own device reset and, for PCI
transport, it can undergo its PCI FLR while the guest VM migration is
ongoing.
The passthrough device may not have any direct channel through which
device migration related administrative tasks can be done, and even if
it has one, such administrative tasks must not be interrupted by the
device reset or VF FLR flow initiated by the passthrough device.

Hence, the owner driver, which administers the member devices,
facilitates the device migration flow.

Add device migration administration commands that the owner driver can use
for the passthrough device.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Satananda Burla <sburla@marvell.com>
---
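As a rough illustration of the context size query described in this
patch, the C sketch below mocks a device with a fixed-size context and
reproduces the 12KB / 8KB / 4KB example from the text. The opcode values
are the ones given in the diff; struct mock_ctx_dev and both helpers are
hypothetical, and a real device may report an estimate that grows between
reads, which this mock does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcodes as proposed in this patch. */
enum {
        VIRTIO_ADMIN_CMD_DEV_MODE_GET     = 0x7,
        VIRTIO_ADMIN_CMD_DEV_MODE_SET     = 0x8,
        VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET = 0x9
};

/* Hypothetical mock: a device whose context size never changes. */
struct mock_ctx_dev {
        uint64_t total;         /* total context bytes */
        uint64_t consumed;      /* bytes already read by the owner driver */
};

/* Stand-in for the size get command: remaining bytes still to be read. */
static uint64_t mock_ctx_size_get(const struct mock_ctx_dev *d)
{
        assert(d != NULL && d->consumed <= d->total);
        return d->total - d->consumed;
}

/* Stand-in for the context read command: returns the bytes actually read. */
static uint64_t mock_ctx_read(struct mock_ctx_dev *d, uint64_t want)
{
        uint64_t left = mock_ctx_size_get(d);
        uint64_t n = want < left ? want : left;

        d->consumed += n;
        return n;
}
```

An owner driver would typically loop on the read until the reported
remaining size reaches zero, re-querying the size because the estimate is
not guaranteed to be exact.
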
 admin-cmds-device-migration.tex | 201 +++++++++++++++++++++++++++++++-
 1 file changed, 200 insertions(+), 1 deletion(-)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index f839af4..b7bfc09 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -65,7 +65,8 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 read or write. The member device context consist of any device specific
 data which is needed by the device to resume its operation when the device mode
 is changed from \field{Stop} to \field{Active} or from \field{Freeze}
-to \field{Active}.
+to \field{Active}. The device context is described in section
+\ref{sec:Basic Facilities of a Virtio Device / Device Context}.
 
 Once the device context is read, it is cleared from the device. Typically, on
 the source hypervisor, the owner driver reads the device context once when
@@ -92,3 +93,201 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 
 The owner driver can discard any partially read or written device context when
 any of the device migration flow should be aborted.
+
+The owner driver uses the following device migration group administration commands.
+
+\begin{enumerate}
+\item Device Mode Get Command
+\item Device Mode Set Command
+\item Device Context Size Get Command
+\item Device Context Read Command
+\item Device Context Write Command
+\item Device Context Discard Command
+\end{enumerate}
+
+These commands are currently only defined for the SR-IOV group type.
+
+\paragraph{Device Mode Get Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Mode Get Command}
+
+This command reads the mode of the device.
+For the command VIRTIO_ADMIN_CMD_DEV_MODE_GET, \field{opcode}
+is set to 0x7.
+The \field{group_member_id} refers to the member device to be accessed.
+This command has no command specific data.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_mode_get_result {
+        u8 mode;
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_mode_get_result}
+returned by the device, in which the device sets the \field{mode} value to
+one of \field{Active}, \field{Stop} or \field{Freeze}.
+
+\paragraph{Device Mode Set Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Mode Set Command}
+
+This command sets the mode of the device.
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_dev_mode_set_data} describing the new device mode.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_mode_set_data {
+        u8 mode;
+};
+\end{lstlisting}
+
+For the command VIRTIO_ADMIN_CMD_DEV_MODE_SET, \field{opcode} is set to 0x8.
+The \field{group_member_id} refers to the member device to be accessed.
+The \field{mode} is set to either \field{Active} or \field{Stop} or
+\field{Freeze}.
+
+This command has no command specific result. When the command completes
+successfully, the device is set to the new \field{mode}. When the command fails,
+the device stays in the previous mode.
+
+\paragraph{Device Context Size Get Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Size Get Command}
+
+This command returns the remaining estimated device context size. The
+driver can query the remaining estimated device context size
+for the current mode or for the \field{Freeze} mode. While
+reading the device context using VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the
+actual device context size may differ from the size returned by
+this command. After reading the device context using command
+VIRTIO_ADMIN_CMD_DEV_CTX_READ, the remaining estimated context size
+usually reduces by the amount of device context read by the driver using
+VIRTIO_ADMIN_CMD_DEV_CTX_READ command. If the device context is updated
+rapidly, the remaining estimated context size may also increase even after
+reading the device context using VIRTIO_ADMIN_CMD_DEV_CTX_READ command.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET, \field{opcode} is set to 0x9.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_size_get_data {
+        u8 freeze_mode;
+};
+\end{lstlisting}
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_dev_ctx_size_get_data}.
+When \field{freeze_mode} is set to 1, the device returns the estimated
+device context size that applies once the device is in \field{Freeze} mode.
+As the device context is read from the device, the remaining estimated
+context size may decrease. For example, when the member device mode is
+\field{Stop} and the device has an estimated total device context size
+of 12KB, the device returns 12KB for the first
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command. Once the driver has
+read 8KB of device context data using the
+VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the remaining data is
+4KB, hence the device returns 4KB in the subsequent
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_size_get_result {
+        le64 size;
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result} is in
+the format \field{struct virtio_admin_cmd_dev_ctx_size_get_result}.
+
+Once the device context is fully read, this command returns zero for
+\field{size} until the new device context is generated.
+
+\paragraph{Device Context Read Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Read Command}
+
+This command reads the current device context.
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_READ, \field{opcode} is set to 0xa.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command has no command specific data.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_rd_len {
+        le32 context_len;
+};
+
+struct virtio_admin_cmd_dev_ctx_rd_result {
+        u8 data[];
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_ctx_rd_result}
+returned by the device containing the device context data and
+\field{command_specific_output} is in the format of
+\field{struct virtio_admin_cmd_dev_ctx_rd_len} containing the length of
+context data returned by the device in the command response. When the length
+returned is zero or when the returned context data is less than the data requested
+by the driver, the device does not have any device context data left that the
+device can report; at this point the device context stream ends.
+
+The driver can read the whole device context data using one or multiple
+commands. When the device context does not fit in the
+\field{command_specific_result}, the driver reads the subsequent remaining
+bytes using one or more subsequent commands.
+
+\paragraph{Device Context Write Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Write Command}
+
+This command writes the device context data. The device context can be written
+only when the device mode is \field{Freeze}.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, \field{opcode}
+is set to 0xb.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_ctx_wr_data {
+        u8 data[];
+};
+\end{lstlisting}
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_dev_ctx_wr_data} containing
+the device context data to be written.
+
+This command has no command specific result.
+The device fails the command when it is executed while the device mode
+is other than \field{Freeze}.
+
+The written device context is effective when the device mode is changed
+from \field{Freeze} to \field{Stop} or from \field{Freeze} to \field{Active}.
+
+The driver can write the whole device context using one or multiple
+commands. When the device context does not fit in one command, the
+driver writes the subsequent remaining bytes using one or more subsequent
+commands.
+
+\paragraph{Device Context Discard Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Discard Command}
+
+This command discards any partial device context that is yet to be read
+by the driver and it also discards any device context that is partially written.
+This command can be used by the driver to abort any device context migration
+flow when partial device context read or write operations may have
+occurred.
+
+For the command VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD, \field{opcode}
+is set to 0xc.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command has no command specific data.
+This command has no command specific result.
+
+Once this command completes successfully, the device context is
+discarded. If the device context that is discarded was part of the write
+operation, once this command completes, the device functions as if the device
+context was never written. If the device context that is discarded was part
+of the read operation, once this command completes, the device functions as if
+the device context was never read in the given device mode. Once the device
+context is discarded, in subsequent VIRTIO_ADMIN_CMD_DEV_CTX_READ command,
+the device returns new device context entry. Once the device context is
+discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
+context.
-- 
2.34.1
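
Illustrative note on the size and read commands in this patch: a minimal
C sketch of the 12KB/8KB/4KB accounting and of a driver loop that stops on
a zero or short read. The struct fake_dev type and the fake_* helpers are
hypothetical stand-ins invented for this note; they are not a real driver
or device API, and the real device may report a size that differs from the
estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a member device; not a virtio API. */
struct fake_dev {
        const uint8_t *ctx;     /* full device context */
        uint64_t total;         /* total context size in bytes */
        uint64_t offset;        /* bytes already read by the driver */
};

/* Models VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET (opcode 0x9): remaining bytes. */
static uint64_t fake_ctx_size_get(const struct fake_dev *d)
{
        return d->total - d->offset;
}

/* Models VIRTIO_ADMIN_CMD_DEV_CTX_READ (opcode 0xa): returns context_len. */
static uint32_t fake_ctx_read(struct fake_dev *d, uint8_t *buf, uint32_t len)
{
        uint64_t left = d->total - d->offset;
        uint32_t n = left < len ? (uint32_t)left : len;

        memcpy(buf, d->ctx + d->offset, n);
        d->offset += n;
        return n;
}

/* Driver side: read until a zero or short read marks the end of the stream. */
static uint64_t read_whole_ctx(struct fake_dev *d, uint8_t *out, uint32_t chunk)
{
        uint64_t done = 0;

        for (;;) {
                uint32_t n = fake_ctx_read(d, out + done, chunk);

                done += n;
                if (n < chunk)
                        break;
        }
        return done;
}
```

The caller is assumed to size the output buffer from an earlier size query;
a real driver would also have to cope with the context growing between the
query and the reads.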


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* [virtio-comment] [PATCH v1 5/8] admin: Add requirements of device migration commands
  2023-10-08 11:25 [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands Parav Pandit
                   ` (3 preceding siblings ...)
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 4/8] admin: Add device migration admin commands Parav Pandit
@ 2023-10-08 11:25 ` Parav Pandit
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 6/8] admin: Add theory of operation for write recording commands Parav Pandit
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-08 11:25 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit

Add device and driver side requirements for the device migration
commands.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
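Illustrative note, not part of the spec text: a minimal C sketch of how a
device model might apply the VIRTIO_ADMIN_CMD_DEV_MODE_SET requirements
added below. The numeric mode values, struct dev_state and dev_mode_set()
are assumptions made for this note. The sketch also picks one ordering,
checking "same mode" before "partial context"; the requirements do not
state which check wins when both apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Numeric mode values are placeholders; this excerpt does not define them. */
enum dev_mode { MODE_ACTIVE = 0, MODE_STOP = 1, MODE_FREEZE = 2 };

/* Hypothetical device-side state used only for this illustration. */
struct dev_state {
        uint8_t mode;
        bool ctx_partially_read;
        bool ctx_partially_written;
};

/* Returns 0 on success, -1 on failure; on failure the mode is unchanged. */
static int dev_mode_set(struct dev_state *s, uint8_t new_mode)
{
        /* Any value other than Active, Stop or Freeze fails. */
        if (new_mode != MODE_ACTIVE && new_mode != MODE_STOP &&
            new_mode != MODE_FREEZE)
                return -1;

        /* Setting the mode the device is already in succeeds. */
        if (new_mode == s->mode)
                return 0;

        /* Active or Stop is refused while a context transfer is partial. */
        if ((new_mode == MODE_ACTIVE || new_mode == MODE_STOP) &&
            (s->ctx_partially_read || s->ctx_partially_written))
                return -1;

        s->mode = new_mode;
        return 0;
}
```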
 admin-cmds-device-migration.tex | 102 ++++++++++++++++++++++++++++++++
 admin.tex                       |  14 ++++-
 2 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index b7bfc09..88e1af9 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -291,3 +291,105 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 the device returns new device context entry. Once the device context is
 discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
 context.
+
+\devicenormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
+
+A device MUST either support all of, or none of
+VIRTIO_ADMIN_CMD_DEV_MODE_GET,
+VIRTIO_ADMIN_CMD_DEV_MODE_SET,
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET,
+VIRTIO_ADMIN_CMD_DEV_CTX_READ,
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE and
+VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD commands.
+
+When the device \field{mode} supplied in the command
+VIRTIO_ADMIN_CMD_DEV_MODE_SET is the same as the current mode of the device, the
+device MUST complete the command successfully.
+
+The device MUST fail the command VIRTIO_ADMIN_CMD_DEV_MODE_SET when the \field{mode}
+is other than \field{Active} or \field{Stop} or \field{Freeze}.
+
+When changing the device mode using the command VIRTIO_ADMIN_CMD_DEV_MODE_SET,
+if the command fails, the device MUST retain the current device mode.
+
+The device MUST fail VIRTIO_ADMIN_CMD_DEV_MODE_SET command when \field{mode}
+is set to \field{Active} or \field{Stop} while the device context is
+partially read or written using VIRTIO_ADMIN_CMD_DEV_CTX_READ or
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE commands respectively.
+
+When VIRTIO_ADMIN_CMD_DEV_CTX_READ command is received multiple times
+in a given mode, and when the complete device context is already read by the
+driver, on subsequent reception of command VIRTIO_ADMIN_CMD_DEV_CTX_READ,
+the device MUST complete the command successfully with
+\field{context_len} set to zero.
+
+The device MUST support reading the device context when the device is
+in any mode \field{Active} or \field{Stop} or \field{Freeze} using command
+VIRTIO_ADMIN_CMD_DEV_CTX_READ.
+
+When the device is in any mode, and if the device context is read
+partially using VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the device MUST discard
+the device context when VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD command is executed;
+in subsequent execution of VIRTIO_ADMIN_CMD_DEV_CTX_READ and
+VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET, the device MUST return the device context
+and the remaining estimated device context size respectively for the
+current mode as if VIRTIO_ADMIN_CMD_DEV_CTX_READ was never received by the
+device for the current device mode.
+
+The device MUST support writing the complete device context multiple times
+by the command VIRTIO_ADMIN_CMD_DEV_CTX_WRITE.
+
+The device MUST fail VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command when the device
+mode is not \field{Freeze}.
+
+When the device is in \field{Freeze} mode, and if any device context is
+written partially by VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, the device MUST discard
+the device context when VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD
+command is executed, i.e. the device functions as if the command
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE was never received.
+
+For the SR-IOV group type, when the device context is read using
+VIRTIO_ADMIN_CMD_DEV_CTX_READ from one device and written to another device
+using VIRTIO_ADMIN_CMD_DEV_CTX_WRITE, the driver MUST read and write
+device context only if the device PCI subsystem vendor id and device id
+match for both the devices.
+
+For the SR-IOV group type, a function level reset (FLR) operation MUST set the
+device mode to \field{Active}.
+
+For the SR-IOV group type, when the device is in \field{Freeze} mode, any
+write access to configuration space MUST NOT update any fields and any
+configuration space read MAY return any value.
+
+For the SR-IOV group type, regardless of the member device \field{mode}, all
+the PCI transport level registers MUST be always accessible and the member device
+MUST function the same way for all the PCI transport level registers
+regardless of the member device mode.
+
+For the SR-IOV group type, for the VIRTIO_PCI_CAP_PCI_CFG capability area,
+the device MUST ignore writes when the device mode is set to \field{Freeze}
+and on receiving the reads, the device MUST function same regardless of the
+device mode is \field{Active} or \field{Stop} or \field{Freeze}.
+
+\drivernormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
+
+The driver SHOULD read the complete device context using one or multiple
+VIRTIO_ADMIN_CMD_DEV_CTX_READ commands.
+
+The driver MAY write the device context before changing the device mode from
+\field{Freeze} to \field{Stop} or from \field{Freeze} to \field{Active};
+the driver MUST write a complete device context using one or multiple
+VIRTIO_ADMIN_CMD_DEV_CTX_WRITE commands.
+
+The driver MUST NOT change the device mode to \field{Stop} or \field{Active}
+in the command VIRTIO_ADMIN_CMD_DEV_MODE_SET when device context is
+partially written.
+
+For the SR-IOV group type, the driver SHOULD NOT access device configuration
+space described in section
+\ref{sec:Basic Facilities of a Virtio Device / Device Configuration Space}
+when the device mode is set to \field{Freeze}.
+
+For the SR-IOV group type, the driver MUST NOT write into the
+VIRTIO_PCI_CAP_PCI_CFG capability area when the device mode is set to
+\field{Freeze}.
diff --git a/admin.tex b/admin.tex
index c86813d..3429c4e 100644
--- a/admin.tex
+++ b/admin.tex
@@ -126,7 +126,19 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
 \hline
 0x0006 & VIRTIO_ADMIN_CMD_LEGACY_NOTIFY_INFO & Query the notification region information \\
 \hline
-0x0007 - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
+0x0007 & VIRTIO_ADMIN_CMD_DEV_MODE_GET & Query the device mode \\
+\hline
+0x0008 & VIRTIO_ADMIN_CMD_DEV_MODE_SET & Set the device mode \\
+\hline
+0x0009 & VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET & Query the device context size \\
+\hline
+0x000a & VIRTIO_ADMIN_CMD_DEV_CTX_READ & Read the device context data \\
+\hline
+0x000b & VIRTIO_ADMIN_CMD_DEV_CTX_WRITE & Write the device context data \\
+\hline
+0x000c & VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD & Clear the device context data \\
+\hline
+0x000d - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
 \hline
 0x8000 - 0xFFFF & - & Reserved for future commands (possibly using a different structure)    \\
 \hline
-- 
2.34.1



* [virtio-comment] [PATCH v1 6/8] admin: Add theory of operation for write recording commands
  2023-10-08 11:25 [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands Parav Pandit
                   ` (4 preceding siblings ...)
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 5/8] admin: Add requirements of device migration commands Parav Pandit
@ 2023-10-08 11:25 ` Parav Pandit
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 7/8] admin: Add " Parav Pandit
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 8/8] admin: Add requirements of write reporting commands Parav Pandit
  7 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-08 11:25 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit

During a device migration flow (typically in a precopy phase of the
live migration), a device may write to the guest memory. Some
iommu/hypervisor may not be able to track these written pages.
These pages need to be migrated from source to destination hypervisor.

A device which writes to these pages provides the page address record
of the writes to the owner device. The owner driver starts write
recording for the device and queries all the page addresses written by
the device.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Satananda Burla <sburla@marvell.com>
---
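Illustrative note, not part of the spec text: a minimal C sketch of the
read-and-clear behavior described in the theory of operation, where records
returned to the driver are removed from the device. struct fake_recorder,
record_write() and records_read() are hypothetical names for this note, and
the fixed-size array is only a convenience for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_RECORDS 16

/* Hypothetical device-side store of written IOVAs; not part of the spec. */
struct fake_recorder {
        uint64_t iova[MAX_RECORDS];
        uint32_t count;
};

/* Device records one write; returns -1 if the illustrative store is full. */
static int record_write(struct fake_recorder *r, uint64_t iova)
{
        if (r->count == MAX_RECORDS)
                return -1;
        r->iova[r->count++] = iova;
        return 0;
}

/* Models a read: copies up to max records to out and clears them. */
static uint32_t records_read(struct fake_recorder *r, uint64_t *out,
                             uint32_t max)
{
        uint32_t n = r->count < max ? r->count : max;

        memcpy(out, r->iova, n * sizeof(out[0]));
        /* Drop the returned records so they cannot be read a second time. */
        memmove(r->iova, r->iova + n, (r->count - n) * sizeof(r->iova[0]));
        r->count -= n;
        return n;
}
```

An owner driver would call the read in a loop during precopy and move the
device to Stop or Freeze once the returned count is zero or small.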
 admin-cmds-device-migration.tex | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index 88e1af9..e98d552 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -94,6 +94,21 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 The owner driver can discard any partially read or written device context when
 any of the device migration flow should be aborted.
 
+During the device migration flow, a passthrough device may write data to the
+guest virtual machine memory, a source hypervisor needs to keep track of these
+written memory to migrate such memory to destination hypervisor.
+Some systems may not be able to keep track of such memory write addresses at
+hypervisor level. In such a scenario, a device records and reports these
+written memory addresses to the owner device. Such an address is named as
+IO virtual address (IOVA). The owner driver enables write recording for one or
+more IOVA ranges per device during device migration flow. The owner driver
+periodically queries these written IOVA records from the device. As the driver
+reads the written IOVA records, the device clears those records from the device.
+Once the device reports zero or small number of written IOVA records, the device
+mode is set to \field{Stop} or \field{Freeze}. Once the device is set to \field{Stop}
+or \field{Freeze} mode, and once all the IOVA records are read, the driver stops
+the write recording in the device.
+
 The owner driver uses following device migration group administration commands.
 
 \begin{enumerate}
-- 
2.34.1



* [virtio-comment] [PATCH v1 7/8] admin: Add write recording commands
  2023-10-08 11:25 [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands Parav Pandit
                   ` (5 preceding siblings ...)
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 6/8] admin: Add theory of operation for write recording commands Parav Pandit
@ 2023-10-08 11:25 ` Parav Pandit
  2023-10-08 11:52   ` [virtio-comment] " Michael S. Tsirkin
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 8/8] admin: Add requirements of write reporting commands Parav Pandit
  7 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-08 11:25 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit

When migrating a virtual machine with passthrough
virtio devices, the virtio device may write into the guest
memory. Some systems may not be able to keep track of these
pages efficiently.

To facilitate such a system, a device provides the record
of pages which are written by the device. In one use case, these
commands connect to the vfio framework at [1].

The owner driver configures the member device with a list of address
ranges for which it expects write recording and reporting by the device.

The owner driver periodically queries the written pages address record
which gets cleared from the device upon reading it.

When the write records reduce over time, at some point write recording
is stopped after the device mode is set to FREEZE.

[1] https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1207

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Satananda Burla <sburla@marvell.com>
---
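Illustrative note, not part of the spec text: a minimal C sketch of how a
driver might decode supported_iova_page_size_bitmap, where bit 0 stands for
4KB and each higher bit doubles the page size. The helper names are
hypothetical, and the bitmap is assumed to be already converted from le32
to CPU byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Page size in bytes for a bit position: bit 0 is 4KB, each bit doubles it. */
static uint64_t iova_page_size_for_bit(unsigned int bit)
{
        return UINT64_C(4096) << bit;
}

/* Smallest page size advertised in the bitmap, or 0 if no bit is set. */
static uint64_t smallest_supported_page_size(uint32_t bitmap)
{
        for (unsigned int bit = 0; bit < 32; bit++)
                if (bitmap & (UINT32_C(1) << bit))
                        return iova_page_size_for_bit(bit);
        return 0;
}

/* Returns 1 if page_size matches a bit that is set in the bitmap, else 0. */
static int page_size_supported(uint32_t bitmap, uint64_t page_size)
{
        for (unsigned int bit = 0; bit < 32; bit++)
                if (iova_page_size_for_bit(bit) == page_size)
                        return (bitmap >> bit) & 1;
        return 0;
}
```

A driver could use the smallest advertised size as page_size in the start
command to get the finest tracking the device offers.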
 admin-cmds-device-migration.tex | 146 ++++++++++++++++++++++++++++++--
 admin.tex                       |  10 ++-
 2 files changed, 146 insertions(+), 10 deletions(-)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index e98d552..49835eb 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -97,15 +97,16 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 During the device migration flow, a passthrough device may write data to the
 guest virtual machine memory, a source hypervisor needs to keep track of these
 written memory to migrate such memory to destination hypervisor.
-Some systems may not be able to keep track of such memory write addresses at
-hypervisor level. In such a scenario, a device records and reports these
-written memory addresses to the owner device. Such an address is named as
-IO virtual address (IOVA). The owner driver enables write recording for one or
-more IOVA ranges per device during device migration flow. The owner driver
-periodically queries these written IOVA records from the device. As the driver
-reads the written IOVA records, the device clears those records from the device.
-Once the device reports zero or small number of written IOVA records, the device
-mode is set to \field{Stop} or \field{Freeze}. Once the device is set to \field{Stop}
+Some systems may not be able to keep track of such
+memory write addresses at hypervisor level. In such a scenario, a device
+records and reports these written memory addresses to the owner device. Such an
+address is named an IO virtual address (IOVA). The owner driver enables write
+recording for one or more IOVA ranges per device during device migration
+flow. The owner driver periodically queries these written IOVA records from
+the device. As the driver reads the written IOVA records,
+the device clears those records from the device. Once the device reports
+zero or a small number of written IOVA records, the device is set to
+\field{Stop} or \field{Freeze} mode. Once the device is set to \field{Stop}
 or \field{Freeze} mode, and once all the IOVA records are read, the driver stops
 the write recording in the device.
 
@@ -118,6 +119,10 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 \item Device Context Read Command
 \item Device Context Write Command
 \item Device Context Discard Command
+\item Device Write Record Capabilities Query Command
+\item Device Write Records Start Command
+\item Device Write Records Stop Command
+\item Device Write Records Read Command
 \end{enumerate}
 
 These commands are currently only defined for the SR-IOV group type.
@@ -307,6 +312,129 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
 context.
 
+\paragraph{Device Write Record Capabilities Query Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Record Capabilities Query Command}
+
+This command reads the device write record capabilities.
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY, \field{opcode}
+is set to 0xd.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_dev_write_record_cap_result {
+        le32 supported_iova_page_size_bitmap;
+        le32 supported_iova_ranges;
+};
+\end{lstlisting}
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_write_record_cap_result}
+returned by the device. The \field{supported_iova_page_size_bitmap} indicates
+the granularity at which the device can record IOVA ranges. The minimum
+granularity is 4KB. Bit 0 corresponds to 4KB, bit 1 corresponds to 8KB, bit 31
+corresponds to 8TB. The device supports at least one page granularity.
+The device supports one or more IOVA page granularities; for each IOVA page
+granularity, the device sets the corresponding bit in the
+\field{supported_iova_page_size_bitmap}. The \field{supported_iova_ranges}
+indicates how many unique (non overlapping) IOVA ranges can be recorded by
+the device.
+
+\paragraph{Device Write Records Start Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Start Command}
+
+This command starts the write recording in the device for the specified IOVA
+ranges.
+
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START, \field{opcode}
+is set to 0xe.
+The \field{group_member_id} refers to the member device to be accessed.
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_write_record_start_data}.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_write_record_start_entry {
+        le64 iova;
+        le64 page_count;
+};
+
+struct virtio_admin_cmd_write_record_start_data {
+        le64 page_size;
+        le32 count;
+        u8 reserved[4];
+        struct virtio_admin_cmd_write_record_start_entry entries[];
+};
+
+\end{lstlisting}
+
+The \field{count} is set to indicate the number of valid \field{entries}.
+The \field{iova} indicates the start IOVA address. The \field{page_count}
+indicates the number of pages of size \field{page_size} starting from \field{iova}
+to record for write reporting. VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
+command contains unique, i.e. non overlapping, IOVA range entries.
+Whenever a memory write occurs by the device in the supplied IOVA range, the
+device records the actual IOVA and number of bytes written to the IOVA.
+These write records can be read by
+the driver using VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command.
+
+This command has no command specific result.
+
+\paragraph{Device Write Records Stop Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Stop Command}
+
+This command stops the write recording in the device for IOVA ranges
+which were previously started using VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
+command.
+
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP, \field{opcode}
+is set to 0xf.
+The \field{group_member_id} refers to the member device to be accessed.
+
+This command does not have any command specific data.
+This command has no command specific result.
+
+\paragraph{Device Write Records Read Command}
+\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Read Command}
+
+This command reads the device write records for which the write recording was
+previously started using VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
+
+For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ, \field{opcode}
+is set to 0x10.
+The \field{group_member_id} refers to the member device to be accessed.
+
+\begin{lstlisting}
+struct virtio_admin_cmd_write_records_read_data {
+        le64 iova;
+        le64 length;
+};
+
+struct virtio_admin_cmd_dev_write_records_cnt {
+        le32 count;
+};
+
+struct virtio_admin_cmd_dev_write_records_result {
+        le64 iova_entries[];
+};
+\end{lstlisting}
+
+The \field{command_specific_data} is in the format
+\field{struct virtio_admin_cmd_write_records_read_data}. The driver
+sets the \field{iova} indicating the start IOVA address for up to the
+\field{length} number of bytes. The supplied IOVA range is the same as or smaller
+than the range supplied when write recording is started by the driver
+in VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
+
+When the command completes successfully, \field{command_specific_result}
+is in the format \field{struct virtio_admin_cmd_dev_write_records_result}
+and \field{command_specific_output} is in the format of
+\field{struct virtio_admin_cmd_dev_write_records_cnt} containing the number
+of write records returned by the device. When the command completes
+successfully, the write records which are returned in the result are
+cleared from the device and the same records cannot be read again. When new
+writes occur in the same IOVA range or in a different one, those records can be
+read as new write records.
+
 \devicenormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
 
 A device MUST either support all of, or none of
diff --git a/admin.tex b/admin.tex
index 3429c4e..cffd85e 100644
--- a/admin.tex
+++ b/admin.tex
@@ -138,7 +138,15 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
 \hline
 0x000c & VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD & Clear the device context data \\
 \hline
-0x000d - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
+0x000d & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY & Query Write recording capabilities \\
+\hline
+0x000e & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START & Start Write recording in the device \\
+\hline
+0x000f & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP & Stop all write recording in the device \\
+\hline
+0x0010 & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ & Read and clear write records from the device \\
+\hline
+0x0011 - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
 \hline
 0x8000 - 0xFFFF & - & Reserved for future commands (possibly using a different structure)    \\
 \hline
-- 
2.34.1



* [virtio-comment] [PATCH v1 8/8] admin: Add requirements of write reporting commands
  2023-10-08 11:25 [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands Parav Pandit
                   ` (6 preceding siblings ...)
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 7/8] admin: Add " Parav Pandit
@ 2023-10-08 11:25 ` Parav Pandit
  7 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-08 11:25 UTC (permalink / raw)
  To: virtio-comment, mst, cohuck; +Cc: sburla, shahafs, maorg, yishaih, Parav Pandit

Add device and driver requirements for the write reporting commands.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
 admin-cmds-device-migration.tex | 36 +++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
index 49835eb..09e772a 100644
--- a/admin-cmds-device-migration.tex
+++ b/admin-cmds-device-migration.tex
@@ -514,6 +514,34 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 and on receiving the reads, the device MUST function same regardless of the
 device mode is \field{Active} or \field{Stop} or \field{Freeze}.
 
+A device MUST either support all of, or none of
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START,
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP and
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ commands.
+
+If the device supports the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY
+command, the device MUST set at least one bit in the
+\field{supported_iova_page_size_bitmap} and set a non-zero value in the
+\field{supported_iova_ranges}.
+
+The device MUST fail the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ and
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP commands
+if the write recording is not started by the driver.
+
+The device MUST fail VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command
+if the write recording is not started.
+
+For the SR-IOV group type, for the VF member device, VF function level
+reset (FLR) MUST NOT stop write recording on the VF device and it MUST NOT
+clear any write records already gathered by the owner device.
+
+The device MUST clear the write records which are returned in the
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ result. After command completion
+of VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ, if a new write record is created
+for the same IOVA range, the device MUST report such a write record as
+a new entry.
+
 \drivernormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
 
 The driver SHOULD read the complete device context using one or multiple
@@ -536,3 +564,11 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
 For the SR-IOV group type, the driver MUST NOT write into the
 VIRTIO_PCI_CAP_PCI_CFG capability area when the device mode is set to
 \field{Freeze}.
+
+The driver MUST NOT invoke VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
+for overlapping IOVA ranges; each IOVA range supplied in a command or
+across multiple commands MUST be unique.
+
+If the write recording is started by the driver using
+VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START commands, the driver MUST explicitly
+stop the write recording using the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP command.
-- 
2.34.1
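
The driver requirement above, that IOVA ranges passed to VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START do not overlap, could be checked on the driver side roughly as follows. This is a minimal sketch, not part of the patch: `struct wr_range` mirrors the `iova`/`page_count` pair of `struct virtio_admin_cmd_write_record_start_entry` in host byte order, the function names are hypothetical, and address arithmetic is assumed not to overflow 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian mirror of one start entry (hypothetical name). */
struct wr_range {
    uint64_t iova;
    uint64_t page_count;
};

/* True if the half-open byte ranges described by a and b intersect.
 * Assumes iova + page_count * page_size does not overflow. */
static bool wr_ranges_overlap(const struct wr_range *a,
                              const struct wr_range *b,
                              uint64_t page_size)
{
    uint64_t a_end = a->iova + a->page_count * page_size;
    uint64_t b_end = b->iova + b->page_count * page_size;

    return a->iova < b_end && b->iova < a_end;
}

/* True if no two of the count ranges overlap (simple O(n^2) scan). */
static bool wr_ranges_unique(const struct wr_range *r, size_t count,
                             uint64_t page_size)
{
    for (size_t i = 0; i < count; i++)
        for (size_t j = i + 1; j < count; j++)
            if (wr_ranges_overlap(&r[i], &r[j], page_size))
                return false;
    return true;
}
```

A driver that issues several start commands would have to run the same check against the ranges it has already started, since the requirement also applies across commands.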



^ permalink raw reply related	[flat|nested] 341+ messages in thread

* [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 3/8] device-context: Define the device context fields for device migration Parav Pandit
@ 2023-10-08 11:41   ` Michael S. Tsirkin
  2023-10-09  4:15     ` Parav Pandit
  2023-10-09 10:34     ` Zhu, Lingshan
  2023-11-02 14:21   ` Michael S. Tsirkin
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-08 11:41 UTC (permalink / raw)
  To: Parav Pandit; +Cc: virtio-comment, cohuck, sburla, shahafs, maorg, yishaih

On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
> Define the device context and its fields for purpose of device
> migration. The device context is read and written by the owner driver
> on source and destination hypervisor respectively.
> 
> Device context fields will experience a rapid growth post this initial
> version to cover many details of the device.
> 
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Signed-off-by: Satananda Burla <sburla@marvell.com>
> ---
> changelog:
> v0->v1:
> - enrich device context to cover feature bits, device configuration
>   fields
> - corrected alignment of device context fields
> ---
>  content.tex        |   1 +
>  device-context.tex | 142 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 143 insertions(+)
>  create mode 100644 device-context.tex
> 
> diff --git a/content.tex b/content.tex
> index 0a62dce..2698931 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -503,6 +503,7 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
>  UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>  
>  \input{admin.tex}
> +\input{device-context.tex}
>  
>  \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
>  
> diff --git a/device-context.tex b/device-context.tex
> new file mode 100644
> index 0000000..5611382
> --- /dev/null
> +++ b/device-context.tex
> @@ -0,0 +1,142 @@
> +\section{Device Context}\label{sec:Basic Facilities of a Virtio Device / Device Context}
> +
> +The device context holds the information that an owner driver can use
> +to set up a member device and resume its operation. The device context
> +of a member device is read or written by the owner driver using
> +administration commands.
> +
> +\begin{lstlisting}
> +struct virtio_dev_ctx_field_tlv {
> +        le32 type;
> +        le32 reserved;
> +        le64 length;
> +        u8 value[];
> +};
> +
> +struct virtio_dev_ctx {
> +        le32 field_count;
> +        struct virtio_dev_ctx_field_tlv fields[];
> +};
> +
> +\end{lstlisting}
> +
> +The \field{struct virtio_dev_ctx} is the device context of a member device.
> +The \field{field_count} indicates how many instances of
> +\field{struct virtio_dev_ctx_field_tlv} are present.
> +
> +The \field{struct virtio_dev_ctx_field_tlv} consists of \field{type} indicating
> +what data is contained in the \field{value} of length \field{length}.
> +The valid values for \field{type} can be found in the following table:
> +
> +\begin{tabularx}{\textwidth}{ |l||l|X| }
> +\hline
> +type & Name & Description \\
> +\hline \hline
> +0x0 & VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG & Provides common configuration space of device for PCI transport \\
> +\hline
> +0x1 & VIRTIO_DEV_CTX_DEV_CFG_LAYOUT & Provides device specific configuration layout \\
> +\hline
> +0x2 & VIRTIO_DEV_CTX_DEV_FEATURES & Provides device features \\
> +\hline
> +0x3 & VIRTIO_DEV_CTX_PCI_VQ_CFG & Provides Virtqueue configuration for PCI transport \\
> +\hline
> +0x4 & VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG & Provides Queue run time state \\
> +\hline
> +0x5 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of virtqueue descriptors owned by device  \\
> +\hline
> +0x6 - 0xFFFFFFFF & - & Reserved for future types \\
> +\hline
> +\end{tabularx}


I don't think this is enough, e.g. virtio net has internal state
controlled through CVQ commands. How do you intend to address/migrate
these?

> +\subsubsection{Device Context Fields}\label{sec:Basic Facilities of a Virtio Device / Device Context / Device Context Fields}
> +
> +\paragraph{PCI Common Configuration Context}
> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Common Configuration Context}
> +
> +For the field VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG, \field{type} is set to 0x0.
> +The \field{value} is in format of \field{struct virtio_pci_common_cfg}.
> +The \field{length} is the length of \field{struct virtio_pci_common_cfg}.
> +
> +\paragraph{Device Configuration Layout Context}
> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Device Configuration Layout Context}
> +
> +For the field VIRTIO_DEV_CTX_DEV_CFG_LAYOUT, \field{type} is set to 0x1.
> +The \field{value} is in format of device specific configuration layout listed
> +in each of the device's device configuration layout section.
> +The \field{length} is the length of the device configuration layout data.

Unclear. I am guessing it's doing things like setting up RO
fields? This needs to be specified per device really.
Also how some fields behave might depend on features.

> +
> +\paragraph{Device Features Context}
> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Device Features Context}
> +
> +For the field VIRTIO_DEV_CTX_DEV_FEATURES, \field{type} is set to 0x2.
> +The \field{value} is in format of device feature bits listed in
> +\ref{sec:Basic Facilities of a Virtio Device / Feature Bits} in the format of \field{struct virtio_dev_ctx_features}.
> +The \field{length} is the length of the device features.
> +
> +\begin{lstlisting}
> +struct virtio_dev_ctx_features {
> +        le64 feature_bits[];
> +};
> +\end{lstlisting}
> +
> +\paragraph{PCI Virtqueue Configuration Context}
> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Virtqueue Configuration Context}
> +
> +For the field VIRTIO_DEV_CTX_PCI_VQ_CFG, \field{type} is set to 0x3.
> +The \field{value} is in format of \field{struct virtio_dev_ctx_pci_vq_cfg}.
> +The \field{length} is the length of \field{struct virtio_dev_ctx_pci_vq_cfg}.
> +
> +\begin{lstlisting}
> +struct virtio_dev_ctx_pci_vq_cfg {
> +        le16 vq_index;
> +        le16 queue_size;
> +        le16 queue_msix_vector;
> +        le64 queue_desc;
> +        le64 queue_driver;
> +        le64 queue_device;
> +};
> +\end{lstlisting}
> +
> +One or multiple entries of PCI Virtqueue Configuration Context may exist, each such
> +entry corresponds to a unique virtqueue identified by the \field{vq_index}.
> +
> +\paragraph{Virtqueue Split Mode Runtime Context}
> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Virtqueue Split Mode Runtime Context}
> +
> +For the field VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG, \field{type} is set to 0x4.
> +The \field{value} is in format of \field{struct virtio_dev_ctx_vq_split_runtime}.
> +The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_runtime}.
> +
> +\begin{lstlisting}
> +struct virtio_dev_ctx_vq_split_runtime {
> +        le16 vq_index;
> +        le16 dev_avail_idx;
> +        u8 enabled;
> +};
> +\end{lstlisting}
> +
> +The \field{dev_avail_idx} indicates the next available index of the virtqueue from which
> +the device must start processing the available ring.
> +
> +One or multiple entries of Virtqueue Split Mode Runtime Context may exist, each such
> +entry corresponds to a unique virtqueue identified by the \field{vq_index}.
> +
> +\paragraph{Virtqueue Split Mode Device owned Descriptors Context}
> +
> +For the field VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC, \field{type} is set to 0x5.
> +The \field{value} is in format of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
> +The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
> +
> +\begin{lstlisting}
> +struct virtio_dev_ctx_vq_split_dev_descs {
> +        le16 vq_index;
> +        le16 desc_count;
> +        le16 desc_idx[];
> +};
> +\end{lstlisting}
> +
> +The \field{desc_idx} contains indices of the descriptors in \field{desc_count} of a
> +virtqueue identified by \field{vq_index} which is owned by the device.
> +
> +One or multiple entries of Virtqueue Split Mode Device owned Descriptors Context may exist, each such
> +entry corresponds to a unique virtqueue identified by the \field{vq_index}.
> -- 
> 2.34.1
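
To make the TLV layout quoted above concrete, here is a rough sketch of how an owner driver might walk a device context buffer. It assumes the layout is packed exactly as listed (a le32 `field_count`, then for each field a le32 `type`, le32 `reserved`, le64 `length` and `length` bytes of `value`, with no padding between fields); the patch does not state the padding rules, so that is an assumption. The helper names `rd_le32`, `rd_le64` and `dev_ctx_count_type` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian decoders, independent of host byte order. */
static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t rd_le64(const uint8_t *p)
{
    return (uint64_t)rd_le32(p) | (uint64_t)rd_le32(p + 4) << 32;
}

#define TLV_HDR_LEN 16u /* type (4) + reserved (4) + length (8) */

/* Count the fields of wanted_type in a device context buffer.
 * Returns -1 if the buffer is too short for what it claims to hold. */
static int dev_ctx_count_type(const uint8_t *buf, size_t len,
                              uint32_t wanted_type)
{
    if (len < 4)
        return -1;

    uint32_t field_count = rd_le32(buf);
    size_t off = 4;
    int hits = 0;

    for (uint32_t i = 0; i < field_count; i++) {
        if (len - off < TLV_HDR_LEN)
            return -1;
        uint32_t type = rd_le32(buf + off);
        uint64_t vlen = rd_le64(buf + off + 8);
        off += TLV_HDR_LEN;
        if (vlen > len - off)
            return -1;
        if (type == wanted_type)
            hits++;
        off += (size_t)vlen; /* skip over value[] */
    }
    return hits;
}
```

Counting by type matches the text's note that several entries of one type (for example one VIRTIO_DEV_CTX_PCI_VQ_CFG per virtqueue) may be present.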



^ permalink raw reply	[flat|nested] 341+ messages in thread

* [virtio-comment] Re: [PATCH v1 7/8] admin: Add write recording commands
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 7/8] admin: Add " Parav Pandit
@ 2023-10-08 11:52   ` Michael S. Tsirkin
  2023-10-09  4:14     ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-08 11:52 UTC (permalink / raw)
  To: Parav Pandit; +Cc: virtio-comment, cohuck, sburla, shahafs, maorg, yishaih

On Sun, Oct 08, 2023 at 02:25:54PM +0300, Parav Pandit wrote:
> When migrating a virtual machine with passthrough
> virtio devices, the virtio device may write into the guest
> memory. Some systems may not be able to keep track of these
> pages efficiently.
> 
> To facilitate such a system, a device provides the record
> of pages which are written by the device. In one use case, these
> commands connect to the vfio framework at [1].
> 
> The owner driver configures the member device for list of address
> ranges for which it expects write recording and reporting by the device.
> 
> The owner driver periodically queries the written pages address record
> which gets cleared from the device upon reading it.
> 
> When the number of write records reduces over time, at some point write recording
> is stopped after the device mode is set to FREEZE.
> 
> [1] https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1207
> 
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Signed-off-by: Satananda Burla <sburla@marvell.com>
> ---
>  admin-cmds-device-migration.tex | 146 ++++++++++++++++++++++++++++++--
>  admin.tex                       |  10 ++-
>  2 files changed, 146 insertions(+), 10 deletions(-)
> 
> diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
> index e98d552..49835eb 100644
> --- a/admin-cmds-device-migration.tex
> +++ b/admin-cmds-device-migration.tex
> @@ -97,15 +97,16 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
>  During the device migration flow, a passthrough device may write data to the
>  guest virtual machine memory, a source hypervisor needs to keep track of these
>  written memory to migrate such memory to destination hypervisor.
> -Some systems may not be able to keep track of such memory write addresses at
> -hypervisor level. In such a scenario, a device records and reports these
> -written memory addresses to the owner device. Such an address is named as
> -IO virtual address (IOVA). The owner driver enables write recording for one or
> -more IOVA ranges per device during device migration flow. The owner driver
> -periodically queries these written IOVA records from the device. As the driver
> -reads the written IOVA records, the device clears those records from the device.
> -Once the device reports zero or small number of written IOVA records, the device
> -mode is set to \field{Stop} or \field{Freeze}. Once the device is set to \field{Stop}
> +Some systems may not be able to keep track of such
> +memory write addresses at the hypervisor level. In such a scenario, a device
> +records and reports these written memory addresses to the owner device.


what does it mean to record them?

> Such an
> +address is named as IO virtual address (IOVA).

I don't know what this has to do with IOVA. For that matter
everything would have to be "IOVA". The spec calls these physical
addresses; let's stick to that.


> The owner driver enables write
> +recording for one or more IOVA ranges per device during device migration
> +flow. The owner driver periodically queries these written IOVA records from
> +the device.

Periodic reads without any indication are the only option then?

> As the driver reads the written IOVA records,
> +the device clears those records from the device. Once the device reports
> +zero or small number of written IOVA records, the device is set to
> +\field{Stop} or \field{Freeze} mode. Once the device is set to \field{Stop}
>  or \field{Freeze} mode, and once all the IOVA records are read, the driver stops
>  the write recording in the device.


It is not great that you are rewriting text you just wrote in patch 1
here. Please find a way not to make reviewers read everything twice.

> @@ -118,6 +119,10 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
>  \item Device Context Read Command
>  \item Device Context Write Command
>  \item Device Context Discard Command
> +\item Device Write Record Capabilities Query Command
> +\item Device Write Records Start Command
> +\item Device Write Records Stop Command
> +\item Device Write Records Read Command
>  \end{enumerate}
>  
>  These commands are currently only defined for the SR-IOV group type.
> @@ -307,6 +312,129 @@ \subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device /
>  discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
>  context.
>  
> +\paragraph{Device Write Record Capabilities Query Command}
> +\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Record Capabilities Query Command}
> +
> +This command reads the device write record capabilities.
> +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY, \field{opcode}
> +is set to 0xd.
> +The \field{group_member_id} refers to the member device to be accessed.
> +
> +\begin{lstlisting}
> +struct virtio_admin_cmd_dev_write_record_cap_result {
> +        le32 supported_iova_page_size_bitmap;
> +        le32 supported_iova_ranges;
> +};
> +\end{lstlisting}
> +
> +When the command completes successfully, \field{command_specific_result}
> +is in the format \field{struct virtio_admin_cmd_dev_write_record_cap_result}
> +returned by the device. The \field{supported_iova_page_size_bitmap} indicates
> +the granularity at which the device can record IOVA ranges. The minimum
> +granularity is 4KB. Bit 0 corresponds to 4KB, bit 1 corresponds to 8KB, and so on,
> +doubling per bit, so bit 31 corresponds to 8TB. The device supports at least
> +one IOVA page granularity; for each supported IOVA page
> +granularity, the device sets the corresponding bit in the
> +\field{supported_iova_page_size_bitmap}. The \field{supported_iova_ranges}
> +indicates how many unique (non overlapping) IOVA ranges can be recorded by
> +the device.

What role does this granularity play? I see no mention of it down the
road.


> +
> +\paragraph{Device Write Records Start Command}
> +\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Start Command}
> +
> +This command starts the write recording in the device for the specified IOVA
> +ranges.
> +
> +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START, \field{opcode}
> +is set to 0xe.
> +The \field{group_member_id} refers to the member device to be accessed.
> +
> +The \field{command_specific_data} is in the format
> +\field{struct virtio_admin_cmd_write_record_start_data}.
> +
> +\begin{lstlisting}
> +struct virtio_admin_cmd_write_record_start_entry {
> +        le64 iova;
> +        le64 page_count;
> +};
> +
> +struct virtio_admin_cmd_write_record_start_data {
> +        le64 page_size;
> +        le32 count;
> +        u8 reserved[4];
> +        struct virtio_admin_cmd_write_record_start_entry entries[];
> +};
> +
> +\end{lstlisting}
> +
> +The \field{count} is set to indicate number of valid \field{entries}.
> +The \field{iova} indicates the start IOVA address. The \field{page_count}
> +indicates number of pages of size \field{page_size} starting from \field{iova}
> +to record for write reporting. VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> +command contains unique i.e. non overlapping IOVA range entries.
> +Whenever a memory write occurs by the device in the supplied IOVA range, the
> +device records the actual IOVA and number of bytes written to the IOVA.
> +These write records can be read by
> +the driver using the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command.
> +
> +This command has no command specific result.
> +
> +\paragraph{Device Write Record Stop Command}
> +\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Record Stop Command}
> +
> +This command stops the write recording in the device for IOVA ranges
> +which were previously started using VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> +command.
> +
> +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP, \field{opcode}
> +is set to 0xf.
> +The \field{group_member_id} refers to the member device to be accessed.
> +
> +This command does not have any command specific data.
> +This command has no command specific result.
> +
> +\paragraph{Device Write Records Read Command}
> +\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Write Records Read Command}
> +
> +This command reads the device write records for which the write recording is
> +previously started using VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
> +
> +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ, \field{opcode}
> +is set to 0x10.
> +The \field{group_member_id} refers to the member device to be accessed.
> +
> +\begin{lstlisting}
> +struct virtio_admin_cmd_write_records_read_data {
> +        le64 iova;
> +        le64 length;
> +};
> +
> +struct virtio_admin_cmd_dev_write_records_cnt {
> +        le32 count;
> +};
> +
> +struct virtio_admin_cmd_dev_write_records_result {
> +        le64 iova_entries[];
> +};
> +\end{lstlisting}
> +
> +The \field{command_specific_data} is in the format
> +\field{struct virtio_admin_cmd_write_records_read_data}. The driver
> +sets the \field{iova} indicating the start IOVA address for up to the
> +\field{length} number of bytes. The supplied IOVA range is the same as or smaller
> +than the range supplied when write recording is started by the driver
> +in the VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.

Seems pretty sparse. Lots of hypervisors chose to implement
a bit-per-page strategy.

> +
> +When the command completes successfully, \field{command_specific_result}
> +is in the format \field{struct virtio_admin_cmd_dev_write_records_result}
> +and \field{command_specific_result} is in format of
> +\field{struct virtio_admin_cmd_dev_write_records_cnt} containing number
> +of write records returned by the device.

what are these records though? 


> When the command completes
> +successfully, the write records which are returned in the result are
> +cleared from the device and the same records cannot be read again. When new
> +writes occur at the same IOVA range or at a different one, those records can be read
> +as new write records.


this last sentence just confuses.

> +
>  \devicenormative{\paragraph}{Device Migration}{Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration}
>  
>  A device MUST either support all of, or none of
> diff --git a/admin.tex b/admin.tex
> index 3429c4e..cffd85e 100644
> --- a/admin.tex
> +++ b/admin.tex
> @@ -138,7 +138,15 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
>  \hline
>  0x000c & VIRTIO_ADMIN_CMD_DEV_CTX_DISCARD & Clear the device context data \\
>  \hline
> -0x000d - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
> +0x000d & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY & Query Write recording capabilities \\
> +\hline
> +0x000e & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START & Start Write recording in the device \\
> +\hline
> +0x000f & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP & Stop all write recording in the device \\
> +\hline
> +0x0010 & VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ & Read and clear write records from the device \\
> +\hline
> +0x0011 - 0x7FFF & - & Commands using \field{struct virtio_admin_cmd}    \\
>  \hline
>  0x8000 - 0xFFFF & - & Reserved for future commands (possibly using a different structure)    \\
>  \hline
> -- 
> 2.34.1



^ permalink raw reply	[flat|nested] 341+ messages in thread

* [virtio-comment] RE: [PATCH v1 7/8] admin: Add write recording commands
  2023-10-08 11:52   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-10-09  4:14     ` Parav Pandit
  2023-10-09 10:57       ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-09  4:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Sunday, October 8, 2023 5:23 PM
> 
> On Sun, Oct 08, 2023 at 02:25:54PM +0300, Parav Pandit wrote:
> > When migrating a virtual machine with passthrough virtio devices, the
> > virtio device may write into the guest memory. Some systems may not be
> > able to keep track of these pages efficiently.
> >
> > To facilitate such a system, a device provides the record of pages
> > which are written by the device. In one use case, this commands
> > connect to the vfio framework at [1].
> >
> > The owner driver configures the member device for list of address
> > ranges for which it expects write recording and reporting by the device.
> >
> > The owner driver periodically queries the written pages address record
> > which gets cleared from the device upon reading it.
> >
> > When the write records reduces over the time, at one point write
> > recording is stopped after the device mode is set to FREEZE.
> >
> > [1]
> > https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vf
> > io.h#L1207
> >
> > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > ---
> >  admin-cmds-device-migration.tex | 146
> ++++++++++++++++++++++++++++++--
> >  admin.tex                       |  10 ++-
> >  2 files changed, 146 insertions(+), 10 deletions(-)
> >
> > diff --git a/admin-cmds-device-migration.tex
> > b/admin-cmds-device-migration.tex index e98d552..49835eb 100644
> > --- a/admin-cmds-device-migration.tex
> > +++ b/admin-cmds-device-migration.tex
> > @@ -97,15 +97,16 @@ \subsubsection{Device Migration}\label{sec:Basic
> > Facilities of a Virtio Device /  During the device migration flow, a
> > passthrough device may write data to the  guest virtual machine
> > memory, a source hypervisor needs to keep track of these  written memory to
> migrate such memory to destination hypervisor.
> > -Some systems may not be able to keep track of such memory write
> > addresses at -hypervisor level. In such a scenario, a device records
> > and reports these -written memory addresses to the owner device. Such
> > an address is named as -IO virtual address (IOVA). The owner driver
> > enables write recording for one or -more IOVA ranges per device during
> > device migration flow. The owner driver -periodically queries these
> > written IOVA records from the device. As the driver -reads the written IOVA
> records, the device clears those records from the device.
> > -Once the device reports zero or small number of written IOVA records,
> > the device -mode is set to \field{Stop} or \field{Freeze}. Once the
> > device is set to \field{Stop}
> > +Some systems may not be able to keep track of such memory writes at
> > +addresses at hypervisor level. In such a scenario, a device records
> > +and reports these written memory addresses to the owner device.
> 
> 
> what does it mean to record them?
>
Maybe we can use a different verb than record. Maybe tracking?
Record means that the device accumulates the addresses of the written pages, and when the driver queries, it returns these addresses and clears them internally within the device.
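
The accumulate, return and clear behaviour described here can be illustrated with a small model. This is only a sketch of the semantics, not a proposed device implementation: the fixed-capacity log, the folding of repeated writes to one page into a single record, and the names `wr_log`, `wr_log_record` and `wr_log_read_and_clear` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WR_LOG_CAPACITY 8 /* arbitrary size for the illustration */

struct wr_log {
    uint64_t addr[WR_LOG_CAPACITY]; /* addresses of written pages */
    size_t used;
};

/* Device side: remember that a page was written. A page that is already
 * logged is not logged twice. Returns 0 on success, -1 if the log is full. */
static int wr_log_record(struct wr_log *log, uint64_t page_addr)
{
    for (size_t i = 0; i < log->used; i++)
        if (log->addr[i] == page_addr)
            return 0;
    if (log->used == WR_LOG_CAPACITY)
        return -1;
    log->addr[log->used++] = page_addr;
    return 0;
}

/* Driver query: copy out up to max records and drop exactly those records,
 * so the same record is never returned twice. Returns the number copied. */
static size_t wr_log_read_and_clear(struct wr_log *log, uint64_t *out,
                                    size_t max)
{
    size_t n = log->used < max ? log->used : max;

    for (size_t i = 0; i < n; i++)
        out[i] = log->addr[i];
    for (size_t i = n; i < log->used; i++)
        log->addr[i - n] = log->addr[i];
    log->used -= n;
    return n;
}
```

In this model, a page written again after a read shows up as a fresh record on the next read.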
 
> > Such an
> > +address is named as IO virtual address (IOVA).
> 
> I don't know what does this have to do with IOVA. For that matter everything
> would have to be "IOVA". Spec calls these physical address and let's stick to
> that.
> 
Makes sense.
The device itself has no knowledge of IOVA.
I will rename it.

> 
> > The owner driver enables write
> > +recording for one or more IOVA ranges per device during device
> > +migration flow. The owner driver periodically queries these written
> > +IOVA records from the device.
> 
> periodical reads without any indication are the only option then?
>
At least for now, this is the starting point. Software stacks such as QEMU do it periodically.
When a new use case arises, maybe it can be extended.
 
> > As the driver reads the written IOVA records,
> > +the device clears those records from the device. Once the device
> > +reports zero or small number of written IOVA records, the device is
> > +set to \field{Stop} or \field{Freeze} mode. Once the device is set to
> > +\field{Stop}
> >  or \field{Freeze} mode, and once all the IOVA records are read, the
> > driver stops  the write recording in the device.
> 
> 
> it is not great that you are rewriting text you just wrote in patch 1 here. pls find
> a way not to make reviewers read everything twice.
> 
There is a small duplication of one line explaining the mode change; the rest is contextual to the write recording.
Merging with the text of patch_1 was slightly complicated to read, so one sentence is duplicated.

I will check again whether extending the patch_1 text is easier to read.

> > @@ -118,6 +119,10 @@ \subsubsection{Device Migration}\label{sec:Basic
> > Facilities of a Virtio Device /  \item Device Context Read Command
> > \item Device Context Write Command  \item Device Context Discard
> > Command
> > +\item Device Write Record Capabilities Query Command \item Device
> > +Write Records Start Command \item Device Write Records Stop Command
> > +\item Device Write Records Read Command
> >  \end{enumerate}
> >
> >  These commands are currently only defined for the SR-IOV group type.
> > @@ -307,6 +312,129 @@ \subsubsection{Device Migration}\label{sec:Basic
> > Facilities of a Virtio Device /  discarded, subsequent
> > VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
> context.
> >
> > +\paragraph{Device Write Record Capabilities Query Command}
> > +\label{par:Basic Facilities of a Virtio Device / Device groups /
> > +Group administration commands / Device Migration / Device Write
> > +Record Capabilities Query Command}
> > +
> > +This command reads the device write record capabilities.
> > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
> > +\field{opcode} is set to 0xd.
> > +The \field{group_member_id} refers to the member device to be accessed.
> > +
> > +\begin{lstlisting}
> > +struct virtio_admin_cmd_dev_write_record_cap_result {
> > +        le32 supported_iova_page_size_bitmap;
> > +        le32 supported_iova_ranges;
> > +};
> > +\end{lstlisting}
> > +
> > +When the command completes successfully,
> > +\field{command_specific_result} is in the format \field{struct
> > +virtio_admin_cmd_dev_write_record_cap_result}
> > +returned by the device. The \field{supported_iova_page_size_bitmap}
> > +indicates the granularity at which the device can record IOVA ranges.
> > +the minimum granularity can be 4KB. Bit 0 corresponds to 4KB, bit 1
> > +corresponds to 8KB, bit 31 corresponds to 4TB. The device supports at least
> one page granularity.
> > +The device support one or more IOVA page granularity; for each IOVA
> > +page granularity, the device sets corresponding bit in the
> > +\field{supported_iova_page_size_bitmap}. The
> > +\field{supported_iova_ranges} indicates how many unique (non
> > +overlapping) IOVA ranges can be recorded by the device.
> 
> what role does this granularity play? i see no mention of it down the road.
> 
The page_size in struct virtio_admin_cmd_write_record_start_data must match one of the granularities reported above.
I missed it. Will add in v2.
This is a very useful comment.
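
To make the granularity rule concrete, here is a minimal sketch of decoding \field{supported_iova_page_size_bitmap} and checking a candidate page_size against it. It assumes that bit n corresponds to a page size of 4KB << n, which matches "bit 0 corresponds to 4KB, bit 1 corresponds to 8KB" in the quoted text; under that reading bit 31 works out to 8TB rather than the 4TB mentioned there. The function names are hypothetical.

```python
from typing import List

BASE_PAGE_SHIFT = 12  # assumption: bit 0 corresponds to 4KB (1 << 12)


def supported_page_sizes(bitmap: int) -> List[int]:
    """Return the page sizes, in bytes, for every bit set in the le32 bitmap."""
    if not 0 <= bitmap < (1 << 32):
        raise ValueError("bitmap must fit in 32 bits")
    return [
        1 << (BASE_PAGE_SHIFT + bit)
        for bit in range(32)
        if bitmap & (1 << bit)
    ]


def page_size_is_supported(bitmap: int, page_size: int) -> bool:
    """Check a page_size intended for the start command against the bitmap."""
    return page_size in supported_page_sizes(bitmap)
```

A driver following the rule stated in the reply would pick page_size for VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START only from the values this decoding yields.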

> 
> > +
> > +\paragraph{Device Write Records Start Command} \label{par:Basic
> > +Facilities of a Virtio Device / Device groups / Group administration
> > +commands / Device Migration / Device Write Records Start Command}
> > +
> > +This command starts the write recording in the device for the
> > +specified IOVA ranges.
> > +
> > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START,
> > +\field{opcode} is set to 0xe.
> > +The \field{group_member_id} refers to the member device to be accessed.
> > +
> > +The \field{command_specific_data} is in the format \field{struct
> > +virtio_admin_cmd_write_record_start_data}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_admin_cmd_write_record_start_entry {
> > +        le64 iova;
> > +        le64 page_count;
> > +};
> > +
> > +struct virtio_admin_cmd_write_record_start_data {
> > +        le64 page_size;
> > +        le32 count;
> > +        u8 reserved[4];
> > +        struct virtio_admin_cmd_write_record_start_entry entries[];
> > +};
> > +
> > +\end{lstlisting}
> > +
> > +The \field{count} is set to indicate number of valid \field{entries}.
> > +The \field{iova} indicates the start IOVA address. The
> > +\field{page_count} indicates number of pages of size
> > +\field{page_size} starting from \field{iova} to record for write
> > +reporting. VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > +command contains unique i.e. non overlapping IOVA range entries.
> > +Whenever a memory write occurs by the device in the supplied IOVA
> > +range, the device records the actual IOVA and number of bytes written to the
> IOVA.
> > +These write records can be read by the the driver using
> > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command.
> > +
> > +This command has no command specific result.
> > +
> > +\paragraph{Device Write Record Stop Command} \label{par:Basic
> > +Facilities of a Virtio Device / Device groups / Group administration
> > +commands / Device Migration / Device Write Record Stop Command}
> > +
> > +This command stops the write recording in the device for IOVA ranges
> > +which were previously started using
> > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > +command.
> > +
> > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP,
> > +\field{opcode} is set to 0xf.
> > +The \field{group_member_id} refers to the member device to be accessed.
> > +
> > +This command does not have any command specific data.
> > +This command has no command specific result.
> > +
> > +\paragraph{Device Write Records Read Command} \label{par:Basic
> > +Facilities of a Virtio Device / Device groups / Group administration
> > +commands / Device Migration / Device Write Records Read Command}
> > +
> > +This command reads the device write records for which the write
> > +recording is previously started using
> VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
> > +
> > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ,
> > +\field{opcode} is set to 0x10.
> > +The \field{group_member_id} refers to the member device to be accessed.
> > +
> > +\begin{lstlisting}
> > +struct virtio_admin_cmd_write_records_read_data {
> > +        le64 iova;
> > +        le64 length;
> > +};
> > +
> > +struct virtio_admin_cmd_dev_write_records_cnt {
> > +        le32 count;
> > +};
> > +
> > +struct virtio_admin_cmd_dev_write_records_result {
> > +        le64 iova_entries[];
> > +};
> > +\end{lstlisting}
> > +
> > +The \field{command_specific_data} is in the format \field{struct
> > +virtio_admin_cmd_write_records_read_data}. The driver sets the \field
> > +{iova} indicating the start IOVA address for up to the \field{length}
> > +number of bytes. The supplied IOVA range same or smaller than the
> > +range supplied when write recording is started by the driver in
> > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
> 
> Seems pretty sparse. Lots of hypervisors chose to implement a bit per page
> strategy.
The addresses returned in this command's result will feed into such a per-page bit.
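
As an illustration of how returned addresses could feed a bit-per-page structure, the sketch below sets one bit per written page in a hypervisor-side bitmap. It assumes each entry of \field{iova_entries} is an address inside a tracked range starting at `base`; the helper name `mark_dirty` and its parameters are hypothetical and not taken from the patch.

```python
from typing import List


def mark_dirty(bitmap: bytearray, base: int, page_size: int,
               addresses: List[int]) -> None:
    """Set the bit of every page that contains one of the written addresses."""
    total_pages = len(bitmap) * 8
    for addr in addresses:
        if addr < base:
            raise ValueError("address below the tracked range")
        page = (addr - base) // page_size
        if page >= total_pages:
            raise ValueError("address beyond the tracked range")
        bitmap[page // 8] |= 1 << (page % 8)
```

Several records that fall in the same page collapse into a single bit, which is why a sparse list of addresses and a dense bitmap can coexist.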

> 
> > +
> > +When the command completes successfully,
> > +\field{command_specific_result} is in the format \field{struct
> > +virtio_admin_cmd_dev_write_records_result}
> > +and \field{command_specific_result} is in format of \field{struct
> > +virtio_admin_cmd_dev_write_records_cnt} containing number of write
> > +records returned by the device.
> 
> what are these records though?
> 
It is struct virtio_admin_cmd_dev_write_records_result.
I will rephrase it to link to struct virtio_admin_cmd_dev_write_records_result.

> 
> > When the command completes
> > +successfully, the write records which are returned in the result are
> > +cleared from the device and same records cannot be read again. When
> > +new writes occur at same IOVA range or at different once, those
> > +records can be read as new write records.
> 
> 
> this last sentence just confuses.
> 
How about keeping just the rewritten text below?
When the command completes successfully, the write records returned in the result are cleared from the device.
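
The read-and-clear behaviour in that sentence can be modelled with a small toy class. This is only a sketch of the semantics: `WriteRecorder` is a hypothetical stand-in for the device, and de-duplicating repeated writes to the same address is an assumption, not something the patch states.

```python
from typing import List, Set


class WriteRecorder:
    """Toy model of a device that accumulates written addresses."""

    def __init__(self) -> None:
        self._records: Set[int] = set()

    def record_write(self, addr: int) -> None:
        # Assumption: repeated writes to one address yield a single record.
        self._records.add(addr)

    def read_records(self, start: int, length: int) -> List[int]:
        """Return the records in [start, start + length) and clear them."""
        hits = sorted(a for a in self._records if start <= a < start + length)
        self._records.difference_update(hits)
        return hits
```

A second read over the same range returns nothing until new writes occur, which mirrors the shortened wording.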

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-08 11:41   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-10-09  4:15     ` Parav Pandit
  2023-10-09 15:54       ` Michael S. Tsirkin
  2023-10-09 10:34     ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-09  4:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

Hi Michael,

> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Sunday, October 8, 2023 5:12 PM

[..]
> > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline type & Name &
> > +Description \\ \hline \hline
> > +0x0 & VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG & Provides common
> > +configuration space of device for PCI transport \\ \hline
> > +0x1 & VIRTIO_DEV_CTX_DEV_CFG_LAYOUT & Provides device specific
> > +configuration layout \\ \hline
> > +0x2 & VIRTIO_DEV_CTX_DEV_FEATURES & Provides device features \\
> > +\hline
> > +0x3 & VIRTIO_DEV_CTX_PCI_VQ_CFG & Provides Virtqueue configuration
> > +for PCI transport \\ \hline
> > +0x4 & VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG & Provides Queue run
> time
> > +state \\ \hline
> > +0x5 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of
> > +virtqueue descriptors owned by device  \\ \hline
> > +0x6 - 0xFFFFFFFF & - & Reserved for future types \\ \hline
> > +\end{tabularx}
> 
> 
> I don't think this is enough, e.g. virtio net has internal state
> controlled thought CVQ commands. how do you intend to address/migrate
> these?
>
After this series, the 32-bit type field will be split into two ranges.
The first range (existing) covers content common to all device types.
The second range contains device-specific content: non-internal fields, such as fields set up by the guest directly over CVQ.
 
> > +\subsubsection{Device Context Fields}\label{sec:Basic Facilities of a Virtio
> Device / Device Context / Device Context Fields}
> > +
> > +\paragraph{PCI Common Configuration Context}
> > +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context
> Fields/ PCI Common Configuration Context}
> > +
> > +For the field VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG, \field{type}
> is set to 0x0.
> > +The \field{value} is in format of \field{struct virtio_pci_common_cfg}.
> > +The \field{length} is the length of \field{struct virtio_pci_common_cfg}.
> > +
> > +\paragraph{Device Configuration Layout Context}
> > +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context
> Fields/ Device Configuration Layout Context}
> > +
> > +For the field VIRTIO_DEV_CTX_DEV_CFG_LAYOUT, \field{type} is set to 0x1.
> > +The \field{value} is in format of device specific configuration layout listed
> > +in each of the device's device configuration layout section.
> > +The \field{length} is the length of the device configuration layout data.
> 
> Unclear. I am guessing it's doing things like setting up RO
> fields? This needs to be specified per device really.
> Also how some fields behave might depend on features.
In practice, fields in this area do not change often, but they can; for example, the link status/speed of a net device.
So it is not RO per se.

Regarding whether it should be device specific or just a config_length blob, I think a config_length blob is fine for device_context use.
This is because the migration driver does not need to parse any of these fields.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration Parav Pandit
@ 2023-10-09  8:49   ` Jason Wang
  2023-10-09 10:06     ` Parav Pandit
  2023-10-09 12:02     ` Parav Pandit
  0 siblings, 2 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-09  8:49 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment, mst, cohuck, sburla, shahafs, maorg, yishaih,
	Zhu Lingshan

Adding LingShan.

Parav, if you want any specific people to comment, please do cc them.

On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
>
> One or more passthrough PCI VF devices are ubiquitous for virtual
> machines usage using generic kernel framework such as vfio [1].

Mentioning a specific subsystem in a specific OS may mislead the user
to think it can only work in that setup. Let's not do that, virtio is
not only used for Linux and VFIO.

>
> A passthrough PCI VF device is fully owned by the virtual machine
> device driver.

Is this true? Even VFIO needs to mediate PCI stuff. Or how do you
define "passthrough" here?

> This passthrough device controls its own device
> reset flow, basic functionality as PCI VF function level reset

How about other PCI stuff? Or Why is FLR special?

> and rest of the virtio device functionality such as control vq,

What do you mean by "rest of"? Which part is not controlled and why?

> config space access, data path descriptors handling.
>
> Additionally, VM live migration using a precopy method is also widely used.

Why is this mentioned here?

>
> To support a VM live migration for such passthrough virtio devices,
> the owner PCI PF device administers the device migration flow.

Well, if this is specific only to PCI SR-IOV, I'd move it to the PCI
transport part. But I guess not.

>
> This patch introduces the basic theory of operation which describes the flow
> and supporting administration commands.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/vfio.h?h=v6.1.47
>
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> ---
>  admin-cmds-device-migration.tex | 94 +++++++++++++++++++++++++++++++++
>  admin.tex                       |  1 +
>  2 files changed, 95 insertions(+)
>  create mode 100644 admin-cmds-device-migration.tex
>
> diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
> new file mode 100644
> index 0000000..f839af4
> --- /dev/null
> +++ b/admin-cmds-device-migration.tex
> @@ -0,0 +1,94 @@
> +\subsubsection{Device Migration}\label{sec:Basic Facilities of a Virtio Device / Device groups / Group
> +administration commands / Device Migration}
> +
> +In some systems, there is a need to migrate a running virtual machine
> +from one to another system. A running virtual machine has one or more
> +passthrough virtio member devices attached to it. A passthrough device
> +is entirely operated by the guest virtual machine. For example, with
> +the SR-IOV group type, group member (VF) may undergo virtio device
> +initialization and reset flow

What do you mean by "reset flow"? It does not look like terminology
defined in the PCI spec, and Google gives me nothing about it.

> and may also undergo PCI function level
> +reset(FLR) flow.

Why is only FLR special here? I've asked FRS but you ignore the question.

> Such flows must comply to the PCI standard and also
> +virtio specification;

This seems unnecessary and obvious as it applies to all other PCI and
virtio functionality.

What's more, for the things that need to be synchronized, I don't see
any descriptions in this patch. And if no synchronization is needed, why not?

> at the same time such flows must not obstruct
> +the device migration flow. In such a scenario, a group owner device
> +can provide the administration command interface to facilitate the device
> +migration related operations.
> +
> +When a virtual machine migrates from one hypervisor to another hypervisor,
> +these hypervisors are named as source and destination hypervisor respectively.
> +In such a scenario, a source hypervisor administers the
> +member device to suspend the device and preserves the device context.
> +Subsequently, a destination hypervisor administers the member device to
> +setup a device context and resumes the member device. The source hypervisor
> +reads the member device context and the destination hypervisor writes the member
> +device context. The method to transfer the member device context from the source
> +to the destination hypervisor is outside the scope of this specification.
> +
> +The member device can be in any of the three migration modes. The owner driver
> +sets the member device in one of the following modes during device migration flow.
> +
> +\begin{tabularx}{\textwidth}{ |l||l|X| }
> +\hline
> +Value & Name & Description \\
> +\hline \hline
> +0x0   & Active &
> +  It is the default mode after instantiation of the member device. \\

I don't think we ever define "instantiation" anywhere.

> +\hline
> +0x1   & Stop &
> + In this mode, the member device does not send any notifications,
> + and it does not access any driver memory.

What's the meaning of "driver memory"?

And stop seems to be a source of inflight buffers.

> + The member device may receive driver notifications in this mode,

What's the meaning of "receive"? For example if the device can still
process buffers, "stop" is not accurate.

> + the member device context

I don't think we define "device context" anywhere.

>and device configuration space may change. \\
> +\hline

I still don't get why we need a "stop" state in the middle.

> +0x2   & Freeze &
> + In this mode, the member device does not accept any driver notifications,

This is too vague. Is the device allowed to be frozen in the middle
of any virtio or PCI operations?

For example, in the middle of feature negotiation etc. It may cause
implementation specific sub-states which can't be migrated easily.

And what's more, the above state machine seems to be virtio specific,
but you don't explain the interaction with the device status state
machine. For example, what happens if the driver wants to reset but
the device is in stop mode? You told me it is addressed in your series,
but it does not look so. Once you try to describe that, you are actually
trying to connect states between the two state machines.

> + it ignores any device configuration space writes,

How about read and the device configuration changes?

> + the device do not have any changes in the device context. The
> + member device is not accessed in the system through the virtio interface. \\

But accessible via PCI interface?

For example, what happens if we want to freeze during FLR? Does the
hypervisor need to wait for the FLR to be completed?

> +\hline
> +\hline
> +0x03-0xFF   & -    & reserved for future use \\
> +\hline
> +\end{tabularx}
> +
> +When the owner driver wants to stop the operation of the
> +device, the owner driver sets the device mode to \field{Stop}. Once the
> +device is in the \field{Stop} mode, the device does not initiate any notifications
> +or does not access any driver memory. Since the member driver may be still
> +active which may send further driver notifications to the device, the device
> +context may be updated. When the member driver has stopped accessing the
> +device, the owner driver sets the device to \field{Freeze} mode indicating
> +to the device that no more driver access occurs. In the \field{Freeze} mode,
> +no more changes occur in the device context. At this point, the device ensures
> +that there will not be any update to the device context.

What is missing here:

1) whether these are virtio-specific states or not
2) if they are virtio-specific states, whether and how to synchronize with
transport-specific interfaces, and why
3) whether active can go directly to freeze, and why

> +
> +The member device has a device context which the owner driver can either
> +read or write. The member device context consist of any device specific
> +data which is needed by the device to resume its operation when the device mode

This is too vague. There are states that are not suitable for cmd/queue
for sure. I'd split it into

1) common states: virtqueue, dirty pages
2) device specific states: defined by each device

> +is changed from \field{Stop} to \field{Active} or from \field{Freeze}
> +to \field{Active}.
> +
> +Once the device context is read, it is cleared from the device.

This is horrible, it means we can't easily

1) re-try the migration
2) recover from migration failure

> Typically, on
> +the source hypervisor, the owner driver reads the device context once when
> +the device is in \field{Active} or \field{Stop} mode and later once the member
> +device is in \field{Freeze} mode.

Why is a read needed while the device context can still change? Or is the
dirty page part of the device context?

> +
> +Typically, the device context is read and written one time on the source and
> +the destination hypervisor respectively once the device is in \field{Freeze}
> +mode. On the destination hypervisor, after writing the device context,
> +when the device mode set to \field{Active}, the device uses the most recently
> +set device context and resumes the device operation.

There's no context sequence, so this is obvious. It's the semantic of
all other existing interfaces.

> +
> +In an alternative flow, on the source hypervisor the owner driver may choose
> +to read the device context first time while the device is in \field{Active} mode
> +and second time once the device is in \field{Freeze} mode.

Who is going to synchronize the device context with possible
configuration from the driver?

> Similarly, on the
> +destination hypervisor writes the device context first time while the device
> +is still running in \field{Active} mode on the source hypervisor and writes
> +the device context second time while the device is in \field{Freeze} mode.
> +This flow may result in very short setup time as the device context likely
> +have minimal changes from the previously written device context.

Is it the hypervisor that is in charge of doing the comparison and writing
only the delta?

> This flow may
> +reduce the device migration time significantly and may have near constant
> +device activation time regardless of number of virtqueues, resources and
> +passthough devices in use by the migrating virtual machine.

Thanks



> +
> +The owner driver can discard any partially read or written device context when
> +any of the device migration flow should be aborted.
> diff --git a/admin.tex b/admin.tex
> index 0803c26..6eeef58 100644
> --- a/admin.tex
> +++ b/admin.tex
> @@ -297,6 +297,7 @@ \subsection{Group administration commands}\label{sec:Basic Facilities of a Virti
>  might differ between different group types.
>
>  \input{admin-cmds-legacy-interface.tex}
> +\input{admin-cmds-device-migration.tex}
>
>  \devicenormative{\subsubsection}{Group administration commands}{Basic Facilities of a Virtio Device / Device groups / Group administration commands}
>
> --
> 2.34.1
>
>



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-09  8:49   ` Jason Wang
@ 2023-10-09 10:06     ` Parav Pandit
  2023-10-10  5:51       ` Jason Wang
  2023-10-09 12:02     ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-09 10:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, October 9, 2023 2:19 PM
> 
> Adding LingShan.
> 
Thanks for adding him.

> Parav, if you want any specific people to comment, please do cc them.
> 
Sure, will cc them in v2 as now I see there is interest in the review.

> On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > One or more passthrough PCI VF devices are ubiquitous for virtual
> > machines usage using generic kernel framework such as vfio [1].
> 
> Mentioning a specific subsystem in a specific OS may mislead the user to think
> it can only work in that setup. Let's not do that, virtio is not only used for Linux
> and VFIO.
> 
Not really. It is an example in the cover letter.
It is not the only use case.
A use case gives crisp clarity about what the UAPI needs to fulfil.
So I will keep it. It is anyway written as one use case.

> >
> > A passthrough PCI VF device is fully owned by the virtual machine
> > device driver.
> 
> Is this true? Even VFIO needs to mediate PCI stuff. Or how do you define
> "passthrough" here?
> 
Other than PCI config registers and, due to some legacy, MSI-X, the "device interface" side is not mediated.
The definition of passthrough here is: not mediating device-type-specific and virtio-specific interfaces for modern and future devices.

> > This passthrough device controls its own device reset flow, basic
> > functionality as PCI VF function level reset
> 
> How about other PCI stuff? Or Why is FLR special?
FLR is called out so that readers get the clarity that FLR is also done by the guest driver; hence, the device migration commands do not interact with or depend on the FLR flow.

> 
> > and rest of the virtio device functionality such as control vq,
> 
> What do you mean by "rest of"?
> 
As given in the example: cvq.

> Which part is not controlled and why?
Not controlled because, as stated, it is a passthrough device.

> > config space access, data path descriptors handling.
> >
> > Additionally, VM live migration using a precopy method is also widely used.
> 
> Why is this mentioned here?
> 
Huh. You should be positive for bringing clarity to the readers on understanding the use case.
And you seem opposite, but ok.

As stated, it is for the reader to understand the use case and see how the proposed commands address it.

> >
> > To support a VM live migration for such passthrough virtio devices,
> > the owner PCI PF device administers the device migration flow.
> 
> Well, if this is specific only to PCI SR-IOV, I'd move it to the PCI transport part.
> But I guess not.
We took the decision to not do so, for other group commands as well.
After Michael's suggestion we moved it to group commands.
So I will not debate this further.

> 
> >
> > This patch introduces the basic theory of operation which describes
> > the flow and supporting administration commands.
> >
> > [1]
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/
> > include/uapi/linux/vfio.h?h=v6.1.47
> >
> > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > ---
> >  admin-cmds-device-migration.tex | 94
> +++++++++++++++++++++++++++++++++
> >  admin.tex                       |  1 +
> >  2 files changed, 95 insertions(+)
> >  create mode 100644 admin-cmds-device-migration.tex
> >
> > diff --git a/admin-cmds-device-migration.tex
> > b/admin-cmds-device-migration.tex new file mode 100644 index
> > 0000000..f839af4
> > --- /dev/null
> > +++ b/admin-cmds-device-migration.tex
> > @@ -0,0 +1,94 @@
> > +\subsubsection{Device Migration}\label{sec:Basic Facilities of a
> > +Virtio Device / Device groups / Group administration commands /
> > +Device Migration}
> > +
> > +In some systems, there is a need to migrate a running virtual machine
> > +from one to another system. A running virtual machine has one or more
> > +passthrough virtio member devices attached to it. A passthrough
> > +device is entirely operated by the guest virtual machine. For
> > +example, with the SR-IOV group type, group member (VF) may undergo
> > +virtio device initialization and reset flow
> 
> What do you mean by "reset flow"? It looks not like a terminology defined in the
> PCI spec. And Google gives me nothing about this.
> 
"reset flow" = virtio specification section 2.4 Device Reset flow.

> > and may also undergo PCI function level
> > +reset(FLR) flow.
> 
> Why is only FLR special here? I've asked FRS but you ignore the question.
> 
FLR is called out to bring clarity that the guest owns the VF doing FLR; hence the hypervisor cannot mediate any registers of the VF.

> > Such flows must comply to the PCI standard and also
> > +virtio specification;
> 
> This seems unnecessary and obvious as it applies to all other PCI and virtio
> functionality.
> 
Great. But your comment is contradictory.

> What's more, for the things that need to be synchronized, I don't see any
> descriptions in this patch. And if it doesn't need, why?
With which operation should it be synchronized and why?
Can you please be specific?

It is not written in this series, because we believe it must not be synchronized as it is fully controlled by the guest.

> 
> > at the same time such flows must not obstruct
> > +the device migration flow. In such a scenario, a group owner device
> > +can provide the administration command interface to facilitate the
> > +device migration related operations.
> > +
> > +When a virtual machine migrates from one hypervisor to another
> > +hypervisor, these hypervisors are named as source and destination
> hypervisor respectively.
> > +In such a scenario, a source hypervisor administers the member device
> > +to suspend the device and preserves the device context.
> > +Subsequently, a destination hypervisor administers the member device
> > +to setup a device context and resumes the member device. The source
> > +hypervisor reads the member device context and the destination
> > +hypervisor writes the member device context. The method to transfer
> > +the member device context from the source to the destination hypervisor is
> outside the scope of this specification.
> > +
> > +The member device can be in any of the three migration modes. The
> > +owner driver sets the member device in one of the following modes during
> device migration flow.
> > +
> > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name &
> > +Description \\ \hline \hline
> > +0x0   & Active &
> > +  It is the default mode after instantiation of the member device. \\
> 
> I don't think we ever define "instantiation" anywhere.
> 
Well, a transport already has an implicit definition of instantiation.
Maybe some text can be added, but I don’t see value in duplicating the PCI spec here.

> > +\hline
> > +0x1   & Stop &
> > + In this mode, the member device does not send any notifications,
> > +and it does not access any driver memory.
> 
> What's the meaning of "driver memory"?
> 
Maybe guest memory? Or do you suggest a better name for the memory allocated by the guest driver?

> And stop seems to be a source of inflight buffers.
> 
I didn’t follow it.
If you mean that without stop there are no inflight buffers, then I don’t agree.
We don’t want to violate the spec by having descriptors with zero size returned.
Stop is not the source of inflight descriptors.

There are inflight descriptors with the device that are not yet returned to the driver, and the device won't return them as zero-size wrong completions.

> > + The member device may receive driver notifications in this mode,
> 
> What's the meaning of "receive"? For example if the device can still process
> buffers, "stop" is not accurate.
> 
Receive means the driver can send the notification as a PCIe TLP, which the device may receive as an incoming PCIe TLP.

In "stop" mode, the device won't process descriptors.

> > + the member device context
> 
> I don't think we define "device context" anywhere.
> 
It is defined further in the description.

> >and device configuration space may change. \\
> > +\hline
> 
> I still don't get why we need a "stop" state in the middle.
> 
All PCI devices that belong to a single guest VM are not stopped atomically.
Hence, a device that is in freeze mode may still receive driver notifications from another PCI device, or it may experience a read from the shared memory and get garbage data.
And things can break.
Hence the stop mode ensures that all the devices get enough chance to stop themselves and, later when frozen, do not change anything internally.

> > +0x2   & Freeze &
> > + In this mode, the member device does not accept any driver
> > +notifications,
> 
> This is too vague. Is the device allowed to be freezed in the middle of any virtio
> or PCI operations?
> 
> For example, in the middle of feature negotiation etc. It may cause
> implementation specific sub-states which can't be migrated easily.
> 
Yes, it is allowed in the middle of feature negotiation, for sure.
It is a passthrough device, hence the hypervisor layer does not get to see the sub-state.

Not sure why you say it cannot be migrated easily.
The device context already covers this sub-state.

> And what's more, the above state machine seems to be virtio specific, but you
> don't explain the interaction with the device status state machine. 
First, the above is not a state machine.
Second, it is not virtio specific. It is present in a leading OS that has a fundamental requirement to support P2P devices.
Third, it is not interacting with the _actual_ device status.

In the "SUSPEND" patch-5, you already asked this question. I assume you asked again so that this series is complete.

> For example,
> what happens if the driver wants to reset but the device is in stop mode? You
> told me it is addressed in your series but looks not. Once you try to describe
> that, you're actually try to connect states between the two state machines.
> 
As listed in the definition of the stop mode, the device does not act on the incoming writes; it only keeps track of its internal device context changes as part of this.
We would enrich the device context for this, but there is no need to connect the admin mode controlled by the owner device with the operational state (device_status) owned by the member device.

> > + it ignores any device configuration space writes,
> 
> How about read and the device configuration changes?
> 
As listed, the device does not have any changes.
So a device configuration change cannot occur.

The device requirements cover this content more explicitly:

For the SR-IOV group type, regardless of the member device mode, all the PCI transport level registers
MUST be always accessible and the member device MUST function the same way for all the PCI transport
level registers regardless of the member device mode.

> > + the device do not have any changes in the device context. The member
> > + device is not accessed in the system through the virtio interface.
> > + \\
> 
> But accessible via PCI interface?
> 
Yes, as usual.

> For example, what happens if we want to freeze during FLR? Does the
> hypervisor need to wait for the FLR to be completed?
> 
The hypervisor does not need to wait for the FLR to be completed.

> > +\hline
> > +\hline
> > +0x03-0xFF   & -    & reserved for future use \\
> > +\hline
> > +\end{tabularx}
> > +
> > +When the owner driver wants to stop the operation of the device, the
> > +owner driver sets the device mode to \field{Stop}. Once the device is
> > +in the \field{Stop} mode, the device does not initiate any
> > +notifications or does not access any driver memory. Since the member
> > +driver may be still active which may send further driver
> > +notifications to the device, the device context may be updated. When
> > +the member driver has stopped accessing the device, the owner driver
> > +sets the device to \field{Freeze} mode indicating to the device that
> > +no more driver access occurs. In the \field{Freeze} mode, no more
> > +changes occur in the device context. At this point, the device ensures that there will not be any update to the device context.
> 
> What is missed here are:
> 
> 1) it is a virtio specific states or not
It is not.

> 2) if it is a virtio specific state, if or how to synchronize with transport specific
> interfaces and why
> 3) can active go directly to freeze and why
> 
Yes, I don't see a reason to not allow it.
A direct change from active to freeze mode is useful on the destination side, where the destination hypervisor knows for sure that no other entity is accessing the device.
It also needs to set up the device context that it received from the source side.
So setting freeze mode can be done directly.

> > +
> > +The member device has a device context which the owner driver can
> > +either read or write. The member device context consist of any device
> > +specific data which is needed by the device to resume its operation
> > +when the device mode
> 
> This is too vague. There're states that are not suitable for cmd/queue for sure.
> I'd split it into
> 
> 1) common states: virtqueue, dirty pages
> 2) device specific states: defined be each device
> 
This is the theory of operation section, so it is not capturing such details.
The actual device context definition is outside of the theory section, and the precise states of virtqueue, device specific, etc. are in it.

> > +is changed from \field{Stop} to \field{Active} or from \field{Freeze}
> > +to \field{Active}.
> > +
> > +Once the device context is read, it is cleared from the device.
> 
> This is horrible, it means we can't easily
> 
> 1) re-try the migration
> 2) recover from migration failure
> 
Can you please explain the flow?
And which software stack may find this useful?
Is there any existing software that can utilize it?
Why, in your assumption, would the device context held by the software vanish, if it does?

> > Typically, on
> > +the source hypervisor, the owner driver reads the device context once
> > +when the device is in \field{Active} or \field{Stop} mode and later
> > +once the member device is in \field{Freeze} mode.
> 
> Why need the read while device context could be changed? Or is the dirty page
> part of the device context?
> 
It is not part of the dirty page.
It needs to be read in the active/stop mode so that it can be shared with the destination hypervisor, which will pre-set up the complex context of the device while it is still running on the source side.

> > +
> > +Typically, the device context is read and written one time on the
> > +source and the destination hypervisor respectively once the device is
> > +in \field{Freeze} mode. On the destination hypervisor, after writing
> > +the device context, when the device mode set to \field{Active}, the
> > +device uses the most recently set device context and resumes the device operation.
> 
> There's no context sequence, so this is obvious. It's the semantic of all other
> existing interfaces.
> 
Can you please clarify which existing interfaces you mean here?

> > +
> > +In an alternative flow, on the source hypervisor the owner driver may
> > +choose to read the device context first time while the device is in
> > +\field{Active} mode and second time once the device is in \field{Freeze} mode.
> 
> Who is going to synchronize the device context with possible configuration from
> the driver?
> 
Not sure I understand the question.
If I understand you right, do you mean:
when a configuration change is done by the guest driver, how does the device context change?

If so, reading the device context will reflect the new configuration.

> > Similarly, on the
> > +destination hypervisor writes the device context first time while the
> > +device is still running in \field{Active} mode on the source
> > +hypervisor and writes the device context second time while the device is in \field{Freeze} mode.
> > +This flow may result in very short setup time as the device context
> > +likely have minimal changes from the previously written device context.
> 
> Is the hypervisor who is in charge of doing the comparison and writing only the
> delta?
> 
The spec commands allow doing so. So the possibility exists spec-wise.
In the current proposal, there isn't a need for the hypervisor to do so at all.

The destination side device gets to see the new device context and applies the delta.
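
To make the mode discussion above concrete, here is a minimal C sketch of the three modes from the quoted table and of the mode changes that this thread explicitly mentions (Active to Stop, Stop to Freeze, Active to Freeze, and Stop or Freeze back to Active). The enum and function names are hypothetical and not part of the patch; any change not discussed here, such as Freeze to Stop, is simply reported as not covered rather than as forbidden.

```c
#include <stdbool.h>

/* Mode values from the quoted table; the identifiers are hypothetical. */
enum member_dev_mode {
    MODE_ACTIVE = 0x0,
    MODE_STOP   = 0x1,
    MODE_FREEZE = 0x2,
};

/*
 * Returns true only for mode changes that the thread explicitly mentions.
 * A false result means "not discussed here", not "prohibited by the spec".
 */
bool mode_change_discussed(enum member_dev_mode from, enum member_dev_mode to)
{
    switch (from) {
    case MODE_ACTIVE:
        /* Stop is the usual source-side step; direct Freeze suits the destination. */
        return to == MODE_STOP || to == MODE_FREEZE;
    case MODE_STOP:
        return to == MODE_FREEZE || to == MODE_ACTIVE;
    case MODE_FREEZE:
        return to == MODE_ACTIVE;
    }
    return false;
}
```

On the destination side, for example, mode_change_discussed(MODE_ACTIVE, MODE_FREEZE) holds, which matches the direct Active to Freeze change described above.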

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-08 11:41   ` [virtio-comment] " Michael S. Tsirkin
  2023-10-09  4:15     ` Parav Pandit
@ 2023-10-09 10:34     ` Zhu, Lingshan
  2023-10-09 14:30       ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-09 10:34 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: virtio-comment, cohuck, sburla, shahafs, maorg, yishaih



On 10/8/2023 7:41 PM, Michael S. Tsirkin wrote:
> On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
>> Define the device context and its fields for purpose of device
>> migration. The device context is read and written by the owner driver
>> on source and destination hypervisor respectively.
>>
>> Device context fields will experience a rapid growth post this initial
>> version to cover many details of the device.
>>
>> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
>> Signed-off-by: Parav Pandit <parav@nvidia.com>
>> Signed-off-by: Satananda Burla <sburla@marvell.com>
>> ---
>> changelog:
>> v0->v1:
>> - enrich device context to cover feature bits, device configuration
>>    fields
>> - corrected alignment of device context fields
>> ---
>>   content.tex        |   1 +
>>   device-context.tex | 142 +++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 143 insertions(+)
>>   create mode 100644 device-context.tex
>>
>> diff --git a/content.tex b/content.tex
>> index 0a62dce..2698931 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -503,6 +503,7 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
>>   UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>>   
>>   \input{admin.tex}
>> +\input{device-context.tex}
>>   
>>   \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
>>   
>> diff --git a/device-context.tex b/device-context.tex
>> new file mode 100644
>> index 0000000..5611382
>> --- /dev/null
>> +++ b/device-context.tex
>> @@ -0,0 +1,142 @@
>> +\section{Device Context}\label{sec:Basic Facilities of a Virtio Device / Device Context}
>> +
>> +The device context holds the information that a owner driver can use
>> +to setup a member device and resume its operation. The device context
>> +of a member device is read or written by the owner driver using
>> +administration commands.
>> +
>> +\begin{lstlisting}
>> +struct virtio_dev_ctx_field_tlv {
>> +        le32 type;
>> +        le32 reserved;
>> +        le64 length;
>> +        u8 value[];
>> +};
>> +
>> +struct virtio_dev_ctx {
>> +        le32 field_count;
>> +        struct virtio_dev_ctx_field_tlv fields[];
>> +};
>> +
>> +\end{lstlisting}
so this still doesn't work for nested
>> +
>> +The \field{struct virtio_dev_ctx} is the device context of a member device.
>> +The \field{field_count} indicates how many instances of
>> +\field{struct virtio_dev_ctx_field_tlv} are present.
>> +
>> +The \field{struct virtio_dev_ctx_field_tlv} consist of \field{type} indicating
>> +what data is contained in the \field{value} of length \field{length}.
>> +The valid values for \field{type} can be found in the following table:
>> +
>> +\begin{tabularx}{\textwidth}{ |l||l|X| }
>> +\hline
>> +type & Name & Description \\
>> +\hline \hline
>> +0x0 & VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG & Provides common configuration space of device for PCI transport \\
>> +\hline
>> +0x1 & VIRTIO_DEV_CTX_DEV_CFG_LAYOUT & Provides device specific configuration layout \\
>> +\hline
>> +0x2 & VIRTIO_DEV_CTX_DEV_FEATURES & Provides device features \\
>> +\hline
>> +0x3 & VIRTIO_DEV_CTX_PCI_VQ_CFG & Provides Virtqueue configuration for PCI transport \\
>> +\hline
>> +0x4 & VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG & Provides Queue run time state \\
>> +\hline
>> +0x5 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of virtqueue descriptors owned by device  \\
>> +\hline
>> +0x6 - 0xFFFFFFFF & - & Reserved for future types \\
>> +\hline
>> +\end{tabularx}
>
> I don't think this is enough, e.g. virtio net has internal state
> controlled thought CVQ commands. how do you intend to address/migrate
> these?
>
>> +\subsubsection{Device Context Fields}\label{sec:Basic Facilities of a Virtio Device / Device Context / Device Context Fields}
>> +
>> +\paragraph{PCI Common Configuration Context}
>> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Common Configuration Context}
>> +
>> +For the field VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG, \field{type} is set to 0x0.
>> +The \field{value} is in format of \field{struct virtio_pci_common_cfg}.
>> +The \field{length} is the length of \field{struct virtio_pci_common_cfg}.
>> +
>> +\paragraph{Device Configuration Layout Context}
>> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Device Configuration Layout Context}
>> +
>> +For the field VIRTIO_DEV_CTX_DEV_CFG_LAYOUT, \field{type} is set to 0x1.
>> +The \field{value} is in format of device specific configuration layout listed
>> +in each of the device's device configuration layout section.
>> +The \field{length} is the length of the device configuration layout data.
> Unclear. I am guessing it's doing things like setting up RO
> fields? This needs to be specified per device really.
> Also how some fields behave might depend on features.
>
>> +
>> +\paragraph{Device Features Context}
>> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Device Features Context}
>> +
>> +For the field VIRTIO_DEV_CTX_DEV_FEATURES, \field{type} is set to 0x2.
>> +The \field{value} is in format of device feature bits listed in
>> +\ref{sec:Basic Facilities of a Virtio Device / Feature Bits} in the format of \field{struct virtio_dev_ctx_features}.
>> +The \field{length} is the length of the device features.
>> +
>> +\begin{lstlisting}
>> +struct virtio_dev_ctx_features {
>> +        le64 feature_bits[];
>> +};
>> +\end{lstlisting}
>> +
>> +\paragraph{PCI Virtqueue Configuration Context}
>> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ PCI Virtqueue Configuration Context}
>> +
>> +For the field VIRTIO_DEV_CTX_PCI_VQ_CFG, \field{type} is set to 0x3.
>> +The \field{value} is in format of \field{struct virtio_dev_ctx_pci_vq_cfg}.
>> +The \field{length} is the length of \field{struct virtio_dev_ctx_pci_vq_cfg}.
>> +
>> +\begin{lstlisting}
>> +struct virtio_dev_ctx_pci_vq_cfg {
>> +        le16 vq_index;
>> +        le16 queue_size;
>> +        le16 queue_msix_vector;
>> +        le64 queue_desc;
>> +        le64 queue_driver;
>> +        le64 queue_device;
>> +};
>> +\end{lstlisting}
>> +
>> +One or multiple entries of PCI Virtqueue Configuration Context may exist, each such
>> +entry corresponds to a unique virtqueue identified by the \field{vq_index}.
>> +
>> +\paragraph{Virtqueue Split Mode Runtime Context}
>> +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context Fields/ Virtqueue Split Mode Runtime Context}
>> +
>> +For the field VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG, \field{type} is set to 0x4.
>> +The \field{value} is in format of \field{struct virtio_dev_ctx_vq_split_runtime}.
>> +The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_runtime}.
>> +
>> +\begin{lstlisting}
>> +struct virtio_dev_ctx_vq_split_runtime {
>> +        le16 vq_index;
>> +        le16 dev_avail_idx;
>> +        u8 enabled;
>> +};
>> +\end{lstlisting}
>> +
>> +The \field{dev_avail_idx} indicates the next available index of the virtqueue from which
>> +the device must start processing the available ring.
>> +
>> +One or multiple entries of Virtqueue Split Mode Runtime Context may exist, each such
>> +entry corresponds to a unique virtqueue identified by the \field{vq_index}.
>> +
>> +\paragraph{Virtqueue Split Mode Device owned Descriptors Context}
>> +
>> +For the field VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC, \field{type} is set to 0x5.
>> +The \field{value} is in format of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
>> +The \field{length} is the length of \field{struct virtio_dev_ctx_vq_split_dev_descs}.
>> +
>> +\begin{lstlisting}
>> +struct virtio_dev_ctx_vq_split_dev_descs {
>> +        le16 vq_index;
>> +        le16 desc_count;
>> +        le16 desc_idx[];
>> +};
>> +\end{lstlisting}
>> +
>> +The \field{desc_idx} contains indices of the descriptors in \field{desc_count} of a
>> +virtqueue identified by \field{vq_index} which is owned by the device.
>> +
>> +One or multiple entries of Virtqueue Split Mode Device owned Descriptors Context may exist, each such
>> +entry corresponds to a unique virtqueue identified by the \field{vq_index}.
>> -- 
>> 2.34.1
>
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
>
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
>
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
>
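
For readers following the TLV layout quoted above, a minimal C sketch of how an owner driver might walk a device context buffer. It assumes a packed wire layout in which the first TLV directly follows the le32 field_count with no padding; the quoted patch does not state this, so treat it as an assumption. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read little-endian values byte by byte, independent of host endianness and alignment. */
uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

uint64_t rd_le64(const uint8_t *p)
{
    return (uint64_t)rd_le32(p) | (uint64_t)rd_le32(p + 4) << 32;
}

#define TLV_HDR_LEN 16u /* le32 type + le32 reserved + le64 length */

/*
 * Counts the TLVs in a device context whose type equals `want`.
 * Returns -1 if the buffer is truncated or a length runs past its end.
 * ASSUMPTION: packed layout, TLVs start right after the le32 field_count.
 */
int dev_ctx_count_type(const uint8_t *buf, size_t len, uint32_t want)
{
    size_t off = 4;
    uint32_t n, i;
    int hits = 0;

    if (len < 4)
        return -1;
    n = rd_le32(buf);
    for (i = 0; i < n; i++) {
        uint32_t type;
        uint64_t vlen;

        if (len - off < TLV_HDR_LEN)
            return -1;              /* header does not fit */
        type = rd_le32(buf + off);
        vlen = rd_le64(buf + off + 8);
        off += TLV_HDR_LEN;
        if (vlen > len - off)
            return -1;              /* value does not fit */
        if (type == want)
            hits++;
        off += (size_t)vlen;        /* skip the value bytes */
    }
    return hits;
}
```

Counting by type is only one possible use; the same loop could hand each value to a per-type handler, for example one entry of type 0x3 per virtqueue.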



^ permalink raw reply	[flat|nested] 341+ messages in thread

* [virtio-comment] Re: [PATCH v1 7/8] admin: Add write recording commands
  2023-10-09  4:14     ` [virtio-comment] " Parav Pandit
@ 2023-10-09 10:57       ` Michael S. Tsirkin
  2023-10-09 11:48         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-09 10:57 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Mon, Oct 09, 2023 at 04:14:22AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Sunday, October 8, 2023 5:23 PM
> > 
> > On Sun, Oct 08, 2023 at 02:25:54PM +0300, Parav Pandit wrote:
> > > When migrating a virtual machine with passthrough virtio devices, the
> > > virtio device may write into the guest memory. Some systems may not be
> > > able to keep track of these pages efficiently.
> > >
> > > To facilitate such a system, a device provides the record of pages
> > > which are written by the device. In one use case, this commands
> > > connect to the vfio framework at [1].
> > >
> > > The owner driver configures the member device for list of address
> > > ranges for which it expects write recording and reporting by the device.
> > >
> > > The owner driver periodically queries the written pages address record
> > > which gets cleared from the device upon reading it.
> > >
> > > When the write records reduces over the time, at one point write
> > > recording is stopped after the device mode is set to FREEZE.
> > >
> > > [1]
> > > https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vf
> > > io.h#L1207
> > >
> > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > Signed-off-by: Satananda Burla <sburla@marvell.com>
> > > ---
> > >  admin-cmds-device-migration.tex | 146
> > ++++++++++++++++++++++++++++++--
> > >  admin.tex                       |  10 ++-
> > >  2 files changed, 146 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/admin-cmds-device-migration.tex
> > > b/admin-cmds-device-migration.tex index e98d552..49835eb 100644
> > > --- a/admin-cmds-device-migration.tex
> > > +++ b/admin-cmds-device-migration.tex
> > > @@ -97,15 +97,16 @@ \subsubsection{Device Migration}\label{sec:Basic
> > > Facilities of a Virtio Device /  During the device migration flow, a
> > > passthrough device may write data to the  guest virtual machine
> > > memory, a source hypervisor needs to keep track of these  written memory to
> > migrate such memory to destination hypervisor.
> > > -Some systems may not be able to keep track of such memory write
> > > addresses at -hypervisor level. In such a scenario, a device records
> > > and reports these -written memory addresses to the owner device. Such
> > > an address is named as -IO virtual address (IOVA). The owner driver
> > > enables write recording for one or -more IOVA ranges per device during
> > > device migration flow. The owner driver -periodically queries these
> > > written IOVA records from the device. As the driver -reads the written IOVA
> > records, the device clears those records from the device.
> > > -Once the device reports zero or small number of written IOVA records,
> > > the device -mode is set to \field{Stop} or \field{Freeze}. Once the
> > > device is set to \field{Stop}
> > > +Some systems may not be able to keep track of such memory writes at
> > > +addresses at hypervisor level. In such a scenario, a device records
> > > +and reports these written memory addresses to the owner device.
> > 
> > 
> > what does it mean to record them?
> >
> May be we can use a different verb than record. May be tracking?
> Record means, device accumulates these written pages addresses, and when driver queries, it returns these addresses and clears it internally within the device.

For example, what about two writes into same address. Do you get one
record or two? What if the length is different?


> > > Such an
> > > +address is named as IO virtual address (IOVA).
> > 
> > I don't know what does this have to do with IOVA. For that matter everything
> > would have to be "IOVA". Spec calls these physical address and let's stick to
> > that.
> > 
> Make sense.
> At device level it does not have knowledge of IOVA.
> I will rename it.
> 
> > 
> > > The owner driver enables write
> > > +recording for one or more IOVA ranges per device during device
> > > +migration flow. The owner driver periodically queries these written
> > > +IOVA records from the device.
> > 
> > periodical reads without any indication are the only option then?
> >
> At least for now, this is starting point. Software stack such as QEMU does it periodically.

So for CPU kvm switched to PML and this seems to work better -
it guarantees there's convergence.

> When new use case arise, may be it can be extended.
>  
> > > As the driver reads the written IOVA records,
> > > +the device clears those records from the device. Once the device
> > > +reports zero or small number of written IOVA records, the device is
> > > +set to \field{Stop} or \field{Freeze} mode. Once the device is set to
> > > +\field{Stop}
> > >  or \field{Freeze} mode, and once all the IOVA records are read, the
> > > driver stops  the write recording in the device.
> > 
> > 
> > it is not great that you are rewriting text you just wrote in patch 1 here. pls find
> > a way not to make reviewers read everything twice.
> > 
> There is small duplication of one line explaining mode change, rest is contextual to the write recording.
> Merging with text of patch_1 was slightly complicated to read, so one sentence is duplicated.
> 
> I will again check if patch_1 text extension is easier to read.

this is latex, space is ignored. if you are only changing one word just
don't move the rest and diff will look sane

> > > @@ -118,6 +119,10 @@ \subsubsection{Device Migration}\label{sec:Basic
> > > Facilities of a Virtio Device /  \item Device Context Read Command
> > > \item Device Context Write Command  \item Device Context Discard
> > > Command
> > > +\item Device Write Record Capabilities Query Command \item Device
> > > +Write Records Start Command \item Device Write Records Stop Command
> > > +\item Device Write Records Read Command
> > >  \end{enumerate}
> > >
> > >  These commands are currently only defined for the SR-IOV group type.
> > > @@ -307,6 +312,129 @@ \subsubsection{Device Migration}\label{sec:Basic
> > > Facilities of a Virtio Device /  discarded, subsequent
> > > VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command writes a new device
> > context.
> > >
> > > +\paragraph{Device Write Record Capabilities Query Command}
> > > +\label{par:Basic Facilities of a Virtio Device / Device groups /
> > > +Group administration commands / Device Migration / Device Write
> > > +Record Capabilities Query Command}
> > > +
> > > +This command reads the device write record capabilities.
> > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
> > > +\field{opcode} is set to 0xd.
> > > +The \field{group_member_id} refers to the member device to be accessed.
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_admin_cmd_dev_write_record_cap_result {
> > > +        le32 supported_iova_page_size_bitmap;
> > > +        le32 supported_iova_ranges;
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +When the command completes successfully,
> > > +\field{command_specific_result} is in the format \field{struct
> > > +virtio_admin_cmd_dev_write_record_cap_result}
> > > +returned by the device. The \field{supported_iova_page_size_bitmap}
> > > +indicates the granularity at which the device can record IOVA ranges.
> > > +the minimum granularity can be 4KB. Bit 0 corresponds to 4KB, bit 1
> > > +corresponds to 8KB, bit 31 corresponds to 8TB. The device supports at least
> > one page granularity.
> > > +The device support one or more IOVA page granularity; for each IOVA
> > > +page granularity, the device sets corresponding bit in the
> > > +\field{supported_iova_page_size_bitmap}. The
> > > +\field{supported_iova_ranges} indicates how many unique (non
> > > +overlapping) IOVA ranges can be recorded by the device.
> > 
> > what role does this granularity play? i see no mention of it down the road.
> > 
> The page_size in struct virtio_admin_cmd_write_record_start_data must match to the granularity supplied above.
> I missed it. Will add in v2.
> This is very useful comment.

Not that it's very clear what does page_size do.
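
One possible reading, sketched in C under assumptions: each bit of supported_iova_page_size_bitmap names one power-of-two page size (bit 0 is 4KB and each following bit doubles it, so bit 31 works out to 2^43 bytes), and the page_size passed to the start command has to be one of the sizes whose bit is set. The function names are hypothetical, and the matching rule is taken from the reply quoted above, not from the patch text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Page size named by one bitmap bit: bit 0 is 4KB, each following bit doubles it. */
uint64_t write_record_page_size(unsigned int bit)
{
    return UINT64_C(4096) << bit;
}

/*
 * True if page_size is one of the 32 encodable sizes and its bit is set in
 * the capability bitmap reported by the device.
 */
bool write_record_page_size_ok(uint32_t bitmap, uint64_t page_size)
{
    unsigned int bit;

    for (bit = 0; bit < 32; bit++) {
        if (write_record_page_size(bit) == page_size)
            return ((bitmap >> bit) & 1u) != 0;
    }
    return false; /* not a size the bitmap can express */
}
```

With a bitmap of 0x3, for example, a driver could start recording with page_size 4096 or 8192, and any other value would be rejected by this check.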

> > 
> > > +
> > > +\paragraph{Device Write Records Start Command} \label{par:Basic
> > > +Facilities of a Virtio Device / Device groups / Group administration
> > > +commands / Device Migration / Device Write Records Start Command}
> > > +
> > > +This command starts the write recording in the device for the
> > > +specified IOVA ranges.
> > > +
> > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START,
> > > +\field{opcode} is set to 0xe.
> > > +The \field{group_member_id} refers to the member device to be accessed.
> > > +
> > > +The \field{command_specific_data} is in the format \field{struct
> > > +virtio_admin_cmd_write_record_start_data}.
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_admin_cmd_write_record_start_entry {
> > > +        le64 iova;
> > > +        le64 page_count;
> > > +};
> > > +
> > > +struct virtio_admin_cmd_write_record_start_data {
> > > +        le64 page_size;
> > > +        le32 count;
> > > +        u8 reserved[4];
> > > +        struct virtio_admin_cmd_write_record_start_entry entries[];
> > > +};
> > > +
> > > +\end{lstlisting}
> > > +
> > > +The \field{count} is set to indicate number of valid \field{entries}.
> > > +The \field{iova} indicates the start IOVA address. The
> > > +\field{page_count} indicates number of pages of size
> > > +\field{page_size} starting from \field{iova} to record for write
> > > +reporting. VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > > +command contains unique i.e. non overlapping IOVA range entries.
> > > +Whenever a memory write occurs by the device in the supplied IOVA
> > > +range, the device records the actual IOVA and number of bytes written to the
> > IOVA.
> > > +These write records can be read by the the driver using
> > > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command.
> > > +
> > > +This command has no command specific result.
> > > +
> > > +\paragraph{Device Write Record Stop Command} \label{par:Basic
> > > +Facilities of a Virtio Device / Device groups / Group administration
> > > +commands / Device Migration / Device Write Record Stop Command}
> > > +
> > > +This command stops the write recording in the device for IOVA ranges
> > > +which were previously started using
> > > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > > +command.
> > > +
> > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP,
> > > +\field{opcode} is set to 0xf.
> > > +The \field{group_member_id} refers to the member device to be accessed.
> > > +
> > > +This command does not have any command specific data.
> > > +This command has no command specific result.
> > > +
> > > +\paragraph{Device Write Records Read Command} \label{par:Basic
> > > +Facilities of a Virtio Device / Device groups / Group administration
> > > +commands / Device Migration / Device Write Records Read Command}
> > > +
> > > +This command reads the device write records for which the write
> > > +recording is previously started using
> > VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
> > > +
> > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ,
> > > +\field{opcode} is set to 0x10.
> > > +The \field{group_member_id} refers to the member device to be accessed.
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_admin_cmd_write_records_read_data {
> > > +        le64 iova;
> > > +        le64 length;
> > > +};
> > > +
> > > +struct virtio_admin_cmd_dev_write_records_cnt {
> > > +        le32 count;
> > > +};
> > > +
> > > +struct virtio_admin_cmd_dev_write_records_result {
> > > +        le64 iova_entries[];
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +The \field{command_specific_data} is in the format \field{struct
> > > +virtio_admin_cmd_write_records_read_data}. The driver sets the \field
> > > +{iova} indicating the start IOVA address for up to the \field{length}
> > > +number of bytes. The supplied IOVA range same or smaller than the
> > > +range supplied when write recording is started by the driver in
> > > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
> > 
> > Seems pretty sparse. Lots of hypervisors chose to implement a bit per page
> > strategy.
> This command result addresses will feed into such bit.

Do we want to return it in this format then?

> > 
> > > +
> > > +When the command completes successfully,
> > > +\field{command_specific_result} is in the format \field{struct
> > > +virtio_admin_cmd_dev_write_records_result}
> > > +and \field{command_specific_result} is in format of \field{struct
> > > +virtio_admin_cmd_dev_write_records_cnt} containing number of write
> > > +records returned by the device.
> > 
> > what are these records though?
> > 
> It is struct virtio_admin_cmd_dev_write_records_result.
> I will rephrase it to link to struct virtio_admin_cmd_dev_write_records_result.
> 
> > 
> > > When the command completes
> > > +successfully, the write records which are returned in the result are
> > > +cleared from the device and same records cannot be read again. When
> > > +new writes occur at same IOVA range or at different once, those
> > > +records can be read as new write records.
> > 
> > 
> > this last sentence just confuses.
> > 
> How about just keeping below text rewrite?
> When the command completes successfully, the write records returned in the result are cleared from the device.

I think we need to explain in more detail what exactly is expected
to be recorded and when.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 7/8] admin: Add write recording commands
  2023-10-09 10:57       ` [virtio-comment] " Michael S. Tsirkin
@ 2023-10-09 11:48         ` Parav Pandit
  2023-10-09 16:15           ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-09 11:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Monday, October 9, 2023 4:28 PM

> > May be we can use a different verb than record. May be tracking?
> > Record means, device accumulates these written pages addresses, and when
> driver queries, it returns these addresses and clears it internally within the
> device.
> 
> For example, what about two writes into same address. Do you get one record
> or two? What if the length is different?
One record.
Since the writes are tracked/recorded at page granularity, it is still one entry even if the length is different.

> 
> 
> > > > Such an
> > > > +address is named as IO virtual address (IOVA).
> > >
> > > I don't know what does this have to do with IOVA. For that matter
> > > everything would have to be "IOVA". Spec calls these physical
> > > address and let's stick to that.
> > >
> > Make sense.
> > At device level it does not have knowledge of IOVA.
> > I will rename it.
> >
> > >
> > > > The owner driver enables write
> > > > +recording for one or more IOVA ranges per device during device
> > > > +migration flow. The owner driver periodically queries these
> > > > +written IOVA records from the device.
> > >
> > > periodical reads without any indication are the only option then?
> > >
> > At least for now, this is starting point. Software stack such as QEMU does it
> periodically.
> 
> So for CPU kvm switched to PML and this seems to work better - it guarantees
> there's convergence.
> 
My guess is that PML is better because write-protection-related faults are now removed.
The approach here is similar, but there is no PML kind of queue.
It is relatively manageable to slow down the CPU on VMEXIT or in other ways, compared to an external network.

And secondly, it is driven by the hypervisor CPU availability for capacity planning etc.
So it is periodic in nature and shares a similar scheme with PML.


> > When new use case arise, may be it can be extended.
> >
> > > > As the driver reads the written IOVA records,
> > > > +the device clears those records from the device. Once the device
> > > > +reports zero or small number of written IOVA records, the device
> > > > +is set to \field{Stop} or \field{Freeze} mode. Once the device is
> > > > +set to \field{Stop}
> > > >  or \field{Freeze} mode, and once all the IOVA records are read,
> > > > the driver stops  the write recording in the device.
> > >
> > >
> > > it is not great that you are rewriting text you just wrote in patch
> > > 1 here. pls find a way not to make reviewers read everything twice.
> > >
> > There is small duplication of one line explaining mode change, rest is
> contextual to the write recording.
> > Merging with text of patch_1 was slightly complicated to read, so one
> sentence is duplicated.
> >
> > I will again check if patch_1 text extension is easier to read.
> 
> this is latex, space is ignored. if you are only changing one word just don't move
> the rest and diff will look sane
> 
Yeah, right. I will fix this.

> > > > @@ -118,6 +119,10 @@ \subsubsection{Device
> > > > Migration}\label{sec:Basic Facilities of a Virtio Device /  \item
> > > > Device Context Read Command \item Device Context Write Command
> > > > \item Device Context Discard Command
> > > > +\item Device Write Record Capabilities Query Command \item Device
> > > > +Write Records Start Command \item Device Write Records Stop
> > > > +Command \item Device Write Records Read Command
> > > >  \end{enumerate}
> > > >
> > > >  These commands are currently only defined for the SR-IOV group type.
> > > > @@ -307,6 +312,129 @@ \subsubsection{Device
> > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command
> > > > writes a new device
> > > context.
> > > >
> > > > +\paragraph{Device Write Record Capabilities Query Command}
> > > > +\label{par:Basic Facilities of a Virtio Device / Device groups /
> > > > +Group administration commands / Device Migration / Device Write
> > > > +Record Capabilities Query Command}
> > > > +
> > > > +This command reads the device write record capabilities.
> > > > +For the command
> VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
> > > > +\field{opcode} is set to 0xd.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_admin_cmd_dev_write_record_cap_result {
> > > > +        le32 supported_iova_page_size_bitmap;
> > > > +        le32 supported_iova_ranges; }; \end{lstlisting}
> > > > +
> > > > +When the command completes successfully,
> > > > +\field{command_specific_result} is in the format \field{struct
> > > > +virtio_admin_cmd_dev_write_record_cap_result}
> > > > +returned by the device. The
> > > > +\field{supported_iova_page_size_bitmap}
> > > > +indicates the granularity at which the device can record IOVA ranges.
> > > > +the minimum granularity can be 4KB. Bit 0 corresponds to 4KB, bit
> > > > +1 corresponds to 8KB, bit 31 corresponds to 4TB. The device
> > > > +supports at least
> > > one page granularity.
> > > > +The device support one or more IOVA page granularity; for each
> > > > +IOVA page granularity, the device sets corresponding bit in the
> > > > +\field{supported_iova_page_size_bitmap}. The
> > > > +\field{supported_iova_ranges} indicates how many unique (non
> > > > +overlapping) IOVA ranges can be recorded by the device.
> > >
> > > what role does this granularity play? i see no mention of it down the road.
> > >
> > The page_size in struct virtio_admin_cmd_write_record_start_data must
> match to the granularity supplied above.
> > I missed it. Will add in v2.
> > This is very useful comment.
> 
> Not that it's very clear what does page_size do.
> 
Page_size is the granularity at which to record the writes.
For example, when page_size = 2MB, any writes are aligned to a 2MB page boundary.
If 8KB of data is written, only a single write record entry is reported.

If page_size = 4K, two write record entries are reported.
(I assumed a 4K-aligned address to keep the example simple.)
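
To make the page_size example above concrete, here is a minimal C sketch. It is not part of the patch: the helper names page_size_for_bit() and write_record_count() are hypothetical, and it assumes that bit 0 of supported_iova_page_size_bitmap means 4KB with each higher bit doubling the size, and that one record is reported per touched page. Under that doubling assumption bit 31 works out to 4KB << 31 = 8TB rather than the 4TB in the quoted text, so that value may be worth rechecking.

```c
#include <stdint.h>

/* Page size selected by one bit of supported_iova_page_size_bitmap.
 * Assumption: bit 0 is 4KB and every further bit doubles the size. */
static uint64_t page_size_for_bit(unsigned int bit)
{
    return UINT64_C(4096) << bit;
}

/* Number of page-granular write records covering a device write of
 * len bytes at addr. Assumes page_size is non-zero and len > 0. */
static uint64_t write_record_count(uint64_t addr, uint64_t len,
                                   uint64_t page_size)
{
    uint64_t first_page = addr / page_size;
    uint64_t last_page = (addr + len - 1) / page_size;

    return last_page - first_page + 1;
}
```

For a 2MB-aligned address, an 8KB write gives one record at page_size = 2MB and two records at page_size = 4KB. An 8KB write that is not 4KB-aligned touches three 4KB pages, and a write that straddles a 2MB boundary gives two records even at 2MB granularity.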

> > >
> > > > +
> > > > +\paragraph{Device Write Records Start Command} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device groups / Group
> > > > +administration commands / Device Migration / Device Write Records
> > > > +Start Command}
> > > > +
> > > > +This command starts the write recording in the device for the
> > > > +specified IOVA ranges.
> > > > +
> > > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START,
> > > > +\field{opcode} is set to 0xe.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +The \field{command_specific_data} is in the format \field{struct
> > > > +virtio_admin_cmd_write_record_start_data}.
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_admin_cmd_write_record_start_entry {
> > > > +        le64 iova;
> > > > +        le64 page_count;
> > > > +};
> > > > +
> > > > +struct virtio_admin_cmd_write_record_start_data {
> > > > +        le64 page_size;
> > > > +        le32 count;
> > > > +        u8 reserved[4];
> > > > +        struct virtio_admin_cmd_write_record_start_entry
> > > > +entries[]; };
> > > > +
> > > > +\end{lstlisting}
> > > > +
> > > > +The \field{count} is set to indicate number of valid \field{entries}.
> > > > +The \field{iova} indicates the start IOVA address. The
> > > > +\field{page_count} indicates number of pages of size
> > > > +\field{page_size} starting from \field{iova} to record for write
> > > > +reporting. VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > > > +command contains unique i.e. non overlapping IOVA range entries.
> > > > +Whenever a memory write occurs by the device in the supplied IOVA
> > > > +range, the device records the actual IOVA and number of bytes
> > > > +written to the
> > > IOVA.
> > > > +These write records can be read by the the driver using
> > > > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command.
> > > > +
> > > > +This command has no command specific result.
> > > > +
> > > > +\paragraph{Device Write Record Stop Command} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device groups / Group
> > > > +administration commands / Device Migration / Device Write Record
> > > > +Stop Command}
> > > > +
> > > > +This command stops the write recording in the device for IOVA
> > > > +ranges which were previously started using
> > > > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > > > +command.
> > > > +
> > > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP,
> > > > +\field{opcode} is set to 0xf.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +This command does not have any command specific data.
> > > > +This command has no command specific result.
> > > > +
> > > > +\paragraph{Device Write Records Read Command} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device groups / Group
> > > > +administration commands / Device Migration / Device Write Records
> > > > +Read Command}
> > > > +
> > > > +This command reads the device write records for which the write
> > > > +recording is previously started using
> > > VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
> > > > +
> > > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ,
> > > > +\field{opcode} is set to 0x10.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_admin_cmd_write_records_read_data {
> > > > +        le64 iova;
> > > > +        le64 length;
> > > > +};
> > > > +
> > > > +struct virtio_admin_cmd_dev_write_records_cnt {
> > > > +        le32 count;
> > > > +};
> > > > +
> > > > +struct virtio_admin_cmd_dev_write_records_result {
> > > > +        le64 iova_entries[];
> > > > +};
> > > > +\end{lstlisting}
> > > > +
> > > > +The \field{command_specific_data} is in the format \field{struct
> > > > +virtio_admin_cmd_write_records_read_data}. The driver sets the
> > > > +\field {iova} indicating the start IOVA address for up to the
> > > > +\field{length} number of bytes. The supplied IOVA range same or
> > > > +smaller than the range supplied when write recording is started
> > > > +by the driver in VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> command.
> > >
> > > Seems pretty sparse. Lots of hypervisors chose to implement a bit
> > > per page strategy.
> > This command result addresses will feed into such bit.
> 
> Do we want to return it in this format then?
I think yes, because converting to the bit is easy.
Reporting bits requires a bitmap whose size is a function of VM memory rather than of the number of written pages.
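
As a rough illustration of why the conversion is easy, the following C sketch folds a list of reported page addresses into a bit-per-page bitmap. It is only a sketch under assumptions: records_to_bitmap() is a hypothetical helper, the entries are assumed to be already converted from le64 to host byte order, and each entry is assumed to be the address of one written page.

```c
#include <stddef.h>
#include <stdint.h>

/* Set one bit per reported page address in a bit-per-page bitmap.
 * base is the start of the tracked range, nbits the number of pages
 * the bitmap covers. Returns how many entries fell outside the range
 * and were skipped. */
static size_t records_to_bitmap(const uint64_t *entries, size_t count,
                                uint64_t base, uint64_t page_size,
                                uint64_t *bitmap, size_t nbits)
{
    size_t skipped = 0;

    for (size_t i = 0; i < count; i++) {
        if (entries[i] < base) {
            skipped++;
            continue;
        }
        uint64_t page = (entries[i] - base) / page_size;
        if (page >= nbits) {
            skipped++;
            continue;
        }
        /* Repeated addresses simply set the same bit again. */
        bitmap[page / 64] |= UINT64_C(1) << (page % 64);
    }
    return skipped;
}
```

The work here scales with the number of records returned, while the bitmap itself is sized by the tracked memory, which is the trade-off mentioned above.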

> 
> > >
> > > > +
> > > > +When the command completes successfully,
> > > > +\field{command_specific_result} is in the format \field{struct
> > > > +virtio_admin_cmd_dev_write_records_result}
> > > > +and \field{command_specific_result} is in format of \field{struct
> > > > +virtio_admin_cmd_dev_write_records_cnt} containing number of
> > > > +write records returned by the device.
> > >
> > > what are these records though?
> > >
> > It is struct virtio_admin_cmd_dev_write_records_result.
> > I will rephrase it to link to struct virtio_admin_cmd_dev_write_records_result.
> >
> > >
> > > > When the command completes
> > > > +successfully, the write records which are returned in the result
> > > > +are cleared from the device and same records cannot be read
> > > > +again. When new writes occur at same IOVA range or at different
> > > > +once, those records can be read as new write records.
> > >
> > >
> > > this last sentence just confuses.
> > >
> > How about just keeping below text rewrite?
> > When the command completes successfully, the write records returned in the
> result are cleared from the device.
> 
> I think we need to explain in more detail what exactly is expected to be
> recorded and when.
> 
Ok. I will take a second look now to improve this description.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-09  8:49   ` Jason Wang
  2023-10-09 10:06     ` Parav Pandit
@ 2023-10-09 12:02     ` Parav Pandit
  2023-10-09 16:19       ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-09 12:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

Hi Jason,

> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, October 9, 2023 2:19 PM
> 
> Adding LingShan.
> 
> Parav, if you want any specific people to comment, please do cc them.
> 
> On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > One or more passthrough PCI VF devices are ubiquitous for virtual
> > machines usage using generic kernel framework such as vfio [1].
> 
> Mentioning a specific subsystem in a specific OS may mislead the user to think
> it can only work in that setup. Let's not do that, virtio is not only used for Linux
> and VFIO.
This is just one example of how these commands are useful.
They can be useful in more ways and in more OSes too.
I will drop it from the patch commit log and keep it for informational purposes in the cover letter.
Would that work for you?

I don't have any strong opinion on keeping or removing it, as most stakeholders have a clear view of the requirements now.
Let me know.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-09 10:34     ` Zhu, Lingshan
@ 2023-10-09 14:30       ` Parav Pandit
  2023-10-10  8:52         ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-09 14:30 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, October 9, 2023 4:04 PM
> 
> On 10/8/2023 7:41 PM, Michael S. Tsirkin wrote:
> > On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
> >> Define the device context and its fields for purpose of device
> >> migration. The device context is read and written by the owner driver
> >> on source and destination hypervisor respectively.
> >>
> >> Device context fields will experience a rapid growth post this
> >> initial version to cover many details of the device.
> >>
> >> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> >> Signed-off-by: Parav Pandit <parav@nvidia.com>
> >> Signed-off-by: Satananda Burla <sburla@marvell.com>
> >> ---
> >> changelog:
> >> v0->v1:
> >> - enrich device context to cover feature bits, device configuration
> >>    fields
> >> - corrected alignment of device context fields
> >> ---
> >>   content.tex        |   1 +
> >>   device-context.tex | 142
> +++++++++++++++++++++++++++++++++++++++++++++
> >>   2 files changed, 143 insertions(+)
> >>   create mode 100644 device-context.tex
> >>
> >> diff --git a/content.tex b/content.tex index 0a62dce..2698931 100644
> >> --- a/content.tex
> >> +++ b/content.tex
> >> @@ -503,6 +503,7 @@ \section{Exporting Objects}\label{sec:Basic Facilities
> of a Virtio Device / Expo
> >>   UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
> >>
> >>   \input{admin.tex}
> >> +\input{device-context.tex}
> >>
> >>   \chapter{General Initialization And Device
> >> Operation}\label{sec:General Initialization And Device Operation}
> >>
> >> diff --git a/device-context.tex b/device-context.tex new file mode
> >> 100644 index 0000000..5611382
> >> --- /dev/null
> >> +++ b/device-context.tex
> >> @@ -0,0 +1,142 @@
> >> +\section{Device Context}\label{sec:Basic Facilities of a Virtio
> >> +Device / Device Context}
> >> +
> >> +The device context holds the information that a owner driver can use
> >> +to setup a member device and resume its operation. The device
> >> +context of a member device is read or written by the owner driver
> >> +using administration commands.
> >> +
> >> +\begin{lstlisting}
> >> +struct virtio_dev_ctx_field_tlv {
> >> +        le32 type;
> >> +        le32 reserved;
> >> +        le64 length;
> >> +        u8 value[];
> >> +};
> >> +
> >> +struct virtio_dev_ctx {
> >> +        le32 field_count;
> >> +        struct virtio_dev_ctx_field_tlv fields[]; };
> >> +
> >> +\end{lstlisting}
> so this still doesn't work for nested

One nesting use case that we came across is:
there is a large host_VM which is hosting other guest_VMs.
In such a case, the owner PF is passed through to this host_VM, and the currently proposed scheme continues to function for the nested guest_VMs as well.

In the second use case, where one wants to bind only one member device to one VM,
I think the same plumbing can be extended to have another VF take the role of the migration device instead of the owner device.

I don't see a good way to pass through and also do in-band migration without a lot of device-specific trap and emulation.
I also don't know the CPU performance numbers with 3 levels of nested page table translation, which to my understanding cannot be accelerated by current CPUs.
Do you know how it works for Intel x86_64?
Can it do > 2 levels of nested page tables? If not, what perf characteristics should we expect?

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-09  4:15     ` Parav Pandit
@ 2023-10-09 15:54       ` Michael S. Tsirkin
  2023-10-09 17:22         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-09 15:54 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Mon, Oct 09, 2023 at 04:15:01AM +0000, Parav Pandit wrote:
> Hi Michael,
> 
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Michael S. Tsirkin
> > Sent: Sunday, October 8, 2023 5:12 PM
> 
> [..]
> > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline type & Name &
> > > +Description \\ \hline \hline
> > > +0x0 & VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG & Provides common
> > > +configuration space of device for PCI transport \\ \hline
> > > +0x1 & VIRTIO_DEV_CTX_DEV_CFG_LAYOUT & Provides device specific
> > > +configuration layout \\ \hline
> > > +0x2 & VIRTIO_DEV_CTX_DEV_FEATURES & Provides device features \\
> > > +\hline
> > > +0x3 & VIRTIO_DEV_CTX_PCI_VQ_CFG & Provides Virtqueue configuration
> > > +for PCI transport \\ \hline
> > > +0x4 & VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG & Provides Queue run
> > time
> > > +state \\ \hline
> > > +0x5 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of
> > > +virtqueue descriptors owned by device  \\ \hline
> > > +0x6 - 0xFFFFFFFF & - & Reserved for future types \\ \hline
> > > +\end{tabularx}
> > 
> > 
> > I don't think this is enough, e.g. virtio net has internal state
> > controlled thought CVQ commands. how do you intend to address/migrate
> > these?
> >
> Post this series, the 32-bit type field will be split into two ranges.
> First range (existing) to cover common content across all device type.
> Second range to contain device specific content, containing non internal fields such as fields setup by the guest directly over CVQ.

How will all this be added though? You probably have a clear picture in
your head but I (and likely other tc members) don't.

> > > +\subsubsection{Device Context Fields}\label{sec:Basic Facilities of a Virtio
> > Device / Device Context / Device Context Fields}
> > > +
> > > +\paragraph{PCI Common Configuration Context}
> > > +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context
> > Fields/ PCI Common Configuration Context}
> > > +
> > > +For the field VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG, \field{type}
> > is set to 0x0.

Not sure what RUNTIME does here.

> > > +The \field{value} is in format of \field{struct virtio_pci_common_cfg}.
> > > +The \field{length} is the length of \field{struct virtio_pci_common_cfg}.
> > > +
> > > +\paragraph{Device Configuration Layout Context}
> > > +\label{par:Basic Facilities of a Virtio Device / Device Context / Device Context
> > Fields/ Device Configuration Layout Context}
> > > +
> > > +For the field VIRTIO_DEV_CTX_DEV_CFG_LAYOUT, \field{type} is set to 0x1.

This name is quite confusing. I see now you just mean this is device
config?

> > > +The \field{value} is in format of device specific configuration layout listed
> > > +in each of the device's device configuration layout section.
> > > +The \field{length} is the length of the device configuration layout data.
> > 
> > Unclear. I am guessing it's doing things like setting up RO
> > fields? This needs to be specified per device really.
> > Also how some fields behave might depend on features.
> In practice fields in this area do not change a lot, but it can for example the link status/speed of net device.
> So it is not RO per say.
> 
> Regarding it be device specific or just a config_length blob, I think config_length blob is just fine for device_context use.
> This is because there isn’t a need for migration driver to parse any of these fields.

Which driver parses what in your current stack is immaterial.  We need
to document all content, just a length is not going to work.
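
For reference while discussing this, here is a minimal C sketch of walking the quoted struct virtio_dev_ctx layout without interpreting the values. It rests on assumptions the quoted text does not state: the fields are taken to be packed back to back with no padding after field_count or between TLVs, and the helper names (get_le32(), get_le64(), count_ctx_fields()) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Little-endian decode helpers, independent of host byte order. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t get_le64(const uint8_t *p)
{
    return (uint64_t)get_le32(p) | (uint64_t)get_le32(p + 4) << 32;
}

/* le32 type + le32 reserved + le64 length */
#define TLV_HDR_LEN 16u

/* Walk every TLV using only its length. Returns the number of fields,
 * or -1 if the buffer is truncated or a length runs past the end. */
static int count_ctx_fields(const uint8_t *buf, size_t len)
{
    if (len < 4)
        return -1;

    uint32_t field_count = get_le32(buf);
    size_t off = 4;

    for (uint32_t i = 0; i < field_count; i++) {
        if (len - off < TLV_HDR_LEN)
            return -1;
        uint64_t vlen = get_le64(buf + off + 8);
        off += TLV_HDR_LEN;
        if (vlen > len - off)
            return -1;
        off += (size_t)vlen;
    }
    return (int)field_count;
}
```

A walker like this can validate and skip fields by length alone, but it says nothing about what each value means, which is the part that still has to be documented per field type.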

-- 
MST



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 7/8] admin: Add write recording commands
  2023-10-09 11:48         ` Parav Pandit
@ 2023-10-09 16:15           ` Michael S. Tsirkin
  2023-10-09 17:22             ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-09 16:15 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Mon, Oct 09, 2023 at 11:48:46AM +0000, Parav Pandit wrote:
> 
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Michael S. Tsirkin
> > Sent: Monday, October 9, 2023 4:28 PM
> 
> > > May be we can use a different verb than record. May be tracking?
> > > Record means, device accumulates these written pages addresses, and when
> > driver queries, it returns these addresses and clears it internally within the
> > device.
> > 
> > For example, what about two writes into same address. Do you get one record
> > or two? What if the length is different?
> One record.
> Since the writes are tracked/recorded at page granularity, even if length is different, it still one entry.

What if writes cover different pages but overlap?
Don't bother answering here; I am implying this needs to be documented.

We could give devices a bit of leeway here too, e.g. explain that
a device can combine two writes into one record but does not have to;
that might lead to better data structures internally.


> > 
> > 
> > > > > Such an
> > > > > +address is named as IO virtual address (IOVA).
> > > >
> > > > I don't know what does this have to do with IOVA. For that matter
> > > > everything would have to be "IOVA". Spec calls these physical
> > > > address and let's stick to that.
> > > >
> > > Make sense.
> > > At device level it does not have knowledge of IOVA.
> > > I will rename it.
> > >
> > > >
> > > > > The owner driver enables write
> > > > > +recording for one or more IOVA ranges per device during device
> > > > > +migration flow. The owner driver periodically queries these
> > > > > +written IOVA records from the device.
> > > >
> > > > periodical reads without any indication are the only option then?
> > > >
> > > At least for now, this is starting point. Software stack such as QEMU does it
> > periodically.
> > 
> > So for CPU kvm switched to PML and this seems to work better - it guarantees
> > there's convergence.
> > 
> My guess that PML is better because write protection related faults are removed now.
> The approach here is similar, but there is no PML kind of queue.
> It is relatively manageable to slow down cpu on VMEXIT or other ways, compare to external network.
> 
> And secondly, it is driven by the hypervisor cpu availability for capacity planning etc.
> So its periodic in nature that share similar scheme like PML.

With PML the entries are recorded in two places: the dirty bit in the PTE, plus the log.
If the log fills up there's an exit. I don't exactly get where you expect
the device to record this apparently unbounded log.

> 
> > > When new use case arise, may be it can be extended.
> > >
> > > > > As the driver reads the written IOVA records,
> > > > > +the device clears those records from the device. Once the device
> > > > > +reports zero or small number of written IOVA records, the device
> > > > > +is set to \field{Stop} or \field{Freeze} mode. Once the device is
> > > > > +set to \field{Stop}
> > > > >  or \field{Freeze} mode, and once all the IOVA records are read,
> > > > > the driver stops  the write recording in the device.
> > > >
> > > >
> > > > it is not great that you are rewriting text you just wrote in patch
> > > > 1 here. pls find a way not to make reviewers read everything twice.
> > > >
> > > There is small duplication of one line explaining mode change, rest is
> > contextual to the write recording.
> > > Merging with text of patch_1 was slightly complicated to read, so one
> > sentence is duplicated.
> > >
> > > I will again check if patch_1 text extension is easier to read.
> > 
> > this is latex, space is ignored. if you are only changing one word just don't move
> > the rest and diff will look sane
> > 
> Yeah, right. I will fix this.
> 
> > > > > @@ -118,6 +119,10 @@ \subsubsection{Device
> > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /  \item
> > > > > Device Context Read Command \item Device Context Write Command
> > > > > \item Device Context Discard Command
> > > > > +\item Device Write Record Capabilities Query Command \item Device
> > > > > +Write Records Start Command \item Device Write Records Stop
> > > > > +Command \item Device Write Records Read Command
> > > > >  \end{enumerate}
> > > > >
> > > > >  These commands are currently only defined for the SR-IOV group type.
> > > > > @@ -307,6 +312,129 @@ \subsubsection{Device
> > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > > discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command
> > > > > writes a new device
> > > > context.
> > > > >
> > > > > +\paragraph{Device Write Record Capabilities Query Command}
> > > > > +\label{par:Basic Facilities of a Virtio Device / Device groups /
> > > > > +Group administration commands / Device Migration / Device Write
> > > > > +Record Capabilities Query Command}
> > > > > +
> > > > > +This command reads the device write record capabilities.
> > > > > +For the command
> > VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
> > > > > +\field{opcode} is set to 0xd.
> > > > > +The \field{group_member_id} refers to the member device to be
> > accessed.
> > > > > +
> > > > > +\begin{lstlisting}
> > > > > +struct virtio_admin_cmd_dev_write_record_cap_result {
> > > > > +        le32 supported_iova_page_size_bitmap;
> > > > > +        le32 supported_iova_ranges; }; \end{lstlisting}
> > > > > +
> > > > > +When the command completes successfully,
> > > > > +\field{command_specific_result} is in the format \field{struct
> > > > > +virtio_admin_cmd_dev_write_record_cap_result}
> > > > > +returned by the device. The
> > > > > +\field{supported_iova_page_size_bitmap}
> > > > > +indicates the granularity at which the device can record IOVA ranges.
> > > > > +the minimum granularity can be 4KB. Bit 0 corresponds to 4KB, bit
> > > > > +1 corresponds to 8KB, bit 31 corresponds to 4TB. The device
> > > > > +supports at least
> > > > one page granularity.
> > > > > +The device support one or more IOVA page granularity; for each
> > > > > +IOVA page granularity, the device sets corresponding bit in the
> > > > > +\field{supported_iova_page_size_bitmap}. The
> > > > > +\field{supported_iova_ranges} indicates how many unique (non
> > > > > +overlapping) IOVA ranges can be recorded by the device.
> > > >
> > > > what role does this granularity play? i see no mention of it down the road.
> > > >
> > > The page_size in struct virtio_admin_cmd_write_record_start_data must
> > match to the granularity supplied above.
> > > I missed it. Will add in v2.
> > > This is very useful comment.
> > 
> > Not that it's very clear what does page_size do.
> > 
> Page_size is the granularity on which to record the writes.
> For example, when page_size = 2MB, any writes are aligned to 2MB page boundary.
> If 8KB data is written, only single write record entry reported.
> 
> If the page_size = 4K, two write record entries reported.
> ( I assumed 4K aligned address to keep the example simple).


But you also said internally device maintains a bitmap.
So it will have to work hard to find a set bit in the map then?
Do we want to maybe give device an option to just return
a bitmap and have driver worry about it?

To me this looks like an optimization for when devices keep writing
to the same page all the time?  Do you have data to show that's commonly
the case?  Instrumenting a driver would be one way to find out.

> > > >
> > > > > +
> > > > > +\paragraph{Device Write Records Start Command} \label{par:Basic
> > > > > +Facilities of a Virtio Device / Device groups / Group
> > > > > +administration commands / Device Migration / Device Write Records
> > > > > +Start Command}
> > > > > +
> > > > > +This command starts the write recording in the device for the
> > > > > +specified IOVA ranges.
> > > > > +
> > > > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START,
> > > > > +\field{opcode} is set to 0xe.
> > > > > +The \field{group_member_id} refers to the member device to be
> > accessed.
> > > > > +
> > > > > +The \field{command_specific_data} is in the format \field{struct
> > > > > +virtio_admin_cmd_write_record_start_data}.
> > > > > +
> > > > > +\begin{lstlisting}
> > > > > +struct virtio_admin_cmd_write_record_start_entry {
> > > > > +        le64 iova;
> > > > > +        le64 page_count;
> > > > > +};
> > > > > +
> > > > > +struct virtio_admin_cmd_write_record_start_data {
> > > > > +        le64 page_size;
> > > > > +        le32 count;
> > > > > +        u8 reserved[4];
> > > > > +        struct virtio_admin_cmd_write_record_start_entry
> > > > > +entries[]; };
> > > > > +
> > > > > +\end{lstlisting}
> > > > > +
> > > > > +The \field{count} is set to indicate number of valid \field{entries}.
> > > > > +The \field{iova} indicates the start IOVA address. The
> > > > > +\field{page_count} indicates number of pages of size
> > > > > +\field{page_size} starting from \field{iova} to record for write
> > > > > +reporting. VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > > > > +command contains unique i.e. non overlapping IOVA range entries.
> > > > > +Whenever a memory write occurs by the device in the supplied IOVA
> > > > > +range, the device records the actual IOVA and number of bytes
> > > > > +written to the
> > > > IOVA.
> > > > > +These write records can be read by the the driver using
> > > > > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command.
> > > > > +
> > > > > +This command has no command specific result.
> > > > > +
> > > > > +\paragraph{Device Write Record Stop Command} \label{par:Basic
> > > > > +Facilities of a Virtio Device / Device groups / Group
> > > > > +administration commands / Device Migration / Device Write Record
> > > > > +Stop Command}
> > > > > +
> > > > > +This command stops the write recording in the device for IOVA
> > > > > +ranges which were previously started using
> > > > > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > > > > +command.
> > > > > +
> > > > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP,
> > > > > +\field{opcode} is set to 0xf.
> > > > > +The \field{group_member_id} refers to the member device to be
> > accessed.
> > > > > +
> > > > > +This command does not have any command specific data.
> > > > > +This command has no command specific result.
> > > > > +
> > > > > +\paragraph{Device Write Records Read Command} \label{par:Basic
> > > > > +Facilities of a Virtio Device / Device groups / Group
> > > > > +administration commands / Device Migration / Device Write Records
> > > > > +Read Command}
> > > > > +
> > > > > +This command reads the device write records for which the write
> > > > > +recording is previously started using
> > > > VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
> > > > > +
> > > > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ,
> > > > > +\field{opcode} is set to 0x10.
> > > > > +The \field{group_member_id} refers to the member device to be
> > accessed.
> > > > > +
> > > > > +\begin{lstlisting}
> > > > > +struct virtio_admin_cmd_write_records_read_data {
> > > > > +        le64 iova;
> > > > > +        le64 length;
> > > > > +};
> > > > > +
> > > > > +struct virtio_admin_cmd_dev_write_records_cnt {
> > > > > +        le32 count;
> > > > > +};
> > > > > +
> > > > > +struct virtio_admin_cmd_dev_write_records_result {
> > > > > +        le64 iova_entries[];
> > > > > +};
> > > > > +\end{lstlisting}
> > > > > +
> > > > > +The \field{command_specific_data} is in the format \field{struct
> > > > > +virtio_admin_cmd_write_records_read_data}. The driver sets the
> > > > > +\field {iova} indicating the start IOVA address for up to the
> > > > > +\field{length} number of bytes. The supplied IOVA range same or
> > > > > +smaller than the range supplied when write recording is started
> > > > > +by the driver in VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > command.
> > > >
> > > > Seems pretty sparse. Lots of hypervisors chose to implement a bit
> > > > per page strategy.
> > > This command result addresses will feed into such bit.
> > 
> > Do we want to return it in this format then?
> I think yes, because converting to the bit is easy.
> Reporting bit requires bitmap being function of VM memory and not based on amount of written pages.

So, you are optimizing for when small # of bits are set.
Do you have data to show it's common?

I would maybe structure it like this:

- bitmap used at all times
- log is maintained as long as it's not full
- when log fills up just bitmap is used.

and at this point, we can maybe start with just a bitmap
and add the log optimization separate, and optional?
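
A minimal sketch of the structure proposed above, under stated assumptions: the
names (dirty_tracker, tracker_record, tracker_log_usable) and the sizes NPAGES
and LOG_CAP are hypothetical and appear nowhere in the patch. The bitmap is
always updated; the log is appended to only until it fills up, after which only
the bitmap is trusted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES  1024u /* hypothetical number of tracked pages */
#define LOG_CAP 4u    /* hypothetical log capacity, tiny for illustration */

struct dirty_tracker {
    uint64_t bitmap[NPAGES / 64]; /* one bit per page, always maintained */
    uint32_t log[LOG_CAP];        /* page indices, valid while !overflowed */
    size_t   log_len;
    bool     overflowed;          /* once set, only the bitmap is trusted */
};

/* Record a device write that touched page index 'page'. */
static void tracker_record(struct dirty_tracker *t, uint32_t page)
{
    uint64_t mask = (uint64_t)1 << (page % 64);

    if (page >= NPAGES)
        return;                   /* outside the tracked range */
    if (t->bitmap[page / 64] & mask)
        return;                   /* already dirty: no new log entry */
    t->bitmap[page / 64] |= mask;

    if (t->overflowed)
        return;
    if (t->log_len == LOG_CAP) {
        t->overflowed = true;     /* log full: fall back to bitmap only */
        return;
    }
    t->log[t->log_len++] = page;
}

/* True when a reader can use the short log instead of scanning the bitmap. */
static bool tracker_log_usable(const struct dirty_tracker *t)
{
    return !t->overflowed;
}
```

With this shape the log is purely an optimization: dropping it leaves the
bitmap as a complete record of written pages.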




> > 
> > > >
> > > > > +
> > > > > +When the command completes successfully,
> > > > > +\field{command_specific_result} is in the format \field{struct
> > > > > +virtio_admin_cmd_dev_write_records_result}
> > > > > +and \field{command_specific_result} is in format of \field{struct
> > > > > +virtio_admin_cmd_dev_write_records_cnt} containing number of
> > > > > +write records returned by the device.
> > > >
> > > > what are these records though?
> > > >
> > > It is struct virtio_admin_cmd_dev_write_records_result.
> > > I will rephrase it to link to struct virtio_admin_cmd_dev_write_records_result.
> > >
> > > >
> > > > > When the command completes
> > > > > +successfully, the write records which are returned in the result
> > > > > +are cleared from the device and same records cannot be read
> > > > > +again. When new writes occur at same IOVA range or at different
> > > > > +once, those records can be read as new write records.
> > > >
> > > >
> > > > this last sentence just confuses.
> > > >
> > > How about just keeping below text rewrite?
> > > When the command completes successfully, the write records returned in the
> > result are cleared from the device.
> > 
> > I think we need to explain in more detail what exactly is expected to be
> > recorded and when.
> > 
> Ok. I will give second look now to improve this description.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-09 12:02     ` Parav Pandit
@ 2023-10-09 16:19       ` Michael S. Tsirkin
  2023-10-09 17:21         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-09 16:19 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Mon, Oct 09, 2023 at 12:02:54PM +0000, Parav Pandit wrote:
> Hi Jason,
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, October 9, 2023 2:19 PM
> > 
> > Adding LingShan.
> > 
> > Parav, if you want any specific people to comment, please do cc them.
> > 
> > On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > One or more passthrough PCI VF devices are ubiquitous for virtual
> > > machines usage using generic kernel framework such as vfio [1].
> > 
> > Mentioning a specific subsystem in a specific OS may mislead the user to think
> > it can only work in that setup. Let's not do that, virtio is not only used for Linux
> > and VFIO.
> This is just one example on how these commands are useful.
> It can be useful in more ways too in more OSes too.
> I will drop from the patch commit log and keep as information purpose in cover letter.
> Would that work for you?
> 
> I don’t have any strong opinion to keep it or remove it as most stakeholders has the clear view of requirements now.
> Let me know.

So some people use VFs with VFIO. Hence the module name.  This sentence
by itself seems to have zero value for the spec. Just drop it.

-- 
MST



* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-09 16:19       ` Michael S. Tsirkin
@ 2023-10-09 17:21         ` Parav Pandit
  2023-10-10  8:57           ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-09 17:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, October 9, 2023 9:50 PM

> > > > One or more passthrough PCI VF devices are ubiquitous for virtual
> > > > machines usage using generic kernel framework such as vfio [1].
> > >
> > > Mentioning a specific subsystem in a specific OS may mislead the
> > > user to think it can only work in that setup. Let's not do that,
> > > virtio is not only used for Linux and VFIO.
> > This is just one example on how these commands are useful.
> > It can be useful in more ways too in more OSes too.
> > I will drop from the patch commit log and keep as information purpose in
> cover letter.
> > Would that work for you?
> >
> > I don’t have any strong opinion to keep it or remove it as most stakeholders
> has the clear view of requirements now.
> > Let me know.
> 
> So some people use VFs with VFIO. Hence the module name.  This sentence by
> itself seems to have zero value for the spec. Just drop it.
Ok. Will drop.


* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-09 15:54       ` Michael S. Tsirkin
@ 2023-10-09 17:22         ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-09 17:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, October 9, 2023 9:25 PM
> 
> On Mon, Oct 09, 2023 at 04:15:01AM +0000, Parav Pandit wrote:
> > Hi Michael,
> >
> > > From: virtio-comment@lists.oasis-open.org
> > > <virtio-comment@lists.oasis- open.org> On Behalf Of Michael S.
> > > Tsirkin
> > > Sent: Sunday, October 8, 2023 5:12 PM
> >
> > [..]
> > > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline type & Name &
> > > > +Description \\ \hline \hline
> > > > +0x0 & VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG & Provides
> common
> > > > +configuration space of device for PCI transport \\ \hline
> > > > +0x1 & VIRTIO_DEV_CTX_DEV_CFG_LAYOUT & Provides device specific
> > > > +configuration layout \\ \hline
> > > > +0x2 & VIRTIO_DEV_CTX_DEV_FEATURES & Provides device features \\
> > > > +\hline
> > > > +0x3 & VIRTIO_DEV_CTX_PCI_VQ_CFG & Provides Virtqueue
> > > > +configuration for PCI transport \\ \hline
> > > > +0x4 & VIRTIO_DEV_CTX_VQ_SPLIT_RUNTIME_CFG & Provides Queue run
> > > time
> > > > +state \\ \hline
> > > > +0x5 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of
> > > > +virtqueue descriptors owned by device  \\ \hline
> > > > +0x6 - 0xFFFFFFFF & - & Reserved for future types \\ \hline
> > > > +\end{tabularx}
> > >
> > >
> > > I don't think this is enough, e.g. virtio net has internal state
> > > controlled thought CVQ commands. how do you intend to
> > > address/migrate these?
> > >
> > Post this series, the 32-bit type field will be split into two ranges.
> > First range (existing) to cover common content across all device type.
> > Second range to contain device specific content, containing non internal fields
> such as fields setup by the guest directly over CVQ.
> 
> How will all this be added though? You probably have a clear picture in your
> head but I (and likely other tc members) don't.
>
An example:
Types 0 to 0xffff are reserved for content common to all device types.
Types 0x1_0000 to 0x1_ffff are per device type.
So, say, for a net device:
0x1_0000 represents the RSS configuration.
0x1_0001 represents the flow filter configuration.
0x1_0002 represents stats.

For a crypto device:
0x1_0000 represents session information.

Etc.
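
As a rough sketch only, the range split in this example could be expressed as
below. The macro, enum, and function names are hypothetical, and the boundaries
are the ones from the example in this message, not from the spec or the patch.

```c
#include <stdint.h>

/* Hypothetical boundaries taken from the example in this message. */
#define DEV_CTX_COMMON_LAST  0xffffu  /* 0 .. 0xffff: common to all devices */
#define DEV_CTX_DEVTYPE_LAST 0x1ffffu /* 0x1_0000 .. 0x1_ffff: per device type */

enum dev_ctx_class {
    DEV_CTX_CLASS_COMMON,
    DEV_CTX_CLASS_DEVICE_SPECIFIC,
    DEV_CTX_CLASS_RESERVED,
};

/* Classify a 32-bit device context field type into one of the ranges. */
static enum dev_ctx_class dev_ctx_classify(uint32_t type)
{
    if (type <= DEV_CTX_COMMON_LAST)
        return DEV_CTX_CLASS_COMMON;
    if (type <= DEV_CTX_DEVTYPE_LAST)
        return DEV_CTX_CLASS_DEVICE_SPECIFIC;
    return DEV_CTX_CLASS_RESERVED;
}
```

In this reading the same value, e.g. 0x1_0000, means different things for a net
device and a crypto device, so a parser also needs the device type.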
 
> > > > +\subsubsection{Device Context Fields}\label{sec:Basic Facilities
> > > > +of a Virtio
> > > Device / Device Context / Device Context Fields}
> > > > +
> > > > +\paragraph{PCI Common Configuration Context} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device Context / Device Context
> > > Fields/ PCI Common Configuration Context}
> > > > +
> > > > +For the field VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG,
> \field{type}
> > > is set to 0x0.
> 
> Not sure what does RUNTIME do here.
> 
RUNTIME is just a marker in the name to denote fields that are more likely to change at run time, as opposed to config fields, whose names do not carry it.

> > > > +The \field{value} is in format of \field{struct virtio_pci_common_cfg}.
> > > > +The \field{length} is the length of \field{struct virtio_pci_common_cfg}.
> > > > +
> > > > +\paragraph{Device Configuration Layout Context} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device Context / Device Context
> > > Fields/ Device Configuration Layout Context}
> > > > +
> > > > +For the field VIRTIO_DEV_CTX_DEV_CFG_LAYOUT, \field{type} is set to
> 0x1.
> 
> This name is quite confusing. I see now you just mean this is device config?
>
Yes, the "device configuration layout fields".
So just do DEV_CFG?
 
> > > > +The \field{value} is in format of device specific configuration
> > > > +layout listed in each of the device's device configuration layout section.
> > > > +The \field{length} is the length of the device configuration layout data.
> > >
> > > Unclear. I am guessing it's doing things like setting up RO fields?
> > > This needs to be specified per device really.
> > > Also how some fields behave might depend on features.
> > In practice fields in this area do not change a lot, but it can for example the
> link status/speed of net device.
> > So it is not RO per say.
> >
> > Regarding it be device specific or just a config_length blob, I think
> config_length blob is just fine for device_context use.
> > This is because there isn’t a need for migration driver to parse any of these
> fields.
> 
> Which driver parses what in your current stack is immaterial.  We need to
> document all content, just a length is not going to work.

Sure it is immaterial.
The length covers the length of the dev config space data.
I will add the example containing the link.
Since we want to avoid duplicating the content of device config layout, the example will help to make this very clear.




* RE: [virtio-comment] Re: [PATCH v1 7/8] admin: Add write recording commands
  2023-10-09 16:15           ` Michael S. Tsirkin
@ 2023-10-09 17:22             ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-09 17:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, October 9, 2023 9:46 PM
> 
> On Mon, Oct 09, 2023 at 11:48:46AM +0000, Parav Pandit wrote:
> >
> > > From: virtio-comment@lists.oasis-open.org
> > > <virtio-comment@lists.oasis- open.org> On Behalf Of Michael S.
> > > Tsirkin
> > > Sent: Monday, October 9, 2023 4:28 PM
> >
> > > > May be we can use a different verb than record. May be tracking?
> > > > Record means, device accumulates these written pages addresses,
> > > > and when
> > > driver queries, it returns these addresses and clears it internally
> > > within the device.
> > >
> > > For example, what about two writes into same address. Do you get one
> > > record or two? What if the length is different?
> > One record.
> > Since the writes are tracked/recorded at page granularity, even if length is
> different, it still one entry.
> 
> what if writes cover different pages but overlap?
From the device's point of view, if a write spans two pages (as determined by page_size), it produces two write records.
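
To make the page-granular counting concrete, here is a small sketch. The
function name write_record_count is hypothetical; it assumes page_size is a
nonzero power of two and that addr + len does not overflow.

```c
#include <stdint.h>

/*
 * Number of page-granular write records produced by a single write of
 * 'len' bytes starting at 'addr', when recording at 'page_size'.
 */
static uint64_t write_record_count(uint64_t addr, uint64_t len,
                                   uint64_t page_size)
{
    uint64_t first, last;

    if (len == 0)
        return 0;
    first = addr / page_size;             /* page holding the first byte */
    last  = (addr + len - 1) / page_size; /* page holding the last byte */
    return last - first + 1;
}
```

For the 8KB example from earlier in the thread, an aligned write gives one
record at 2MB page size and two records at 4KB page size; an unaligned 200-byte
write crossing a 4KB boundary also gives two.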

> Don't bother answering here, I am implying this needs to be documented.
> 
> We could give devices a bit of a freeway here too, e.g. explain that device can
> combine two writes into one record but does not have to - might lead to better
> data structures internally.
> 
The device can do so if the write record entry also carries the written length.
Since known hypervisors work at a constant page size per guest, supplying the length was not useful enough.

> 
> > >
> > >
> > > > > > Such an
> > > > > > +address is named as IO virtual address (IOVA).
> > > > >
> > > > > I don't know what does this have to do with IOVA. For that
> > > > > matter everything would have to be "IOVA". Spec calls these
> > > > > physical address and let's stick to that.
> > > > >
> > > > Make sense.
> > > > At device level it does not have knowledge of IOVA.
> > > > I will rename it.
> > > >
> > > > >
> > > > > > The owner driver enables write
> > > > > > +recording for one or more IOVA ranges per device during
> > > > > > +device migration flow. The owner driver periodically queries
> > > > > > +these written IOVA records from the device.
> > > > >
> > > > > periodical reads without any indication are the only option then?
> > > > >
> > > > At least for now, this is starting point. Software stack such as
> > > > QEMU does it
> > > periodically.
> > >
> > > So for CPU kvm switched to PML and this seems to work better - it
> > > guarantees there's convergence.
> > >
> > My guess that PML is better because write protection related faults are
> removed now.
> > The approach here is similar, but there is no PML kind of queue.
> > It is relatively manageable to slow down cpu on VMEXIT or other ways,
> compare to external network.
> >
> > And secondly, it is driven by the hypervisor cpu availability for capacity
> planning etc.
> > So its periodic in nature that share similar scheme like PML.
> 
> With PML the entries are recorded into two places: dirty bit in PTE+log.
> If log fills up there's an exit. I don't exactly get where do you expect the device
> to record this apparently unbounded log.
>
It is abstract enough from the spec point of view, like where the RSS config, the MAC table, or the VLAN tables are stored.
It is similar to that.

> >
> > > > When new use case arise, may be it can be extended.
> > > >
> > > > > > As the driver reads the written IOVA records,
> > > > > > +the device clears those records from the device. Once the
> > > > > > +device reports zero or small number of written IOVA records,
> > > > > > +the device is set to \field{Stop} or \field{Freeze} mode.
> > > > > > +Once the device is set to \field{Stop}
> > > > > >  or \field{Freeze} mode, and once all the IOVA records are
> > > > > > read, the driver stops  the write recording in the device.
> > > > >
> > > > >
> > > > > it is not great that you are rewriting text you just wrote in
> > > > > patch
> > > > > 1 here. pls find a way not to make reviewers read everything twice.
> > > > >
> > > > There is small duplication of one line explaining mode change,
> > > > rest is
> > > contextual to the write recording.
> > > > Merging with text of patch_1 was slightly complicated to read, so
> > > > one
> > > sentence is duplicated.
> > > >
> > > > I will again check if patch_1 text extension is easier to read.
> > >
> > > this is latex, space is ignored. if you are only changing one word
> > > just don't move the rest and diff will look sane
> > >
> > Yeah, right. I will fix this.
> >
> > > > > > @@ -118,6 +119,10 @@ \subsubsection{Device
> > > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > > > \item Device Context Read Command \item Device Context Write
> > > > > > Command \item Device Context Discard Command
> > > > > > +\item Device Write Record Capabilities Query Command \item
> > > > > > +Device Write Records Start Command \item Device Write Records
> > > > > > +Stop Command \item Device Write Records Read Command
> > > > > >  \end{enumerate}
> > > > > >
> > > > > >  These commands are currently only defined for the SR-IOV group
> type.
> > > > > > @@ -307,6 +312,129 @@ \subsubsection{Device
> > > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > > > discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE
> command
> > > > > > writes a new device
> > > > > context.
> > > > > >
> > > > > > +\paragraph{Device Write Record Capabilities Query Command}
> > > > > > +\label{par:Basic Facilities of a Virtio Device / Device
> > > > > > +groups / Group administration commands / Device Migration /
> > > > > > +Device Write Record Capabilities Query Command}
> > > > > > +
> > > > > > +This command reads the device write record capabilities.
> > > > > > +For the command
> > > VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
> > > > > > +\field{opcode} is set to 0xd.
> > > > > > +The \field{group_member_id} refers to the member device to be
> > > accessed.
> > > > > > +
> > > > > > +\begin{lstlisting}
> > > > > > +struct virtio_admin_cmd_dev_write_record_cap_result {
> > > > > > +        le32 supported_iova_page_size_bitmap;
> > > > > > +        le32 supported_iova_ranges; }; \end{lstlisting}
> > > > > > +
> > > > > > +When the command completes successfully,
> > > > > > +\field{command_specific_result} is in the format
> > > > > > +\field{struct virtio_admin_cmd_dev_write_record_cap_result}
> > > > > > +returned by the device. The
> > > > > > +\field{supported_iova_page_size_bitmap}
> > > > > > +indicates the granularity at which the device can record IOVA ranges.
> > > > > > +the minimum granularity can be 4KB. Bit 0 corresponds to 4KB,
> > > > > > +bit
> > > > > > +1 corresponds to 8KB, bit 31 corresponds to 4TB. The device
> > > > > > +supports at least
> > > > > one page granularity.
> > > > > > +The device support one or more IOVA page granularity; for
> > > > > > +each IOVA page granularity, the device sets corresponding bit
> > > > > > +in the \field{supported_iova_page_size_bitmap}. The
> > > > > > +\field{supported_iova_ranges} indicates how many unique (non
> > > > > > +overlapping) IOVA ranges can be recorded by the device.
> > > > >
> > > > > what role does this granularity play? i see no mention of it down the
> road.
> > > > >
> > > > The page_size in struct virtio_admin_cmd_write_record_start_data
> > > > must
> > > match to the granularity supplied above.
> > > > I missed it. Will add in v2.
> > > > This is very useful comment.
> > >
> > > Not that it's very clear what does page_size do.
> > >
> > Page_size is the granularity on which to record the writes.
> > For example, when page_size = 2MB, any writes are aligned to 2MB page
> boundary.
> > If 8KB data is written, only single write record entry reported.
> >
> > If the page_size = 4K, two write record entries reported.
> > ( I assumed 4K aligned address to keep the example simple).
> 
> 
> But you also said internally device maintains a bitmap.
Hmm, no. Maybe my response was confusing.
I was referring to the bitmap of supported page sizes, not a bitmap of the written pages.
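
For reference, a sketch of how a driver might decode the supported page size
bitmap, assuming the encoding quoted above where bit 0 is 4KB and each higher
bit doubles the size. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Page size in bytes for a given bit, assuming bit n means 4KB << n. */
static uint64_t page_size_for_bit(unsigned int bit)
{
    return (uint64_t)4096 << bit;
}

/* True if 'page_size' is one of the sizes advertised in 'bitmap'. */
static bool page_size_supported(uint32_t bitmap, uint64_t page_size)
{
    unsigned int bit;

    for (bit = 0; bit < 32; bit++)
        if (page_size_for_bit(bit) == page_size)
            return (bitmap >> bit) & 1u;
    return false; /* not a size this bitmap can express */
}
```

Under this doubling assumption bit 9 is 2MB, bit 30 is 4TB, and bit 31 comes
out as 8TB, so the "bit 31 corresponds to 4TB" wording in the patch may be
worth double checking.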

> So it will have to work hard to find a set bit in the map then?
> Do we want to maybe give device an option to just return a bitmap and have
> driver worry about it?
> 
This would be an entirely different interface.
For a 64GB VM, at 2MB page size, there are 32K bits to process.
And at 4KB page size, there are 16M bits to process.
And this is regardless of how many pages were actually written.

Bigger VMs make this even more difficult.
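
The numbers above follow from one bit per page. A tiny sketch of the
arithmetic, with a hypothetical helper name:

```c
#include <stdint.h>

/* Number of bitmap bits needed to cover 'mem_bytes' at 'page_size'. */
static uint64_t bitmap_bits(uint64_t mem_bytes, uint64_t page_size)
{
    return (mem_bytes + page_size - 1) / page_size; /* round up */
}
```

For 64GB this gives 32768 bits (4KB of bitmap) at 2MB pages and 16777216 bits
(2MB of bitmap) at 4KB pages.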

> To me this looks like an optimization for when when devices keep writing to the
> same page all the time?  Do you have data to show that's commonly the case?
> Instrumenting a driver would be one way to find out.
>
Often, for devices which do not do zero copy, or which use the page cache of a block device, writes occur repeatedly on the same pages.

[..]
> > > Do we want to return it in this format then?
> > I think yes, because converting to the bit is easy.
> > Reporting bit requires bitmap being function of VM memory and not based
> on amount of written pages.
> 
> So, you are optimizing for when small # of bits are set.
The device can keep track of a large number of memory writes too.

> Do you have data to show it's common?
> 
> I would maybe structure it like this:
> 
> - bitmap used at all times
Bitmaps are fine for small VMs, as the number of bits is small.
A log is better when the VM has large memory, since the log is maintained only for the written pages instead of the full VM size.

> - log is maintained as long as it's not full
> - when log fills up just bitmap is used.
> 
> and at this point, we can maybe start with just a bitmap and add the log
> optimization separate, and optional?


* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-09 10:06     ` Parav Pandit
@ 2023-10-10  5:51       ` Jason Wang
  2023-10-10  7:19         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-10  5:51 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Mon, Oct 9, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, October 9, 2023 2:19 PM
> >
> > Adding LingShan.
> >
> Thanks for adding him.
>
> > Parav, if you want any specific people to comment, please do cc them.
> >
> Sure, will cc them in v2 as now I see there is interest in the review.
>
> > On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > One or more passthrough PCI VF devices are ubiquitous for virtual
> > > machines usage using generic kernel framework such as vfio [1].
> >
> > Mentioning a specific subsystem in a specific OS may mislead the user to think
> > it can only work in that setup. Let's not do that, virtio is not only used for Linux
> > and VFIO.
> >
> Not really. it is an example in the cover letter.
> It is not the only use case.
> A use case gives a crisp clarity of what UAPI it needs to fulfil.
> So I will keep it. It is anyway written as one use case.
>
> > >
> > > A passthrough PCI VF device is fully owned by the virtual machine
> > > device driver.
> >
> > Is this true? Even VFIO needs to mediate PCI stuff. Or how do you define
> > "passthrough" here?
> >
> Other than PCI config registers and due to some legacy, msix.
> The "device interface" side is not mediated.
> The definition of passthrough here is: To not mediate a device type specific and virtio specific interfaces for modern and future devices.

Ok, but what's the difference between "device type specific" and
"virtio specific interfaces". Maybe an example for this?

>
> > > This passthrough device controls its own device reset flow, basic
> > > functionality as PCI VF function level reset
> >
> > How about other PCI stuff? Or Why is FLR special?
> FLR is special for the readers to get the clarity that FLR is also done by the guest driver hence, the device migration commands do not interact/depend with FLR flow.

It's still not clear to me how this is done.

1) guest starts FLR
2) adminq freezes the VF
3) FLR is done

If the freezing doesn't wait for the FLR, does it mean we need to
migrate to a state like FLR is pending? If yes, do we need to migrate
the other sub states like this? If not, why?

>
> >
> > > and rest of the virtio device functionality such as control vq,
> >
> > What do you mean by "rest of"?
> >
> As given in the example cvq.
>
> > Which part is not controlled and why?
> Not controlled because as states, it is passthrough device.
>
> > > config space access, data path descriptors handling.
> > >
> > > Additionally, VM live migration using a precopy method is also widely used.
> >
> > Why is this mentioned here?
> >
> Huh. You should be positive for bringing clarity to the readers on understanding the use case.
> And you seem opposite, but ok.
>
> As stated, it for the reader to understand the use case and see how proposed commands addresses the use case.

The problem is that hardware features should, where possible, be
designed for general purpose use rather than for a specific
technology. The only missing part for post copy is the page fault.

>
> > >
> > > To support a VM live migration for such passthrough virtio devices,
> > > the owner PCI PF device administers the device migration flow.
> >
> > Well, if this is specific only to PCI SR-IOV, I'd move it to the PCI transport part.
> > But I guess not.
> We took the decision to not do so, for other group commands as well.
> After Michael's suggestion we moved it to group commands.
> So I will not debate this further.
>
> >
> > >
> > > This patch introduces the basic theory of operation which describes
> > > the flow and supporting administration commands.
> > >
> > > [1]
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/vfio.h?h=v6.1.47
> > >
> > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > ---
> > >  admin-cmds-device-migration.tex | 94
> > +++++++++++++++++++++++++++++++++
> > >  admin.tex                       |  1 +
> > >  2 files changed, 95 insertions(+)
> > >  create mode 100644 admin-cmds-device-migration.tex
> > >
> > > diff --git a/admin-cmds-device-migration.tex
> > > b/admin-cmds-device-migration.tex new file mode 100644 index
> > > 0000000..f839af4
> > > --- /dev/null
> > > +++ b/admin-cmds-device-migration.tex
> > > @@ -0,0 +1,94 @@
> > > +\subsubsection{Device Migration}\label{sec:Basic Facilities of a
> > > +Virtio Device / Device groups / Group administration commands /
> > > +Device Migration}
> > > +
> > > +In some systems, there is a need to migrate a running virtual machine
> > > +from one to another system. A running virtual machine has one or more
> > > +passthrough virtio member devices attached to it. A passthrough
> > > +device is entirely operated by the guest virtual machine. For
> > > +example, with the SR-IOV group type, group member (VF) may undergo
> > > +virtio device initialization and reset flow
> >
> > What do you mean by "reset flow"? It looks not like a terminology defined in the
> > PCI spec. And Google gives me nothing about this.
> >
> "reset flow" = virtio specification section 2.4 Device Reset flow.

My git repo shows it's still called "device reset", and I see you use
"FLR flow", which is also not very clear to me.

>
> > > and may also undergo PCI function level
> > > +reset(FLR) flow.
> >
> > Why is only FLR special here? I've asked FRS but you ignore the question.
> >
> FLR is special to bring clarity that guest owns the VF doing FLR, hence hypervisor cannot mediate any registers of the VF.

It's not about mediation at all, it's about how the device can
implement what you want here correctly.

See my above question.

>
> > > Such flows must comply to the PCI standard and also
> > > +virtio specification;
> >
> > This seems unnecessary and obvious as it applies to all other PCI and virtio
> > functionality.
> >
> Great. But your comment is contradicts.
>
> > What's more, for the things that need to be synchronized, I don't see any
> > descriptions in this patch. And if it doesn't need, why?
> With which operation should it be synchronized and why?
> Can you please be specific?

See my above question regarding FLR. And there may be others which I
haven't had time to audit.

>
> It is not written in this series, because we believe it must not be synchronized as it is fully controlled by the guest.
>
> >
> > > at the same time such flows must not obstruct
> > > +the device migration flow. In such a scenario, a group owner device
> > > +can provide the administration command interface to facilitate the
> > > +device migration related operations.
> > > +
> > > +When a virtual machine migrates from one hypervisor to another
> > > +hypervisor, these hypervisors are named as source and destination
> > hypervisor respectively.
> > > +In such a scenario, a source hypervisor administers the member device
> > > +to suspend the device and preserves the device context.
> > > +Subsequently, a destination hypervisor administers the member device
> > > +to setup a device context and resumes the member device. The source
> > > +hypervisor reads the member device context and the destination
> > > +hypervisor writes the member device context. The method to transfer
> > > +the member device context from the source to the destination hypervisor is
> > outside the scope of this specification.
> > > +
> > > +The member device can be in any of the three migration modes. The
> > > +owner driver sets the member device in one of the following modes during
> > device migration flow.
> > > +
> > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name &
> > > +Description \\ \hline \hline
> > > +0x0   & Active &
> > > +  It is the default mode after instantiation of the member device. \\
> >
> > I don't think we ever define "instantiation" anywhere.
> >
> Well a transport has implicit definition of the instantiation already.
> May be a text can be added, but don’t see a value in duplicating PCI spec here.

OK, maybe something like "transport specific instantiation".

>
> > > +\hline
> > > +0x1   & Stop &
> > > + In this mode, the member device does not send any notifications,
> > > +and it does not access any driver memory.
> >
> > What's the meaning of "driver memory"?
> >
> May be guest memory? Or do you suggest a better naming for the memory allocated by the guest driver?

Virtqueue?

>
> > And stop seems to be a source of inflight buffers.
> >
> I didn’t follow it.
> If you mean without stop there are no inflight buffer, then I don’t agree.
> We don’t want to violate the spec by having descriptors with zero size returned.
> Stop is not the source of inflight descriptors.

I think not, since you forbid access to the used ring here. So even if
a buffer was processed by the device, it can't be added back to the
used ring and thus becomes an inflight one.

>
> There are inflight descriptors with the device that are not yet returned to the driver, and device wont return them as zero size wrong completions.
>
> > > + The member device may receive driver notifications in this mode,
> >
> > What's the meaning of "receive"? For example if the device can still process
> > buffers, "stop" is not accurate.
> >
> Receive means, driver can send the notification as PCIe TLP that device may receive as incoming PCIe TLP.

Ok, so this is the transport level. But the device can keep processing
the queue?

>
> In "stop" mode, the device wont process descriptors.

If the device won't process descriptors, why still allow it to receive
notifications? Or does it really matter if the device can receive or
not here?

>
> > > + the member device context
> >
> > I don't think we define "device context" anywhere.
> >
> It is defined further in the description.

Like this?

"""
 +The member device has a device context which the owner driver can
 +either read or write. The member device context consist of any device
 +specific data which is needed by the device to resume its operation
 +when the device mode
"""

"Any" is probably too hard for vendors to implement. And in patch 3 I
only see virtio device context. Does this mean we don't need transport
(PCI) context at all? If yes, how can it work?

>
> > >and device configuration space may change. \\
> > > +\hline
> >
> > I still don't get why we need a "stop" state in the middle.
> >
> All pci devices which belong to a single guest VM are not stopped atomically.
> Hence, one device which is in freeze mode, may still receive driver notifications from other pci device,

Device may choose to ignore those notifications, no?

> or it may experience a read from the shared memory and get garbage data.

Could you give me an example of this?

> And things can break.
> Hence the stop mode, ensures that all the devices get enough chance to stop themselves, and later when freezed, to not change anything internally.
>
> > > +0x2   & Freeze &
> > > + In this mode, the member device does not accept any driver
> > > +notifications,
> >
> > This is too vague. Is the device allowed to be freezed in the middle of any virtio
> > or PCI operations?
> >
> > For example, in the middle of feature negotiation etc. It may cause
> > implementation specific sub-states which can't be migrated easily.
> >
> Yes. it is allowed in middle of feature negotiation, for sure.
> It is passthrough device, hence hypervisor layer do not get to see sub-state.
>
> Not sure why you comment, why it cannot be migrated easily.
> The device context already covers this sub-state.

1) driver writes driver_features
2) driver sets FEATURES_OK

3) device receives driver_features
4) device validates driver_features
5) device clears FEATURES_OK

6) driver reads status and realizes FEATURES_OK has been cleared

Is it valid to freeze at any point in the above? If yes, assuming we
are freezing between 2 and 3, what would a device context read give us
for driver_features?

>
> > And what's more, the above state machine seems to be virtio specific, but you
> > don't explain the interaction with the device status state machine.
> First, above is not a state machine.

So how do readers know if a state can go to another state and when?
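
To illustrate what I am asking for, here is a minimal sketch of a
transition table built only from the transitions mentioned so far in
this thread (Active to Stop, Stop to Freeze, Active to Freeze, and Stop
or Freeze back to Active). All names are hypothetical. Treating
Freeze to Stop and same-mode changes as disallowed is purely my
assumption, since the patch does not say either way.

```c
#include <stdbool.h>

/* Mode values as listed in the table of the patch. */
enum member_mode {
    MODE_ACTIVE = 0x0,
    MODE_STOP   = 0x1,
    MODE_FREEZE = 0x2,
    MODE_COUNT
};

/*
 * allowed[from][to]: only transitions mentioned in the thread are set.
 * Everything else (including Freeze -> Stop) defaults to false, which is
 * an assumption and not something the patch states.
 */
static const bool allowed[MODE_COUNT][MODE_COUNT] = {
    [MODE_ACTIVE] = { [MODE_STOP] = true,   [MODE_FREEZE] = true },
    [MODE_STOP]   = { [MODE_ACTIVE] = true, [MODE_FREEZE] = true },
    [MODE_FREEZE] = { [MODE_ACTIVE] = true },
};

/* Returns true if the owner driver may move the member from 'from' to 'to'. */
static bool mode_change_allowed(enum member_mode from, enum member_mode to)
{
    if ((unsigned int)from >= MODE_COUNT || (unsigned int)to >= MODE_COUNT)
        return false;
    return allowed[from][to];
}
```

Something equivalent in the spec text would let readers answer the
question without guessing.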

> Second, it is not virtio specific.

It is to some extent, for sure. For example, you said the device
context needs to be preserved. And as far as I can see, the device
context is all virtio specific in patch 3.

> It is present in leading OS that has fundamental requirement to support P2P devices.

If it's PCI specific, instead of trying to do a workaround in virtio,
why not invent a mechanism there?

> Third, it is not, interacing with the _actua_ device status.
>
> In "SUSPEND" patch-5, you already asked this question. I assume you asked again so that this series is complete.
>
> > For example,
> > what happens if the driver wants to reset but the device is in stop mode? You
> > told me it is addressed in your series but looks not. Once you try to describe
> > that, you're actually try to connect states between the two state machines.
> >
> As listed in the definition of the stop mode, the device do not act on the incoming writes, it only keep tracks of its internal device context change as part of this.

So only the driver notification is allowed, but not config write?
What's the consideration for allowing driver notification?

Let me ask differently, similar to FLR, what happens if the driver
wants a virtio reset but the hypervisor wants to stop or freeze?

> We would enrich the device context for this, but no need to connects the admin mode controlled by the owner device with operational state (device_status) owned by the member device.
>
> > > + it ignores any device configuration space writes,
> >
> > How about read and the device configuration changes?
> >
> As listed, device do not have any changes.
> So device configuration change cannot occur.

It's not necessarily caused by config write, it could be things like
link status or geometry changes that are initiated from the device.

>
> The device requirements cover this content more explicitly:
>
> For the SR-IOV group type, regardless of the member device mode, all the PCI transport level registers
> MUST be always accessible and the member device MUST function the same way for all the PCI transport
> level registers regardless of the member device mode.
>
> > > + the device do not have any changes in the device context. The member
> > > + device is not accessed in the system through the virtio interface.
> > > + \\
> >
> > But accessible via PCI interface?
> >
> Yes, as usual.
>
> > For example, what happens if we want to freeze during FLR? Does the
> > hypervisor need to wait for the FLR to be completed?
> >
> Hypervisor do not need wait for the FLR to be completed.

So does FLR change device context?

>
> > > +\hline
> > > +\hline
> > > +0x03-0xFF   & -    & reserved for future use \\
> > > +\hline
> > > +\end{tabularx}
> > > +
> > > +When the owner driver wants to stop the operation of the device, the
> > > +owner driver sets the device mode to \field{Stop}. Once the device is
> > > +in the \field{Stop} mode, the device does not initiate any
> > > +notifications or does not access any driver memory. Since the member
> > > +driver may be still active which may send further driver
> > > +notifications to the device, the device context may be updated. When
> > > +the member driver has stopped accessing the device, the owner driver
> > > +sets the device to \field{Freeze} mode indicating to the device that
> > > +no more driver access occurs. In the \field{Freeze} mode, no more
> > > +changes occur in the device context. At this point, the device ensures that
> > there will not be any update to the device context.
> >
> > What is missed here are:
> >
> > 1) it is a virtio specific states or not
> It is not.
>
> > 2) if it is a virtio specific state, if or how to synchronize with transport specific
> > interfaces and why
> > 3) can active go directly to freeze and why
> >
> Yes. don’t see a reason to not allow it.
> Active to freeze mode can change is useful on the destination side, where destination hypervisor knows for sure that there is no other entity accessing the device.
> And it needs to setup the device context, it received from the source side.
> So setting freeze mode can be done directly.
>
> > > +
> > > +The member device has a device context which the owner driver can
> > > +either read or write. The member device context consist of any device
> > > +specific data which is needed by the device to resume its operation
> > > +when the device mode
> >
> > This is too vague. There're states that are not suitable for cmd/queue for sure.
> > I'd split it into
> >
> > 1) common states: virtqueue, dirty pages
> > 2) device specific states: defined be each device
> >
> This is theory of operation section. So it capturing such details.
> Actual device context definition is outside of theory, and precise states of virtqueue, device specific, etc are in it.

See my comment above regarding the device context.

>
> > > +is changed from \field{Stop} to \field{Active} or from \field{Freeze}
> > > +to \field{Active}.
> > > +
> > > +Once the device context is read, it is cleared from the device.
> >
> > This is horrible, it means we can't easily
> >
> > 1) re-try the migration
> > 2) recover from migration failure
> >
> Can you please explain the flow?

When migration fails, management can choose to resume the device(VM)
on the source.

If the state is cleared, there is no simple way to resume the device
other than restoring the whole context.

What's the consideration for such clearing?
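
A minimal sketch of the concern, assuming the "cleared once read"
wording is taken literally. The struct, the helper names and the
64-byte limit are all hypothetical, and clearing the whole context even
on a short read is a simplification of mine:

```c
#include <stddef.h>
#include <string.h>

#define CTX_MAX 64 /* arbitrary size for this model */

/* Hypothetical model of the member device context held by the device. */
struct ctx_model {
    unsigned char data[CTX_MAX];
    size_t len;
};

/* Copies the context out and then clears it; returns bytes copied. */
static size_t ctx_read_and_clear(struct ctx_model *c,
                                 unsigned char *out, size_t out_len)
{
    size_t n = c->len < out_len ? c->len : out_len;

    memcpy(out, c->data, n);
    c->len = 0; /* context is gone from the device after the read */
    return n;
}

/* Writes a context back to the device; returns bytes stored. */
static size_t ctx_write(struct ctx_model *c,
                        const unsigned char *in, size_t in_len)
{
    size_t n = in_len < CTX_MAX ? in_len : CTX_MAX;

    memcpy(c->data, in, n);
    c->len = n;
    return n;
}
```

In this model a second read returns nothing, so resuming on the source
after a failed migration requires writing the whole saved context back
first.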

> And which software stack may find this useful?
> Is there any existing software that can utilize it?

Libvirt.

> Why that device context present with the software vanished, in your assumption, if it is?
>
> > > Typically, on
> > > +the source hypervisor, the owner driver reads the device context once
> > > +when the device is in \field{Active} or \field{Stop} mode and later
> > > +once the member device is in \field{Freeze} mode.
> >
> > Why need the read while device context could be changed? Or is the dirty page
> > part of the device context?
> >
> It is not part of the dirty page.
> It needs to read in the active/stop mode, so that it can be shared with destination hypervisor, which will pre-setup the complex context of the device, while it is still running on the source side.

Is such a method used by any hypervisor? If not, let's not describe
such an immature optimization in the spec.

>
> > > +
> > > +Typically, the device context is read and written one time on the
> > > +source and the destination hypervisor respectively once the device is
> > > +in \field{Freeze} mode. On the destination hypervisor, after writing
> > > +the device context, when the device mode set to \field{Active}, the
> > > +device uses the most recently set device context and resumes the device
> > operation.
> >
> > There's no context sequence, so this is obvious. It's the semantic of all other
> > existing interfaces.
> >
> Can you please what which existing interfaces do you mean here?

For any common cfg member, e.g. queue_addr.

The driver wrote 100 different values to queue_addr and the device
used the value written last.

>
> > > +
> > > +In an alternative flow, on the source hypervisor the owner driver may
> > > +choose to read the device context first time while the device is in
> > > +\field{Active} mode and second time once the device is in \field{Freeze}
> > mode.
> >
> > Who is going to synchronize the device context with possible configuration from
> > the driver?
> >
> Not sure I understand the question.
> If I understand you right, do you mean that,
> When configuration change is done by the guest driver, how does device context change?
>

Yes.

> If so, device context reading will reflect the new configuration.

How do you do that? For example:

static inline void vp_iowrite64_twopart(u64 val,
                                        __le32 __iomem *lo,
                                        __le32 __iomem *hi)
{
        vp_iowrite32((u32)val, lo);
        vp_iowrite32(val >> 32, hi);
}

Is it OK to be frozen between the two vp_iowrite32() calls?
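
To make the window concrete, here is a small user-space sketch. It is
not kernel code; struct fake_cfg and both helpers are hypothetical
names. It models a 64-bit field written as two 32-bit halves and
returns the value a context read would capture if it landed between
the two writes:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit field exposed as two 32-bit halves. */
struct fake_cfg {
    uint32_t lo;
    uint32_t hi;
};

/* Value a context read would capture from the two halves right now. */
static uint64_t snapshot(const struct fake_cfg *cfg)
{
    assert(cfg);
    return ((uint64_t)cfg->hi << 32) | cfg->lo;
}

/*
 * Same write order as the two-part write above, but a snapshot is taken
 * between the low-half and high-half writes, i.e. where a freeze could
 * land. Returns that intermediate snapshot.
 */
static uint64_t write64_with_midpoint_snapshot(struct fake_cfg *cfg,
                                               uint64_t val)
{
    uint64_t mid;

    assert(cfg);
    cfg->lo = (uint32_t)val;          /* first 32-bit write */
    mid = snapshot(cfg);              /* hypothetical freeze point */
    cfg->hi = (uint32_t)(val >> 32);  /* second 32-bit write */
    return mid;
}
```

The intermediate snapshot mixes the new low half with the old high
half, so it is a value the driver never intended to program.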

>
> > > Similarly, on the
> > > +destination hypervisor writes the device context first time while the
> > > +device is still running in \field{Active} mode on the source
> > > +hypervisor and writes the device context second time while the device is in
> > \field{Freeze} mode.
> > > +This flow may result in very short setup time as the device context
> > > +likely have minimal changes from the previously written device context.
> >
> > Is the hypervisor who is in charge of doing the comparison and writing only the
> > delta?
> >
> The spec commands allow to do so. So possibility exists from spec wise.

There are various optimizations for migration for sure, I don't think
mentioning any specific one is good.

Thanks


> In current proposal, there isn’t a need for hypervisor to do so at all.
>
> The destination side device gets to see the new device context and apply the delta.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10  5:51       ` Jason Wang
@ 2023-10-10  7:19         ` Parav Pandit
  2023-10-10 12:41           ` Michael S. Tsirkin
  2023-10-11  3:14           ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-10  7:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 10, 2023 11:21 AM
> 
> On Mon, Oct 9, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, October 9, 2023 2:19 PM
> > >
> > > Adding LingShan.
> > >
> > Thanks for adding him.
> >
> > > Parav, if you want any specific people to comment, please do cc them.
> > >
> > Sure, will cc them in v2 as now I see there is interest in the review.
> >
> > > On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > > One or more passthrough PCI VF devices are ubiquitous for virtual
> > > > machines usage using generic kernel framework such as vfio [1].
> > >
> > > Mentioning a specific subsystem in a specific OS may mislead the
> > > user to think it can only work in that setup. Let's not do that,
> > > virtio is not only used for Linux and VFIO.
> > >
> > Not really. it is an example in the cover letter.
> > It is not the only use case.
> > A use case gives a crisp clarity of what UAPI it needs to fulfil.
> > So I will keep it. It is anyway written as one use case.
> >
> > > >
> > > > A passthrough PCI VF device is fully owned by the virtual machine
> > > > device driver.
> > >
> > > Is this true? Even VFIO needs to mediate PCI stuff. Or how do you
> > > define "passthrough" here?
> > >
> > Other than PCI config registers and due to some legacy, msix.
> > The "device interface" side is not mediated.
> > > The definition of passthrough here is: To not mediate a device type specific and virtio specific interfaces for modern and future devices.
> 
> Ok, but what's the difference between "device type specific" and "virtio specific
> interfaces". Maybe an example for this?
> 
Virtio device specific means: cvq of crypto device, cvq of net device, flow filter vqs of net device etc.
Virtio specific interface: virtio driver notifications, virtio virtqueue and configuration mediation etc.

> >
> > > > This passthrough device controls its own device reset flow, basic
> > > > functionality as PCI VF function level reset
> > >
> > > How about other PCI stuff? Or Why is FLR special?
> > > FLR is special for the readers to get the clarity that FLR is also done by the guest driver hence, the device migration commands do not interact/depend with FLR flow.
> 
> It's still not clear to me how this is done.
> 
> 1) guest starts FLR
> 2) adminq freeze the VF
> 3) FLR is done
> 
> If the freezing doesn't wait for the FLR, does it mean we need to migrate to a
> state like FLR is pending? If yes, do we need to migrate the other sub states like
> this? If not, why?
> 
In most practical cases #2 followed by #1 should not happen, as on the source side the expected mode change is from active to stop.
But OK, since the active to freeze mode change is allowed, let's discuss the above.

A device is the single synchronization point for any device reset, FLR or admin command operation.
So, the migration driver does not need to wait for the FLR to complete.
When the admin command freezes the VF, it can expect a VF whose FLR has completed.
Secondly, since the FLR is local to the source, the intermediate sub-state does not migrate.

But I agree, it is worth having text that captures this.
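
As a rough sketch of the intended device-side behavior, under the assumption that the device serializes FLR and admin commands internally (the struct and function names below are hypothetical and not part of the proposal):

```c
#include <stdbool.h>

/* Hypothetical model of the VF state relevant to FLR and freeze. */
struct vf_model {
    bool flr_in_progress;
    bool frozen;
    unsigned int flr_completed_count;
};

/* Guest driver initiates FLR on its VF. */
static void vf_start_flr(struct vf_model *vf)
{
    vf->flr_in_progress = true;
}

/* Device completes a pending FLR, if any. */
static void vf_finish_flr(struct vf_model *vf)
{
    if (vf->flr_in_progress) {
        vf->flr_in_progress = false;
        vf->flr_completed_count++;
    }
}

/*
 * Admin freeze: the device completes any pending FLR before reporting
 * the freeze as done, so the owner driver never has to wait for the FLR
 * itself and never observes an "FLR pending" sub-state.
 */
static void vf_admin_freeze(struct vf_model *vf)
{
    vf_finish_flr(vf);
    vf->frozen = true;
}
```

In this model, a freeze issued after the guest started an FLR always returns with the FLR completed, which is the behavior the added text should describe.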

> >
> > >
> > > > and rest of the virtio device functionality such as control vq,
> > >
> > > What do you mean by "rest of"?
> > >
> > As given in the example cvq.
> >
> > > Which part is not controlled and why?
> > Not controlled because as states, it is passthrough device.
> >
> > > > config space access, data path descriptors handling.
> > > >
> > > > > Additionally, VM live migration using a precopy method is also widely used.
> > >
> > > Why is this mentioned here?
> > >
> > > Huh. You should be positive for bringing clarity to the readers on understanding the use case.
> > And you seem opposite, but ok.
> >
> > > As stated, it for the reader to understand the use case and see how proposed commands addresses the use case.
> 
> The problem is that the hardware features should be designed for a general
> purpose instead of a specific technology if it can. The only missing part for post
> copy is the page fault.
> 
OK. The use case and requirement of member device passthrough are clear to most reviewers now.
So I will remove it from the commit log.

> >
> > > >
> > > > To support a VM live migration for such passthrough virtio
> > > > devices, the owner PCI PF device administers the device migration flow.
> > >
> > > > Well, if this is specific only to PCI SR-IOV, I'd move it to the PCI transport part.
> > > But I guess not.
> > We took the decision to not do so, for other group commands as well.
> > After Michael's suggestion we moved it to group commands.
> > So I will not debate this further.
> >
> > >
> > > >
> > > > This patch introduces the basic theory of operation which
> > > > describes the flow and supporting administration commands.
> > > >
> > > > [1]
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/vfio.h?h=v6.1.47
> > > >
> > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > ---
> > > >  admin-cmds-device-migration.tex | 94
> > > +++++++++++++++++++++++++++++++++
> > > >  admin.tex                       |  1 +
> > > >  2 files changed, 95 insertions(+)  create mode 100644
> > > > admin-cmds-device-migration.tex
> > > >
> > > > diff --git a/admin-cmds-device-migration.tex
> > > > b/admin-cmds-device-migration.tex new file mode 100644 index
> > > > 0000000..f839af4
> > > > --- /dev/null
> > > > +++ b/admin-cmds-device-migration.tex
> > > > @@ -0,0 +1,94 @@
> > > > +\subsubsection{Device Migration}\label{sec:Basic Facilities of a
> > > > +Virtio Device / Device groups / Group administration commands /
> > > > +Device Migration}
> > > > +
> > > > +In some systems, there is a need to migrate a running virtual
> > > > +machine from one to another system. A running virtual machine has
> > > > +one or more passthrough virtio member devices attached to it. A
> > > > +passthrough device is entirely operated by the guest virtual
> > > > +machine. For example, with the SR-IOV group type, group member
> > > > +(VF) may undergo virtio device initialization and reset flow
> > >
> > > What do you mean by "reset flow"? It looks not like a terminology
> > > defined in the PCI spec. And Google gives me nothing about this.
> > >
> > "reset flow" = virtio specification section 2.4 Device Reset flow.
> 
> My git repo show it's still called "device reset" and I see you use "FLR flow"
> which is also not very clear to me.
> 
OK. I assume "reset flow" is clear to you now that it points to section 2.4.
This section is not a normative section, so using an extra word like "flow" does not confuse anyone.
I will link to the section anyway.

> >
> > > > and may also undergo PCI function level
> > > > +reset(FLR) flow.
> > >
> > > Why is only FLR special here? I've asked FRS but you ignore the question.
> > >
> > > FLR is special to bring clarity that guest owns the VF doing FLR, hence hypervisor cannot mediate any registers of the VF.
> 
> It's not about mediation at all, it's about how the device can implement what
> you want here correctly.
> 
> See my above question.
> 
OK. It is clear that live migration commands cannot stay on the member device because the member device can undergo device reset and FLR flows owned by the guest.
(The hypervisor is not involved in these two flows, hence the admin command interface is designed so that it can fulfil the above requirements.)

The theory of operation brings out this clarity. Please notice that it is in an introductory section with an example.
It is not a normative line.

> >
> > > > Such flows must comply to the PCI standard and also
> > > > +virtio specification;
> > >
> > > This seems unnecessary and obvious as it applies to all other PCI
> > > and virtio functionality.
> > >
> > Great. But your comment is contradicts.
> >
> > > What's more, for the things that need to be synchronized, I don't
> > > see any descriptions in this patch. And if it doesn't need, why?
> > With which operation should it be synchronized and why?
> > Can you please be specific?
> 
> See my above question regarding FLR. And it may have others which I haven't
> had time to audit.
> 
OK. When you get a chance to audit, let's discuss it then.

> >
> > > It is not written in this series, because we believe it must not be synchronized as it is fully controlled by the guest.
> >
> > >
> > > > at the same time such flows must not obstruct
> > > > +the device migration flow. In such a scenario, a group owner
> > > > +device can provide the administration command interface to
> > > > +facilitate the device migration related operations.
> > > > +
> > > > +When a virtual machine migrates from one hypervisor to another
> > > > +hypervisor, these hypervisors are named as source and destination
> > > hypervisor respectively.
> > > > +In such a scenario, a source hypervisor administers the member
> > > > +device to suspend the device and preserves the device context.
> > > > +Subsequently, a destination hypervisor administers the member
> > > > +device to setup a device context and resumes the member device.
> > > > +The source hypervisor reads the member device context and the
> > > > +destination hypervisor writes the member device context. The
> > > > +method to transfer the member device context from the source to
> > > > +the destination hypervisor is
> > > outside the scope of this specification.
> > > > +
> > > > +The member device can be in any of the three migration modes. The
> > > > +owner driver sets the member device in one of the following modes
> > > > +during
> > > device migration flow.
> > > > +
> > > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name &
> > > > +Description \\ \hline \hline
> > > > +0x0   & Active &
> > > > +  It is the default mode after instantiation of the member
> > > > +device. \\
> > >
> > > I don't think we ever define "instantiation" anywhere.
> > >
> > Well a transport has implicit definition of the instantiation already.
> > > May be a text can be added, but don’t see a value in duplicating PCI spec here.
> 
> Ok, maybe something like "transport specific instantiation"
> 
OK, that's good text. I will change to it.

> >
> > > > +\hline
> > > > +0x1   & Stop &
> > > > + In this mode, the member device does not send any notifications,
> > > > +and it does not access any driver memory.
> > >
> > > What's the meaning of "driver memory"?
> > >
> > > May be guest memory? Or do you suggest a better naming for the memory allocated by the guest driver?
> 
> Virtqueue?
> 
Virtqueue and any memory referred to by the virtqueue.

This is good text; I will change to it.

> >
> > > And stop seems to be a source of inflight buffers.
> > >
> > I didn’t follow it.
> > If you mean without stop there are no inflight buffer, then I don’t agree.
> > > We don’t want to violate the spec by having descriptors with zero size returned.
> > Stop is not the source of inflight descriptors.
> 
> I think not since you forbid access to the used ring here. So even if the buffer
> were processed by the device it can't be added back to the used ring thus
> became inflight ones.
> 
> >
> > > There are inflight descriptors with the device that are not yet returned to the driver, and device wont return them as zero size wrong completions.
> >
> > > > + The member device may receive driver notifications in this mode,
> > >
> > > What's the meaning of "receive"? For example if the device can still
> > > process buffers, "stop" is not accurate.
> > >
> > > Receive means, driver can send the notification as PCIe TLP that device may receive as incoming PCIe TLP.
> 
> Ok, so this is the transport level. But the device can keep processing the queue?
> 
Device cannot process the queue because it does not initiate any read/write towards the virtqueue.

> >
> > In "stop" mode, the device wont process descriptors.
> 
> If the device won't process descriptors, why still allow it to receive notifications?
Because a notification may still arrive, and as part of it the device may update counters that need to be migrated, or it may store the received notification.

> Or does it really matter if the device can receive or not here?
> 
From device point of view, the device is given the chance to update its device context as part of notifications or access to it.

> >
> > > > + the member device context
> > >
> > > I don't think we define "device context" anywhere.
> > >
> > It is defined further in the description.
> 
> Like this?
> 
> """
>  +The member device has a device context which the owner driver can  +either
> read or write. The member device context consist of any device  +specific data
> which is needed by the device to resume its operation  +when the device mode
> """
> 
Yes.
Further, patch 3 adds the device context and also adds a link to it in the theory of operation section, so the reader can read more detail about it.

> "Any" is probably too hard for vendors to implement. And in patch 3 I only see
> virtio device context. Does this mean we don't need transport
> (PCI) context at all? If yes, how can it work?
> 
Right. PCI member device is present at source and destination with its layout, only the virtio device context is transferred.
Which part cannot work?

> >
> > > >and device configuration space may change. \\
> > > > +\hline
> > >
> > > I still don't get why we need a "stop" state in the middle.
> > >
> > All pci devices which belong to a single guest VM are not stopped atomically.
> > Hence, one device which is in freeze mode, may still receive driver
> > notifications from other pci device,
> 
> Device may choose to ignore those notifications, no?
> 
> > or it may experience a read from the shared memory and get garbage data.
> 
> Could you give me an example for this?
> 
Section 2.10 Shared Memory Regions.

> > And things can break.
> > Hence the stop mode, ensures that all the devices get enough chance to stop
> themselves, and later when freezed, to not change anything internally.
> >
> > > > +0x2   & Freeze &
> > > > + In this mode, the member device does not accept any driver
> > > > +notifications,
> > >
> > > This is too vague. Is the device allowed to be freezed in the middle
> > > of any virtio or PCI operations?
> > >
> > > For example, in the middle of feature negotiation etc. It may cause
> > > implementation specific sub-states which can't be migrated easily.
> > >
> > Yes. it is allowed in middle of feature negotiation, for sure.
> > It is passthrough device, hence hypervisor layer do not get to see sub-state.
> >
> > Not sure why you comment, why it cannot be migrated easily.
> > The device context already covers this sub-state.
> 
> 1) driver writes driver_features
> 2) driver sets FEATURES_OK
> 
> 3) device receive driver_features
> 4) device validating driver_features
> 5) device clears FEATURES_OK
> 
> 6) driver reads status and realizes FEATURES_OK has been cleared
> 
> Is it valid to be frozen in the middle of the above?
No. The device mode is set to freeze only when the hypervisor is sure that the guest will make no more accesses.
What can happen between #2 and #3 is that the device mode changes to stop.
In stop mode, the device context would capture #4 or #5, depending on where the device is at that point.

> >
> > > And what's more, the above state machine seems to be virtio
> > > specific, but you don't explain the interaction with the device status state
> machine.
> > First, above is not a state machine.
> 
> So how do readers know if a state can go to another state and when?
>
Not sure what you mean by reader. Can you please explain.

> > Second, it is not virtio specific.
> 
> It's somehow for sure, for example you said device context need to be
> preserved. And as far as I see the device context is all virtio specific in patch 3.
> 
Sure, device context is virtio specific. :)
Device context will reflect if things changed in the stop mode.

> > It is present in leading OS that has fundamental requirement to support P2P
> devices.
> 
> If it's PCI specific, instead of trying to do a workaround in virtio, why not invent
> a mechanism there?
> 
It is not a workaround in virtio.
It is the way pci p2p devices work for which one needs to be receptive to handle the interaction.


> > Third, it is not interacting with the _actual_ device status.
> >
> > In "SUSPEND" patch-5, you already asked this question. I assume you asked
> again so that this series is complete.
> >
> > > For example,
> > > what happens if the driver wants to reset but the device is in stop
> > > mode? You told me it is addressed in your series but looks not. Once
> > > you try to describe that, you're actually try to connect states between the
> two state machines.
> > >
> > As listed in the definition of the stop mode, the device do not act on the
> incoming writes, it only keep tracks of its internal device context change as part
> of this.
> 
> So only the driver notification is allowed by not config write? What's the
> consideration for allowing driver notification?
> 
Because for most practical purposes, a peer device wants to queue blk, net, or other requests, and not do device configuration.

Do you know any device configuration space which is RW?
For net and blk I recall it as RO?

> Let me ask differently, similar to FLR, what happens if the driver wants a virtio
> reset but the hypervisor wants to stop or freeze?
> 
The device would respond to stop/freeze request when it has internally started the reset, as device is the single synchronization point which knows how to handle both in parallel.

> > We would enrich the device context for this, but no need to connects the
> admin mode controlled by the owner device with operational state
> (device_status) owned by the member device.
> >
> > > > + it ignores any device configuration space writes,
> > >
> > > How about read and the device configuration changes?
> > >
> > As listed, device do not have any changes.
> > So device configuration change cannot occur.
> 
> It's not necessarily caused by config write, it could be things like link status or
> geometry changes that are initiated from the device.
> 
I understand it. Link status was one example, you listed other examples too.
The point is, when in freeze mode, the member device is frozen, hence, device won't initiate those changes.
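
To make the mode semantics discussed so far concrete, here is a minimal C sketch. The numeric values come from the table in the patch; the enum and helper names are hypothetical and are not part of the proposal. It only encodes what the thread states: the device initiates accesses only in Active, may still receive driver notifications in Stop, and its context no longer changes in Freeze.

```c
#include <stdbool.h>

/* Mode values as listed in the patch table; identifier names are illustrative. */
enum member_dev_mode {
    MODE_ACTIVE = 0x0,
    MODE_STOP   = 0x1,
    MODE_FREEZE = 0x2,
};

/* Device-initiated activity (notifications to the driver, reads/writes of
 * virtqueues and memory they refer to) happens only in Active mode. */
static bool dev_may_initiate_access(enum member_dev_mode m)
{
    return m == MODE_ACTIVE;
}

/* Driver notifications may still be received in Active and Stop modes;
 * in Freeze mode the device does not accept them. */
static bool dev_may_receive_notification(enum member_dev_mode m)
{
    return m == MODE_ACTIVE || m == MODE_STOP;
}

/* The device context may still change in Active and Stop modes;
 * in Freeze mode it no longer changes. */
static bool dev_context_may_change(enum member_dev_mode m)
{
    return m == MODE_ACTIVE || m == MODE_STOP;
}
```

The split between "initiates access" and "receives notification" is the reason Stop and Freeze are distinct in this sketch.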

> >
> > The device requirements cover this content more explicitly:
> >
> > For the SR-IOV group type, regardless of the member device mode, all
> > the PCI transport level registers MUST be always accessible and the
> > member device MUST function the same way for all the PCI transport level
> registers regardless of the member device mode.
> >
> > > > + the device do not have any changes in the device context. The
> > > > + member device is not accessed in the system through the virtio
> interface.
> > > > + \\
> > >
> > > But accessible via PCI interface?
> > >
> > Yes, as usual.
> >
> > > For example, what happens if we want to freeze during FLR? Does the
> > > hypervisor need to wait for the FLR to be completed?
> > >
> > Hypervisor do not need wait for the FLR to be completed.
> 
> So does FLR change device context?
Yes.

> 
> >
> > > > +\hline
> > > > +\hline
> > > > +0x03-0xFF   & -    & reserved for future use \\
> > > > +\hline
> > > > +\end{tabularx}
> > > > +
> > > > +When the owner driver wants to stop the operation of the device,
> > > > +the owner driver sets the device mode to \field{Stop}. Once the
> > > > +device is in the \field{Stop} mode, the device does not initiate
> > > > +any notifications or does not access any driver memory. Since the
> > > > +member driver may be still active which may send further driver
> > > > +notifications to the device, the device context may be updated.
> > > > +When the member driver has stopped accessing the device, the
> > > > +owner driver sets the device to \field{Freeze} mode indicating to
> > > > +the device that no more driver access occurs. In the
> > > > +\field{Freeze} mode, no more changes occur in the device context.
> > > > +At this point, the device ensures that
> > > there will not be any update to the device context.
> > >
> > > What is missed here are:
> > >
> > > 1) it is a virtio specific states or not
> > It is not.
> >
> > > 2) if it is a virtio specific state, if or how to synchronize with
> > > transport specific interfaces and why
> > > 3) can active go directly to freeze and why
> > >
> > Yes. I don’t see a reason to not allow it.
> > Changing from active to freeze mode directly is useful on the destination side, where the
> destination hypervisor knows for sure that there is no other entity accessing the
> device.
> > And it needs to set up the device context it received from the source side.
> > So setting freeze mode can be done directly.
> >
> > > > +
> > > > +The member device has a device context which the owner driver can
> > > > +either read or write. The member device context consist of any
> > > > +device specific data which is needed by the device to resume its
> > > > +operation when the device mode
> > >
> > > This is too vague. There're states that are not suitable for cmd/queue for
> sure.
> > > I'd split it into
> > >
> > > 1) common states: virtqueue, dirty pages
> > > 2) device specific states: defined be each device
> > >
> > This is theory of operation section. So it capturing such details.
> > Actual device context definition is outside of theory, and precise states of
> virtqueue, device specific, etc are in it.
> 
> See my comment above regarding to the device context.
> 
I replied above, device context link is added in the patch-3 in the theory of operation.
So reader gets the complete view.

> >
> > > > +is changed from \field{Stop} to \field{Active} or from
> > > > +\field{Freeze} to \field{Active}.
> > > > +
> > > > +Once the device context is read, it is cleared from the device.
> > >
> > > This is horrible, it means we can't easily
> > >
> > > 1) re-try the migration
> > > 2) recover from migration failure
> > >
> > Can you please explain the flow?
> 
> When migration fails, management can choose to resume the device(VM) on
> the source.
> 
OK. This should be possible: the management, which has the device context, can restore it on the source
and move the device mode to active.

> If the state were cleared, it means there's not simple way to resume the device
> but restoring the whole context.
> 
Yes, as you say, restoring the whole context will suffice for this rare corner case.

> What's the consideration for such clearing?
> 
There are two considerations.
1.  If one does not clear, till how long should it be kept on the device?
2. device context returns incremental value from the previous read. So, it needs to clear it.
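
As a rough illustration of the second point, the sketch below models one possible reading of "incremental" reads: the device reports only what changed since the previous read and then clears that record, while the management side folds each read into a full copy it can later restore. All names, the per-field granularity, and the field count are assumptions made for this example only.

```c
#include <stdint.h>

#define NFIELDS 4  /* arbitrary number of context fields for this sketch */

struct dev_ctx {
    uint64_t field[NFIELDS];
    uint8_t  dirty[NFIELDS];  /* 1 if the field changed since the previous read */
};

/* Device side: a field changes while the device is in Active or Stop mode. */
static void dev_update(struct dev_ctx *d, int i, uint64_t v)
{
    d->field[i] = v;
    d->dirty[i] = 1;
}

/* Device side: report only fields changed since the previous read, then
 * clear the change record. Returns the number of fields reported. */
static int dev_read_incremental(struct dev_ctx *d, uint64_t *out, uint8_t *valid)
{
    int n = 0;

    for (int i = 0; i < NFIELDS; i++) {
        valid[i] = d->dirty[i];
        if (d->dirty[i]) {
            out[i] = d->field[i];
            n++;
        }
        d->dirty[i] = 0;
    }
    return n;
}

/* Management side: fold an incremental read into the full copy it keeps. */
static void mgmt_merge(uint64_t *full, const uint64_t *delta, const uint8_t *valid)
{
    for (int i = 0; i < NFIELDS; i++)
        if (valid[i])
            full[i] = delta[i];
}
```

With this model, the merged copy held by management is what would be written back on the source if the migration has to be abandoned.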

> > And which software stack may find this useful?
> > Is there any existing software that can utilize it?
> 
> Libvirt.
> 
Does libvirt restore on migration failure?

> > Why that device context present with the software vanished, in your
> assumption, if it is?
> >
> > > > Typically, on
> > > > +the source hypervisor, the owner driver reads the device context
> > > > +once when the device is in \field{Active} or \field{Stop} mode
> > > > +and later once the member device is in \field{Freeze} mode.
> > >
> > > Why need the read while device context could be changed? Or is the
> > > dirty page part of the device context?
> > >
> > It is not part of the dirty page.
> > It needs to read in the active/stop mode, so that it can be shared with
> destination hypervisor, which will pre-setup the complex context of the device,
> while it is still running on the source side.
> 
> Is such a method used by any hypervisor? 
Yes. qemu which uses vfio interface uses it.
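
For reference, the ordering being described can be sketched as below. This is a toy model, not the proposed commands: the device is a plain struct, a single integer stands in for the whole device context, and every function name is hypothetical. Placing the destination in Freeze before the first write is also an assumption of this sketch.

```c
#include <stdint.h>

enum mode { ACTIVE = 0x0, STOP = 0x1, FREEZE = 0x2 };

/* Stand-in for a member device; one integer represents the whole context. */
struct fake_dev {
    enum mode mode;
    uint64_t  ctx;
};

static void set_mode(struct fake_dev *d, enum mode m) { d->mode = m; }
static uint64_t read_ctx(const struct fake_dev *d)    { return d->ctx; }
static void write_ctx(struct fake_dev *d, uint64_t c) { d->ctx = c; }

/* Two-pass flow: an early copy while the source still runs, then a final
 * copy once the source is in Freeze mode. */
static void migrate(struct fake_dev *src, struct fake_dev *dst)
{
    set_mode(dst, FREEZE);          /* Active -> Freeze directly on the destination */
    write_ctx(dst, read_ctx(src));  /* first pass, source still Active */

    set_mode(src, STOP);            /* source stops initiating accesses */
    set_mode(src, FREEZE);          /* no further context changes */
    write_ctx(dst, read_ctx(src));  /* final pass */

    set_mode(dst, ACTIVE);          /* resume with the most recently written context */
}
```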

> 
> >
> > > > +
> > > > +Typically, the device context is read and written one time on the
> > > > +source and the destination hypervisor respectively once the
> > > > +device is in \field{Freeze} mode. On the destination hypervisor,
> > > > +after writing the device context, when the device mode set to
> > > > +\field{Active}, the device uses the most recently set device
> > > > +context and resumes the device
> > > operation.
> > >
> > > There's no context sequence, so this is obvious. It's the semantic
> > > of all other existing interfaces.
> > >
> > Can you please say which existing interfaces you mean here?
> 
> For any common cfg member. E.g queue_addr.
> 
> The driver wrote 100 different values to queue_addr and the device used the
> value written last time.
> 
o.k. I don’t see any problem in stating what is done, which is less vague. 😊

> >
> > > > +
> > > > +In an alternative flow, on the source hypervisor the owner driver
> > > > +may choose to read the device context first time while the device
> > > > +is in \field{Active} mode and second time once the device is in
> > > > +\field{Freeze}
> > > mode.
> > >
> > > Who is going to synchronize the device context with possible
> > > configuration from the driver?
> > >
> > Not sure I understand the question.
> > If I understand you right, do you mean that, When configuration change
> > is done by the guest driver, how does device context change?
> >
> 
> Yes.
> 
> > If so, device context reading will reflect the new configuration.
> 
> How do you do that? For example:
> 
> static inline void vp_iowrite64_twopart(u64 val,
>                                         __le32 __iomem *lo,
>                                         __le32 __iomem *hi) {
>         vp_iowrite32((u32)val, lo);
>         vp_iowrite32(val >> 32, hi);
> }
> 
> Is it ok to be freezed in the middle of two vp_iowrite()?
> 
Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG section captures the partial value.
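
One way to picture this, as a sketch only: the device tracks which half of a 64-bit field has been written, and that partial state is what gets carried in the context. The struct and functions below are hypothetical; the actual layout of VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG is defined by the patch, not by this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side view of a 64-bit field written as two 32-bit halves. */
struct split_reg64 {
    uint32_t lo;
    uint32_t hi;
    bool     lo_written;
    bool     hi_written;
};

static void reg_write_lo(struct split_reg64 *r, uint32_t v)
{
    r->lo = v;
    r->lo_written = true;
}

static void reg_write_hi(struct split_reg64 *r, uint32_t v)
{
    r->hi = v;
    r->hi_written = true;
}

/* Snapshot taken for the device context: copies whatever has landed so far,
 * including a half-written value. */
static struct split_reg64 ctx_snapshot(const struct split_reg64 *r)
{
    return *r;
}

/* Combined value once both halves are present. */
static uint64_t reg_value(const struct split_reg64 *r)
{
    return ((uint64_t)r->hi << 32) | r->lo;
}
```

In this model the destination restores the snapshot, and the driver's second 32-bit write completes the value there.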

> >
> > > > Similarly, on the
> > > > +destination hypervisor writes the device context first time while
> > > > +the device is still running in \field{Active} mode on the source
> > > > +hypervisor and writes the device context second time while the
> > > > +device is in
> > > \field{Freeze} mode.
> > > > +This flow may result in very short setup time as the device
> > > > +context likely have minimal changes from the previously written device
> context.
> > >
> > > Is the hypervisor who is in charge of doing the comparison and
> > > writing only the delta?
> > >
> > The spec commands allow to do so. So possibility exists from spec wise.
> 
> There are various optimizations for migration for sure, I don't think mentioning
> any specific one is good.
> 
The text is informative text similar to,

" However, some devices benefit from the ability to find out the amount of available data in the queue without
accessing the virtqueue in memory"

" To help with these optimizations, when VIRTIO_F_NOTIFICATION_DATA has been negotiated".

Is this the only optimization in virtio? No, but we still mention the rationale of why it exists.
As long as the rationale does not confuse the reader, and adds value by explaining how things work, it is fine to add.
Which is what above few lines did.
So let's keep it.

The easiest is to cut out the whole theory of operation and just write commands like how RSS command did, without even writing a single line about RSS.
I think we can do better explanation than that for new things we add.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-09 14:30       ` Parav Pandit
@ 2023-10-10  8:52         ` Zhu, Lingshan
  2023-10-10  9:58           ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-10  8:52 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/9/2023 10:30 PM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, October 9, 2023 4:04 PM
>>
>> On 10/8/2023 7:41 PM, Michael S. Tsirkin wrote:
>>> On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
>>>> Define the device context and its fields for purpose of device
>>>> migration. The device context is read and written by the owner driver
>>>> on source and destination hypervisor respectively.
>>>>
>>>> Device context fields will experience a rapid growth post this
>>>> initial version to cover many details of the device.
>>>>
>>>> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
>>>> Signed-off-by: Parav Pandit <parav@nvidia.com>
>>>> Signed-off-by: Satananda Burla <sburla@marvell.com>
>>>> ---
>>>> changelog:
>>>> v0->v1:
>>>> - enrich device context to cover feature bits, device configuration
>>>>     fields
>>>> - corrected alignment of device context fields
>>>> ---
>>>>    content.tex        |   1 +
>>>>    device-context.tex | 142
>> +++++++++++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 143 insertions(+)
>>>>    create mode 100644 device-context.tex
>>>>
>>>> diff --git a/content.tex b/content.tex index 0a62dce..2698931 100644
>>>> --- a/content.tex
>>>> +++ b/content.tex
>>>> @@ -503,6 +503,7 @@ \section{Exporting Objects}\label{sec:Basic Facilities
>> of a Virtio Device / Expo
>>>>    UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>>>>
>>>>    \input{admin.tex}
>>>> +\input{device-context.tex}
>>>>
>>>>    \chapter{General Initialization And Device
>>>> Operation}\label{sec:General Initialization And Device Operation}
>>>>
>>>> diff --git a/device-context.tex b/device-context.tex new file mode
>>>> 100644 index 0000000..5611382
>>>> --- /dev/null
>>>> +++ b/device-context.tex
>>>> @@ -0,0 +1,142 @@
>>>> +\section{Device Context}\label{sec:Basic Facilities of a Virtio
>>>> +Device / Device Context}
>>>> +
>>>> +The device context holds the information that a owner driver can use
>>>> +to setup a member device and resume its operation. The device
>>>> +context of a member device is read or written by the owner driver
>>>> +using administration commands.
>>>> +
>>>> +\begin{lstlisting}
>>>> +struct virtio_dev_ctx_field_tlv {
>>>> +        le32 type;
>>>> +        le32 reserved;
>>>> +        le64 length;
>>>> +        u8 value[];
>>>> +};
>>>> +
>>>> +struct virtio_dev_ctx {
>>>> +        le32 field_count;
>>>> +        struct virtio_dev_ctx_field_tlv fields[]; };
>>>> +
>>>> +\end{lstlisting}
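
For readers following the layout, a minimal sketch of how software might look up one field in such a TLV array is shown here. It assumes the fields are stored back to back with no padding between a value and the next header, which the quoted listing does not state; the helper names are hypothetical and the type numbers used in any test data are arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

#define TLV_HDR_LEN 16u  /* le32 type + le32 reserved + le64 length */

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t get_le64(const uint8_t *p)
{
    return (uint64_t)get_le32(p) | (uint64_t)get_le32(p + 4) << 32;
}

/* Scans field_count TLVs in buf[0..len) for the first field of the given type.
 * Returns a pointer to its value and stores its length in *value_len, or
 * returns NULL if the type is absent or a field runs past the buffer. */
static const uint8_t *tlv_find(const uint8_t *buf, size_t len,
                               uint32_t field_count, uint32_t type,
                               uint64_t *value_len)
{
    size_t off = 0;

    for (uint32_t i = 0; i < field_count; i++) {
        if (len - off < TLV_HDR_LEN)
            return NULL;                    /* truncated header */
        uint32_t t = get_le32(buf + off);
        uint64_t vlen = get_le64(buf + off + 8);
        off += TLV_HDR_LEN;
        if (vlen > len - off)
            return NULL;                    /* truncated value */
        if (t == type) {
            *value_len = vlen;
            return buf + off;
        }
        off += (size_t)vlen;
    }
    return NULL;
}
```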
>> so this still doesn't work for nested
> In one use case of nesting, that we came across is:
> there is large host_VM which is hosting another guest_VMs.
> In such case, the owner PF is passthrough to this host_VM and current proposed scheme continue to function for nesting as well for nested guest_VMs.
The system admin can choose to pass through only some of the devices to 
nested guests, so passing the PF through to the L1 guest is not a good idea, 
because there can be many devices that still work for the host or L1.
>
> In second use case, where one want to bind only one member device to one VM,
> I think same plumbing can be extended to have another VF, to take the role of migration device instead of owner device.
>
> I don’t see a good way to passthrough and also do in-band migration without lot of device specific trap and emulation.
> I also don’t know the cpu performance numbers with 3 levels of nested page table translation which to my understanding cannot be accelerated by the current cpu.
host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA and so on; 
there can be performance overhead, but it can be done.

So admin vq migration still doesn't work for nested, this is surely a blocker.
> Do you know how does it work for Intel x86_64?
> Can it do > 2 level of nested page tables? If no, what is the perf characteristics to expect?
Of course that can be done. Page tables are not a problem; there are soft 
MMU emulation and vIOMMU, though with performance overhead.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-09 17:21         ` Parav Pandit
@ 2023-10-10  8:57           ` Zhu, Lingshan
  2023-10-10  9:40             ` Parav Pandit
  2023-10-11 19:51             ` Michael S. Tsirkin
  0 siblings, 2 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-10  8:57 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/10/2023 1:21 AM, Parav Pandit wrote:
>
>> From: Michael S. Tsirkin <mst@redhat.com>
>> Sent: Monday, October 9, 2023 9:50 PM
>>>>> One or more passthrough PCI VF devices are ubiquitous for virtual
>>>>> machines usage using generic kernel framework such as vfio [1].
>>>> Mentioning a specific subsystem in a specific OS may mislead the
>>>> user to think it can only work in that setup. Let's not do that,
>>>> virtio is not only used for Linux and VFIO.
>>> This is just one example on how these commands are useful.
>>> It can be useful in more ways too in more OSes too.
>>> I will drop from the patch commit log and keep as information purpose in
>> cover letter.
>>> Would that work for you?
>>>
>>> I don’t have any strong opinion to keep it or remove it as most stakeholders
>> has the clear view of requirements now.
>>> Let me know.
>> So some people use VFs with VFIO. Hence the module name.  This sentence by
>> itself seems to have zero value for the spec. Just drop it.
> Ok. Will drop.
So why not build your admin vq live migration on our config space solution,
get out of the troubles, to make your life easier?

Actually you don't see any technical problems in our config space 
proposal, right?



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10  8:57           ` Zhu, Lingshan
@ 2023-10-10  9:40             ` Parav Pandit
  2023-10-11 10:25               ` Zhu, Lingshan
  2023-10-11 19:51             ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-10  9:40 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

Hi Lingshan,

> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Tuesday, October 10, 2023 2:28 PM
> 
> On 10/10/2023 1:21 AM, Parav Pandit wrote:
> >
> >> From: Michael S. Tsirkin <mst@redhat.com>
> >> Sent: Monday, October 9, 2023 9:50 PM
> >>>>> One or more passthrough PCI VF devices are ubiquitous for virtual
> >>>>> machines usage using generic kernel framework such as vfio [1].
> >>>> Mentioning a specific subsystem in a specific OS may mislead the
> >>>> user to think it can only work in that setup. Let's not do that,
> >>>> virtio is not only used for Linux and VFIO.
> >>> This is just one example on how these commands are useful.
> >>> It can be useful in more ways too in more OSes too.
> >>> I will drop from the patch commit log and keep as information
> >>> purpose in
> >> cover letter.
> >>> Would that work for you?
> >>>
> >>> I don’t have any strong opinion to keep it or remove it as most
> >>> stakeholders
> >> has the clear view of requirements now.
> >>> Let me know.
> >> So some people use VFs with VFIO. Hence the module name.  This
> >> sentence by itself seems to have zero value for the spec. Just drop it.
> > Ok. Will drop.
> So why not build your admin vq live migration on our config space solution, get
> out of the troubles, to make your life easier?
> 
This question is completely unrelated to this reply, or you misunderstood what dropping the commit log text means.

Dropping link to vfio does not drop the requirement.
I am ok to drop because requirements are clear of passthrough of member device.
Vfio is not a trouble at all.
Admin command is not a trouble either.

The pure technical reason is: all the functionalities proposed cannot be done in any other existing way.
Why? For below reasons.
1. device context, and write records (aka dirty page addresses) is huge which cannot be shared using config registers at scale of 4000 member devices
2. sharing such large context and write addresses in parallel for multiple devices cannot be done using single register file
3. These registers cannot be residing in the VF because VF can undergo FLR, and device reset which must clear these registers
4. When VF does the DMA, all dma occurs in the guest address space, not in hypervisor space; any flr and device reset must stop such dma.
And device reset and flr are controlled by the guest (not mediated by hypervisor).
5. Any PASID to separate out admin vq on the VF does not work for two reasons.
R_1: device flr and device reset must stop all the dmas.
R_2: PASID by most leading vendors is still not mature enough
R_3: One also needs to do inversion to not expose PASID capability of the member PCI device to not expose 

> Actually you don't see any technical problems in our config space proposal,
> right?
In config registers method, for passthrough I clearly see the technical problems (functional and scale) listed above.
Due to which config registers cannot reside on the VF and cannot scale either.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-10  8:52         ` Zhu, Lingshan
@ 2023-10-10  9:58           ` Parav Pandit
  2023-10-11 10:07             ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-10  9:58 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Tuesday, October 10, 2023 2:22 PM
> 
> On 10/9/2023 10:30 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, October 9, 2023 4:04 PM
> >>
> >> On 10/8/2023 7:41 PM, Michael S. Tsirkin wrote:
> >>> On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
> >>>> Define the device context and its fields for purpose of device
> >>>> migration. The device context is read and written by the owner
> >>>> driver on source and destination hypervisor respectively.
> >>>>
> >>>> Device context fields will experience a rapid growth post this
> >>>> initial version to cover many details of the device.
> >>>>
> >>>> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> >>>> Signed-off-by: Parav Pandit <parav@nvidia.com>
> >>>> Signed-off-by: Satananda Burla <sburla@marvell.com>
> >>>> ---
> >>>> changelog:
> >>>> v0->v1:
> >>>> - enrich device context to cover feature bits, device configuration
> >>>>     fields
> >>>> - corrected alignment of device context fields
> >>>> ---
> >>>>    content.tex        |   1 +
> >>>>    device-context.tex | 142
> >> +++++++++++++++++++++++++++++++++++++++++++++
> >>>>    2 files changed, 143 insertions(+)
> >>>>    create mode 100644 device-context.tex
> >>>>
> >>>> diff --git a/content.tex b/content.tex index 0a62dce..2698931
> >>>> 100644
> >>>> --- a/content.tex
> >>>> +++ b/content.tex
> >>>> @@ -503,6 +503,7 @@ \section{Exporting Objects}\label{sec:Basic
> >>>> Facilities
> >> of a Virtio Device / Expo
> >>>>    UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
> >>>>
> >>>>    \input{admin.tex}
> >>>> +\input{device-context.tex}
> >>>>
> >>>>    \chapter{General Initialization And Device
> >>>> Operation}\label{sec:General Initialization And Device Operation}
> >>>>
> >>>> diff --git a/device-context.tex b/device-context.tex new file mode
> >>>> 100644 index 0000000..5611382
> >>>> --- /dev/null
> >>>> +++ b/device-context.tex
> >>>> @@ -0,0 +1,142 @@
> >>>> +\section{Device Context}\label{sec:Basic Facilities of a Virtio
> >>>> +Device / Device Context}
> >>>> +
> >>>> +The device context holds the information that a owner driver can
> >>>> +use to setup a member device and resume its operation. The device
> >>>> +context of a member device is read or written by the owner driver
> >>>> +using administration commands.
> >>>> +
> >>>> +\begin{lstlisting}
> >>>> +struct virtio_dev_ctx_field_tlv {
> >>>> +        le32 type;
> >>>> +        le32 reserved;
> >>>> +        le64 length;
> >>>> +        u8 value[];
> >>>> +};
> >>>> +
> >>>> +struct virtio_dev_ctx {
> >>>> +        le32 field_count;
> >>>> +        struct virtio_dev_ctx_field_tlv fields[]; };
> >>>> +
> >>>> +\end{lstlisting}
> >> so this still doesn't work for nested
> > In one use case of nesting, that we came across is:
> > there is large host_VM which is hosting another guest_VMs.
> > In such case, the owner PF is passthrough to this host_VM and current
> proposed scheme continue to function for nesting as well for nested
> guest_VMs.
> The system admin can choose only passthrough some of the devices for nested
> guests, so passthrough the PF to L1 guest is not a good idea, because there can
> be many devices still work for the host or L1.
Possible. One size does not fit all.
What I expressed is most common scenarios that user care about.

> >
> > In second use case, where one want to bind only one member device to
> > one VM, I think same plumbing can be extended to have another VF, to take
> the role of migration device instead of owner device.
> >
> > I don’t see a good way to passthrough and also do in-band migration without
> lot of device specific trap and emulation.
> > I also don’t know the cpu performance numbers with 3 levels of nested page
> table translation which to my understanding cannot be accelerated by the
> current cpu.
> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA and so
> on, there can be performance overhead, but can be done.
> 
> So admin vq migration still don't work for nested, this is surely a blocker.
In the specific case where member devices are located at different nesting levels, it does not.

What prevents you from having a peer VF take the role of migration driver?
Basically, what I am proposing is, connect two VFs to the L1 guest. One VF is migration driver, one VF is passthrough to L2 guest.
And same scheme works.

On the other hand,
many parts of the cpu subsystem, such as PML and page tables, do not have N level nesting support either.
They all work on top of emulation and pay the price for emulation when nesting is done.
Maybe that is the first version for virtio too.

I frankly feel that nesting support requires industry level ecosystem support, not just in virtio.
Virtio attempting to focus on nesting while having nearly the same level of performance as bare metal seems far-fetched.
Maybe I am wrong, as we have not seen such a high perf nested env even with a sw based device.

What can possibly be done is to answer:
1. Which admin commands from this series are useful for nesting as is?
2. Which admin commands from the current series need an extension for nesting?
3. Which admin commands do not work at all for nesting, and hence need new commands?

If we can focus on those, maybe we can find a common approach that caters to both cases.

> > Do you know how it works for Intel x86_64?
> > Can it do > 2 levels of nested page tables? If not, what perf characteristics
> > should one expect?
> Of course that can be done. Page tables are not a problem; there is soft mmu
> emulation and viommu, though with performance overhead.

Due to the performance overheads, I really doubt any cloud operator would use a nested passthrough virtio device for any sensible workload.
But you may already know what level of nested performance is acceptable to users.


* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10  7:19         ` Parav Pandit
@ 2023-10-10 12:41           ` Michael S. Tsirkin
  2023-10-10 13:08             ` Parav Pandit
  2023-10-11  3:14           ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-10 12:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Tue, Oct 10, 2023 at 07:19:45AM +0000, Parav Pandit wrote:
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 10, 2023 11:21 AM
> > 
> > On Mon, Oct 9, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, October 9, 2023 2:19 PM
> > > >
> > > > Adding LingShan.
> > > >
> > > Thanks for adding him.
> > >
> > > > Parav, if you want any specific people to comment, please do cc them.
> > > >
> > > Sure, will cc them in v2 as now I see there is interest in the review.
> > >
> > > > On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > > One or more passthrough PCI VF devices are ubiquitous for virtual
> > > > > machines usage using generic kernel framework such as vfio [1].
> > > >
> > > > Mentioning a specific subsystem in a specific OS may mislead the
> > > > user to think it can only work in that setup. Let's not do that,
> > > > virtio is not only used for Linux and VFIO.
> > > >
> > > Not really. it is an example in the cover letter.
> > > It is not the only use case.
> > > A use case gives a crisp clarity of what UAPI it needs to fulfil.
> > > So I will keep it. It is anyway written as one use case.
> > >
> > > > >
> > > > > A passthrough PCI VF device is fully owned by the virtual machine
> > > > > device driver.
> > > >
> > > > Is this true? Even VFIO needs to mediate PCI stuff. Or how do you
> > > > define "passthrough" here?
> > > >
> > > Other than PCI config registers and due to some legacy, msix.
> > > The "device interface" side is not mediated.
> > > The definition of passthrough here is: To not mediate a device type specific
> > and virtio specific interfaces for modern and future devices.
> > 
> > Ok, but what's the difference between "device type specific" and "virtio specific
> > interfaces". Maybe an example for this?
> > 
> Virtio device specific means: cvq of crypto device, cvq of net device, flow filter vqs of net device etc.
> Virtio specific interface: virtio driver notifications, virtio virtqueue and configuration mediation etc.
> 
> > >
> > > > > This passthrough device controls its own device reset flow, basic
> > > > > functionality as PCI VF function level reset
> > > >
> > > > How about other PCI stuff? Or Why is FLR special?
> > > FLR is special for the readers to get the clarity that FLR is also done by the
> > guest driver hence, the device migration commands do not interact/depend
> > with FLR flow.
> > 
> > It's still not clear to me how this is done.
> > 
> > 1) guest starts FLR
> > 2) adminq freeze the VF
> > 3) FLR is done
> > 
> > If the freezing doesn't wait for the FLR, does it mean we need to migrate to a
> > state like FLR is pending? If yes, do we need to migrate the other sub states like
> > this? If not, why?
> > 
> In most practical cases #2 followed by #1 should not happen, because on the source side the expected mode change is from active to stop.
> But ok, since the active to freeze mode change is allowed, let's discuss the above.
> 
> A device is the single synchronization point for any device reset, FLR or admin command operation.
> So, the migration driver does not need to wait for the FLR to complete.
> When the admin cmd freezes the VF, it can expect an FLR_completed VF.
> Secondly, since the FLR is local to the source, the intermediate sub state does not migrate.
> 
> But I agree, it is worth having text that captures this.
> 
> > >
> > > >
> > > > > and rest of the virtio device functionality such as control vq,
> > > >
> > > > What do you mean by "rest of"?
> > > >
> > > As given in the example cvq.
> > >
> > > > Which part is not controlled and why?
> > > > Not controlled because, as stated, it is a passthrough device.
> > >
> > > > > config space access, data path descriptors handling.
> > > > >
> > > > > Additionally, VM live migration using a precopy method is also widely
> > used.
> > > >
> > > > Why is this mentioned here?
> > > >
> > > > Huh. You should be in favor of bringing clarity to the readers about
> > > > the use case.
> > > > And you seem to be opposed, but ok.
> > > >
> > > > As stated, it is for the reader to understand the use case and see how the
> > > > proposed commands address the use case.
> > 
> > The problem is that the hardware features should be designed for a general
> > purpose instead of a specific technology if it can. The only missing part for post
> > copy is the page fault.
> > 
> Ok. The use case and requirement of member device passthrough is clear to most reviewers now.
> So I will remove it from commit log.
> 
> > >
> > > > >
> > > > > To support a VM live migration for such passthrough virtio
> > > > > devices, the owner PCI PF device administers the device migration flow.
> > > >
> > > > Well, if this is specific only to PCI SR-IOV, I'd move it to the PCI transport
> > part.
> > > > But I guess not.
> > > We took the decision to not do so, for other group commands as well.
> > > After Michael's suggestion we moved it to group commands.
> > > So I will not debate this further.
> > >
> > > >
> > > > >
> > > > > This patch introduces the basic theory of operation which
> > > > > describes the flow and supporting administration commands.
> > > > >
> > > > > [1]
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/t
> > > > > ree/
> > > > > include/uapi/linux/vfio.h?h=v6.1.47
> > > > >
> > > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > ---
> > > > >  admin-cmds-device-migration.tex | 94
> > > > +++++++++++++++++++++++++++++++++
> > > > >  admin.tex                       |  1 +
> > > > >  2 files changed, 95 insertions(+)  create mode 100644
> > > > > admin-cmds-device-migration.tex
> > > > >
> > > > > diff --git a/admin-cmds-device-migration.tex
> > > > > b/admin-cmds-device-migration.tex new file mode 100644 index
> > > > > 0000000..f839af4
> > > > > --- /dev/null
> > > > > +++ b/admin-cmds-device-migration.tex
> > > > > @@ -0,0 +1,94 @@
> > > > > +\subsubsection{Device Migration}\label{sec:Basic Facilities of a
> > > > > +Virtio Device / Device groups / Group administration commands /
> > > > > +Device Migration}
> > > > > +
> > > > > +In some systems, there is a need to migrate a running virtual
> > > > > +machine from one to another system. A running virtual machine has
> > > > > +one or more passthrough virtio member devices attached to it. A
> > > > > +passthrough device is entirely operated by the guest virtual
> > > > > +machine. For example, with the SR-IOV group type, group member
> > > > > +(VF) may undergo virtio device initialization and reset flow
> > > >
> > > > What do you mean by "reset flow"? It looks not like a terminology
> > > > defined in the PCI spec. And Google gives me nothing about this.
> > > >
> > > "reset flow" = virtio specification section 2.4 Device Reset flow.
> > 
> > My git repo show it's still called "device reset" and I see you use "FLR flow"
> > which is also not very clear to me.
> > 
> Ok. I assume "reset flow" is clear to you now that it points to section 2.4.
> This section is not normative section, so using an extra word like "flow" does not confuse anyone.
> I will link to the section anyway.
> 
> > >
> > > > > and may also undergo PCI function level
> > > > > +reset(FLR) flow.
> > > >
> > > > Why is only FLR special here? I've asked FRS but you ignore the question.
> > > >
> > > FLR is special to bring clarity that guest owns the VF doing FLR, hence
> > hypervisor cannot mediate any registers of the VF.
> > 
> > It's not about mediation at all, it's about how the device can implement what
> > you want here correctly.
> > 
> > See my above question.
> > 
> Ok. It is clear that live migration commands cannot stay on the member device, because the member device can undergo device reset and FLR flows owned by the guest.
> (The hypervisor is not involved in these two flows, hence the admin command interface is designed such that it can fulfill the above requirements.)
> 
> The theory of operation brings out this clarity. Please notice that it is in an introductory section with an example.
> It is not a normative line.

All this does beg the question of how is device undergoing flr though
and that has to be in a normative statement.

> > >
> > > > > Such flows must comply to the PCI standard and also
> > > > > +virtio specification;
> > > >
> > > > This seems unnecessary and obvious as it applies to all other PCI
> > > > and virtio functionality.
> > > >
> > > > Great. But your comment is contradictory.
> > >
> > > > What's more, for the things that need to be synchronized, I don't
> > > > see any descriptions in this patch. And if it doesn't need, why?
> > > With which operation should it be synchronized and why?
> > > Can you please be specific?
> > 
> > See my above question regarding FLR. And it may have others which I haven't
> > had time to audit.
> > 
> Ok. when you get chance to audit, lets discuss that time.
> 
> > >
> > > It is not written in this series, because we believe it must not be synchronized
> > as it is fully controlled by the guest.
> > >
> > > >
> > > > > at the same time such flows must not obstruct
> > > > > +the device migration flow. In such a scenario, a group owner
> > > > > +device can provide the administration command interface to
> > > > > +facilitate the device migration related operations.
> > > > > +
> > > > > +When a virtual machine migrates from one hypervisor to another
> > > > > +hypervisor, these hypervisors are named as source and destination
> > > > hypervisor respectively.
> > > > > +In such a scenario, a source hypervisor administers the member
> > > > > +device to suspend the device and preserves the device context.
> > > > > +Subsequently, a destination hypervisor administers the member
> > > > > +device to setup a device context and resumes the member device.
> > > > > +The source hypervisor reads the member device context and the
> > > > > +destination hypervisor writes the member device context. The
> > > > > +method to transfer the member device context from the source to
> > > > > +the destination hypervisor is
> > > > outside the scope of this specification.
> > > > > +
> > > > > +The member device can be in any of the three migration modes. The
> > > > > +owner driver sets the member device in one of the following modes
> > > > > +during
> > > > device migration flow.
> > > > > +
> > > > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name &
> > > > > +Description \\ \hline \hline
> > > > > +0x0   & Active &
> > > > > +  It is the default mode after instantiation of the member
> > > > > +device. \\
> > > >
> > > > I don't think we ever define "instantiation" anywhere.
> > > >
> > > Well a transport has implicit definition of the instantiation already.
> > > May be a text can be added, but don’t see a value in duplicating PCI spec
> > here.
> > 
> > Ok, maybe something like "transport specific instantiation"
> > 
> Ok. that’s a good text. I will change to it.
> 
> > >
> > > > > +\hline
> > > > > +0x1   & Stop &
> > > > > + In this mode, the member device does not send any notifications,
> > > > > +and it does not access any driver memory.
> > > >
> > > > What's the meaning of "driver memory"?
> > > >
> > > May be guest memory? Or do you suggest a better naming for the memory
> > allocated by the guest driver?
> > 
> > Virtqueue?
> > 
> Virtqueue and any memory referred by the virtqueue.
> 
> This is good text, I will change to it.
> 
> > >
> > > > And stop seems to be a source of inflight buffers.
> > > >
> > > I didn’t follow it.
> > > If you mean without stop there are no inflight buffer, then I don’t agree.
> > > We don’t want to violate the spec by having descriptors with zero size
> > returned.
> > > Stop is not the source of inflight descriptors.
> > 
> > I think not since you forbid access to the used ring here. So even if the buffer
> > were processed by the device it can't be added back to the used ring thus
> > became inflight ones.
> > 
> > >
> > > There are inflight descriptors with the device that are not yet returned to the
> > driver, and device wont return them as zero size wrong completions.
> > >
> > > > > + The member device may receive driver notifications in this mode,
> > > >
> > > > What's the meaning of "receive"? For example if the device can still
> > > > process buffers, "stop" is not accurate.
> > > >
> > > Receive means, driver can send the notification as PCIe TLP that device may
> > receive as incoming PCIe TLP.
> > 
> > Ok, so this is the transport level. But the device can keep processing the queue?
> > 
> Device cannot process the queue because it does not initiate any read/write towards the virtqueue.
> 
> > >
> > > In "stop" mode, the device wont process descriptors.
> > 
> > If the device won't process descriptors, why still allow it to receive notifications?
> Because a notification may still arrive, and as part of it the device may update counters which need to be migrated, or store the received notification.
> 
> > Or does it really matter if the device can receive or not here?
> > 
> From the device point of view, the device is given the chance to update its device context as part of notifications or accesses to it.
> 
> > >
> > > > > + the member device context
> > > >
> > > > I don't think we define "device context" anywhere.
> > > >
> > > It is defined further in the description.
> > 
> > Like this?
> > 
> > """
> > +The member device has a device context which the owner driver can
> > +either read or write. The member device context consist of any
> > +device specific data which is needed by the device to resume its
> > +operation when the device mode
> > """
> > 
> Yes.
> Further patch-3 adds the device context and also add the link to it in the theory of operation section so reader can read more detail about it.

mention this in the commit log pls

> > "Any" is probably too hard for vendors to implement. And in patch 3 I only see
> > virtio device context. Does this mean we don't need transport
> > (PCI) context at all? If yes, how can it work?
> > 
> Right. PCI member device is present at source and destination with its layout, only the virtio device context is transferred.
> Which part cannot work?


wait don't we need to transfer pci state too? how is that migrated?

> > >
> > > > >and device configuration space may change. \\
> > > > > +\hline
> > > >
> > > > I still don't get why we need a "stop" state in the middle.
> > > >
> > > > The pci devices which belong to a single guest VM are not all stopped atomically.
> > > > Hence, a device which is in freeze mode may still receive driver
> > > > notifications from another pci device,
> > 
> > Device may choose to ignore those notifications, no?
> > 
> > > or it may experience a read from the shared memory and get garbage data.
> > 
> > Could you give me an example for this?
> > 
> Section 2.10 Shared Memory Regions.
> 
> > > And things can break.
> > > > Hence the stop mode ensures that all the devices get enough chance to stop
> > > > themselves and, later when frozen, do not change anything internally.
> > >
> > > > > +0x2   & Freeze &
> > > > > + In this mode, the member device does not accept any driver
> > > > > +notifications,
> > > >
> > > > This is too vague. Is the device allowed to be freezed in the middle
> > > > of any virtio or PCI operations?
> > > >
> > > > For example, in the middle of feature negotiation etc. It may cause
> > > > implementation specific sub-states which can't be migrated easily.
> > > >
> > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > It is passthrough device, hence hypervisor layer do not get to see sub-state.
> > >
> > > Not sure why you comment, why it cannot be migrated easily.
> > > The device context already covers this sub-state.
> > 
> > 1) driver writes driver_features
> > 2) driver sets FEATURES_OK
> > 
> > 3) device receives driver_features
> > 4) device validates driver_features
> > 5) device clears FEATURES_OK
> > 
> > 6) driver reads status and realizes FEATURES_OK has been cleared
> > 
> > Is it valid to be frozen in the middle of the above?
> No. The device mode is set to freeze only when the hypervisor is sure that the guest will make no more accesses.
> What can happen between #2 and #3 is that the device mode changes to stop.
> And in stop mode, the device context would capture #5 or #4, depending on where the device is at that point.
> 
> > >
> > > > And what's more, the above state machine seems to be virtio
> > > > specific, but you don't explain the interaction with the device status state
> > machine.
> > > First, above is not a state machine.
> > 
> > So how do readers know if a state can go to another state and when?
> >
> Not sure what you mean by reader. Can you please explain.
> 
> > > Second, it is not virtio specific.
> > 
> > It's somehow for sure, for example you said device context need to be
> > preserved. And as far as I see the device context is all virtio specific in patch 3.
> > 
> Sure, device context is virtio specific. :)
> Device context will reflect if things changed in the stop mode.
> 
> > > It is present in leading OS that has fundamental requirement to support P2P
> > devices.
> > 
> > If it's PCI specific, instead of trying to do a workaround in virtio, why not invent
> > a mechanism there?
> > 
> It is not a workaround in virtio.
> It is the way pci p2p devices work for which one needs to be receptive to handle the interaction.
> 
> 
> > > > Third, it is not interacting with the _actual_ device status.
> > >
> > > In "SUSPEND" patch-5, you already asked this question. I assume you asked
> > again so that this series is complete.
> > >
> > > > For example,
> > > > what happens if the driver wants to reset but the device is in stop
> > > > mode? You told me it is addressed in your series but looks not. Once
> > > > you try to describe that, you're actually try to connect states between the
> > two state machines.
> > > >
> > > As listed in the definition of the stop mode, the device do not act on the
> > incoming writes, it only keep tracks of its internal device context change as part
> > of this.
> > 
> > So only the driver notification is allowed by not config write? What's the
> > consideration for allowing driver notification?
> > 
> Because for most practical purposes, a peer device wants to queue blk, net or other requests and not do device configuration.
> 
> Do you know of any device configuration space which is RW?
> For net and blk I recall it as RO?

No it isn't. Pls look at the spec if you need to check that ;)


> > Let me ask differently, similar to FLR, what happens if the driver wants a virtio
> > reset but the hypervisor wants to stop or freeze?
> > 
> The device would respond to the stop/freeze request once it has internally started the reset, as the device is the single synchronization point which knows how to handle both in parallel.
> 
> > > We would enrich the device context for this, but no need to connects the
> > admin mode controlled by the owner device with operational state
> > (device_status) owned by the member device.
> > >
> > > > > + it ignores any device configuration space writes,
> > > >
> > > > How about read and the device configuration changes?
> > > >
> > > As listed, device do not have any changes.
> > > So device configuration change cannot occur.
> > 
> > It's not necessarily caused by config write, it could be things like link status or
> > geometry changes that are initiated from the device.
> > 
> I understand it. Link status was one example, you listed other examples too.
> The point is, when in freeze mode, the member device is frozen, hence, device won't initiate those changes.
> 
> > >
> > > The device requirements cover this content more explicitly:
> > >
> > > For the SR-IOV group type, regardless of the member device mode, all
> > > the PCI transport level registers MUST be always accessible and the
> > > member device MUST function the same way for all the PCI transport level
> > registers regardless of the member device mode.
> > >
> > > > > + the device do not have any changes in the device context. The
> > > > > + member device is not accessed in the system through the virtio
> > interface.
> > > > > + \\
> > > >
> > > > But accessible via PCI interface?
> > > >
> > > Yes, as usual.
> > >
> > > > For example, what happens if we want to freeze during FLR? Does the
> > > > hypervisor need to wait for the FLR to be completed?
> > > >
> > > Hypervisor do not need wait for the FLR to be completed.
> > 
> > So does FLR change device context?
> Yes.
> 
> > 
> > >
> > > > > +\hline
> > > > > +\hline
> > > > > +0x03-0xFF   & -    & reserved for future use \\
> > > > > +\hline
> > > > > +\end{tabularx}
> > > > > +
> > > > > +When the owner driver wants to stop the operation of the device,
> > > > > +the owner driver sets the device mode to \field{Stop}. Once the
> > > > > +device is in the \field{Stop} mode, the device does not initiate
> > > > > +any notifications or does not access any driver memory. Since the
> > > > > +member driver may be still active which may send further driver
> > > > > +notifications to the device, the device context may be updated.
> > > > > +When the member driver has stopped accessing the device, the
> > > > > +owner driver sets the device to \field{Freeze} mode indicating to
> > > > > +the device that no more driver access occurs. In the
> > > > > +\field{Freeze} mode, no more changes occur in the device context.
> > > > > +At this point, the device ensures that
> > > > there will not be any update to the device context.
> > > >
> > > > What is missed here are:
> > > >
> > > > 1) it is a virtio specific states or not
> > > It is not.
> > >
> > > > 2) if it is a virtio specific state, if or how to synchronize with
> > > > transport specific interfaces and why
> > > > 3) can active go directly to freeze and why
> > > >
> > > Yes. don’t see a reason to not allow it.
> > > Active to freeze mode can change is useful on the destination side, where
> > destination hypervisor knows for sure that there is no other entity accessing the
> > device.
> > > And it needs to setup the device context, it received from the source side.
> > > So setting freeze mode can be done directly.
> > >
> > > > > +
> > > > > +The member device has a device context which the owner driver can
> > > > > +either read or write. The member device context consist of any
> > > > > +device specific data which is needed by the device to resume its
> > > > > +operation when the device mode
> > > >
> > > > This is too vague. There're states that are not suitable for cmd/queue for
> > sure.
> > > > I'd split it into
> > > >
> > > > 1) common states: virtqueue, dirty pages
> > > > 2) device specific states: defined be each device
> > > >
> > > This is theory of operation section. So it capturing such details.
> > > Actual device context definition is outside of theory, and precise states of
> > virtqueue, device specific, etc are in it.
> > 
> > See my comment above regarding to the device context.
> > 
> I replied above, device context link is added in the patch-3 in the theory of operation.
> So reader gets the complete view.
> 
> > >
> > > > > +is changed from \field{Stop} to \field{Active} or from
> > > > > +\field{Freeze} to \field{Active}.
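
A minimal sketch, in C, that collects the mode changes mentioned so far in this thread: Active to Stop, Stop to Freeze, Active to Freeze (on the destination side), and Stop or Freeze back to Active. The mode values come from the quoted table; the names `member_dev_mode` and `mode_change_mentioned` are hypothetical. Changes the thread does not discuss, such as Freeze to Stop, are reported as "not mentioned", which is an assumption of this sketch and not a statement of what the specification forbids.

```c
#include <assert.h>
#include <stdbool.h>

/* Mode values from the quoted table; 0x03-0xFF are reserved. */
enum member_dev_mode {
	MODE_ACTIVE = 0x0,
	MODE_STOP   = 0x1,
	MODE_FREEZE = 0x2,
};

/*
 * Returns true only for mode changes that the thread explicitly mentions.
 * false means "not discussed in the thread", not "forbidden by the spec".
 */
static bool mode_change_mentioned(enum member_dev_mode from,
				  enum member_dev_mode to)
{
	assert(from <= MODE_FREEZE && to <= MODE_FREEZE);

	switch (from) {
	case MODE_ACTIVE:
		/* source side: stop first; destination side: freeze directly */
		return to == MODE_STOP || to == MODE_FREEZE;
	case MODE_STOP:
		return to == MODE_FREEZE || to == MODE_ACTIVE;
	case MODE_FREEZE:
		return to == MODE_ACTIVE;
	}
	return false;
}
```

Writing the mode changes down in one place like this is one way to answer the question of how a reader knows which mode can follow which.
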
> > > > > +
> > > > > +Once the device context is read, it is cleared from the device.
> > > >
> > > > This is horrible, it means we can't easily
> > > >
> > > > 1) re-try the migration
> > > > 2) recover from migration failure
> > > >
> > > Can you please explain the flow?
> > 
> > When migration fails, management can choose to resume the device(VM) on
> > the source.
> > 
> Ok. This should be possible, as the management which has the device context can restore it on the source
> and move the device mode to active.
> 
> > If the state were cleared, it means there's not simple way to resume the device
> > but restoring the whole context.
> > 
> Yes, as you say, restoring the whole context will suffice for this rare corner case.
> 
> > What's the consideration for such clearing?
> > 
> There are two considerations.
> 1. If one does not clear it, for how long should it be kept on the device?
> 2. A device context read returns the increment since the previous read. So, the device needs to clear it.
> 
> > > And which software stack may find this useful?
> > > Is there any existing software that can utilize it?
> > 
> > Libvirt.
> > 
> Does libvirt restore on migration failure?

yes

> > > Why that device context present with the software vanished, in your
> > assumption, if it is?
> > >
> > > > > Typically, on
> > > > > +the source hypervisor, the owner driver reads the device context
> > > > > +once when the device is in \field{Active} or \field{Stop} mode
> > > > > +and later once the member device is in \field{Freeze} mode.
> > > >
> > > > Why need the read while device context could be changed? Or is the
> > > > dirty page part of the device context?
> > > >
> > > It is not part of the dirty page.
> > > It needs to read in the active/stop mode, so that it can be shared with
> > destination hypervisor, which will pre-setup the complex context of the device,
> > while it is still running on the source side.
> > 
> > Is such a method used by any hypervisor? 
> Yes. qemu which uses vfio interface uses it.
> 
> > 
> > >
> > > > > +
> > > > > +Typically, the device context is read and written one time on the
> > > > > +source and the destination hypervisor respectively once the
> > > > > +device is in \field{Freeze} mode. On the destination hypervisor,
> > > > > +after writing the device context, when the device mode set to
> > > > > +\field{Active}, the device uses the most recently set device
> > > > > +context and resumes the device
> > > > operation.
> > > >
> > > > There's no context sequence, so this is obvious. It's the semantic
> > > > of all other existing interfaces.
> > > >
> > > > Can you please clarify which existing interfaces you mean here?
> > 
> > For any common cfg member. E.g queue_addr.
> > 
> > The driver wrote 100 different values to queue_addr and the device used the
> > value written last time.
> > 
> o.k. I don’t see any problem in stating what is done, which is less vague. 😊
> 
> > >
> > > > > +
> > > > > +In an alternative flow, on the source hypervisor the owner driver
> > > > > +may choose to read the device context first time while the device
> > > > > +is in \field{Active} mode and second time once the device is in
> > > > > +\field{Freeze}
> > > > mode.
> > > >
> > > > Who is going to synchronize the device context with possible
> > > > configuration from the driver?
> > > >
> > > Not sure I understand the question.
> > > If I understand you right, do you mean that, When configuration change
> > > is done by the guest driver, how does device context change?
> > >
> > 
> > Yes.
> > 
> > > If so, device context reading will reflect the new configuration.
> > 
> > How do you do that? For example:
> > 
> > static inline void vp_iowrite64_twopart(u64 val,
> >                                         __le32 __iomem *lo,
> >                                         __le32 __iomem *hi) {
> >         vp_iowrite32((u32)val, lo);
> >         vp_iowrite32(val >> 32, hi);
> > }
> > 
> > Is it ok to be freezed in the middle of two vp_iowrite()?
> > 
> Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG section captures the partial value.
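
A minimal sketch, in C, of what "captures the partial value" means for the two-part write quoted above. It models a 64-bit register as two 32-bit halves and takes a snapshot between the two writes. The struct and helper names are hypothetical, and the sketch does not describe the actual layout of the VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device-side model of a 64-bit register written in two parts. */
struct reg64_halves {
	uint32_t lo;
	uint32_t hi;
};

/* Mirrors the first vp_iowrite32() of the quoted helper. */
static void reg_write_lo(struct reg64_halves *r, uint64_t val)
{
	assert(r != NULL);
	r->lo = (uint32_t)val;
}

/* Mirrors the second vp_iowrite32() of the quoted helper. */
static void reg_write_hi(struct reg64_halves *r, uint64_t val)
{
	assert(r != NULL);
	r->hi = (uint32_t)(val >> 32);
}

/* What a context read would see if it ran at this instant. */
static uint64_t reg_snapshot(const struct reg64_halves *r)
{
	assert(r != NULL);
	return ((uint64_t)r->hi << 32) | r->lo;
}
```

If the snapshot is taken after `reg_write_lo()` but before `reg_write_hi()`, it holds the new low half and the old high half. Restoring that value on the destination lets the pending high-half write complete the update there.
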
> 
> > >
> > > > > Similarly, on the
> > > > > +destination hypervisor writes the device context first time while
> > > > > +the device is still running in \field{Active} mode on the source
> > > > > +hypervisor and writes the device context second time while the
> > > > > +device is in
> > > > \field{Freeze} mode.
> > > > > +This flow may result in very short setup time as the device
> > > > > +context likely have minimal changes from the previously written device
> > context.
> > > >
> > > > Is the hypervisor who is in charge of doing the comparison and
> > > > writing only the delta?
> > > >
> > > > The spec commands allow doing so. So the possibility exists as far as the spec is concerned.
> > 
> > There are various optimizations for migration for sure, I don't think mentioning
> > any specific one is good.
> > 
> The text is informative text similar to,
> 
> " However, some devices benefit from the ability to find out the amount of available data in the queue without
> accessing the virtqueue in memory"
> 
> " To help with these optimizations, when VIRTIO_F_NOTIFICATION_DATA has been negotiated".
> 
> Is this the only optimization in virtio? No, but we still mention the rationale for why it exists.
> As long as the rationale does not confuse the reader, and adds value by explaining how things work, it is fine to add.
> Which is what the above few lines did.
> So let's keep it.
> 
> The easiest is to cut out the whole theory of operation and just write the commands the way the RSS command did, without writing a single line about RSS.
> I think we can give a better explanation than that for the new things we add.

yes i find it useful. of course now we are writing it, we also need
it not to be confusing or partial.

-- 
MST




* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10 12:41           ` Michael S. Tsirkin
@ 2023-10-10 13:08             ` Parav Pandit
  2023-10-10 14:00               ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-10 13:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, October 10, 2023 6:11 PM

> All this does beg the question of how is device undergoing flr though and that
> has to be in a normative statement.
> 
Yes, will add.

> > Further patch-3 adds the device context and also add the link to it in the
> theory of operation section so reader can read more detail about it.
> 
> mention this in the commit log pls
> 
Ack. Will add.

> > > "Any" is probably too hard for vendors to implement. And in patch 3
> > > I only see virtio device context. Does this mean we don't need
> > > transport
> > > (PCI) context at all? If yes, how can it work?
> > >
> > Right. PCI member device is present at source and destination with its layout,
> only the virtio device context is transferred.
> > Which part cannot work?
> 
> 
> wait don't we need to transfer pci state too? how is that migrated?
> 
The hypervisor driver composes the vPCI device, so there is no need to migrate the PCI state.
The only exception is VIRTIO_PCI_CAP_PCI_CFG, which is covered in this v1.


> > > >
> > > > > >and device configuration space may change. \\
> > > > > > +\hline
> > > > >
> > > > > I still don't get why we need a "stop" state in the middle.
> > > > >
> > > > All pci devices which belong to a single guest VM are not stopped
> atomically.
> > > > Hence, one device which is in freeze mode, may still receive
> > > > driver notifications from other pci device,
> > >
> > > Device may choose to ignore those notifications, no?
> > >
> > > > or it may experience a read from the shared memory and get garbage
> data.
> > >
> > > Could you give me an example for this?
> > >
> > Section 2.10 Shared Memory Regions.
> >
> > > > And things can break.
> > > > Hence the stop mode, ensures that all the devices get enough
> > > > chance to stop
> > > themselves, and later when freezed, to not change anything internally.
> > > >
> > > > > > +0x2   & Freeze &
> > > > > > + In this mode, the member device does not accept any driver
> > > > > > +notifications,
> > > > >
> > > > > This is too vague. Is the device allowed to be freezed in the
> > > > > middle of any virtio or PCI operations?
> > > > >
> > > > > For example, in the middle of feature negotiation etc. It may
> > > > > cause implementation specific sub-states which can't be migrated easily.
> > > > >
> > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > It is passthrough device, hence hypervisor layer do not get to see sub-
> state.
> > > >
> > > > Not sure why you comment, why it cannot be migrated easily.
> > > > The device context already covers this sub-state.
> > >
> > > > 1) driver writes driver_features
> > > > 2) driver sets FEATURES_OK
> > > >
> > > > 3) device receives driver_features
> > > > 4) device validates driver_features
> > > > 5) device clears FEATURES_OK
> > > >
> > > > 6) driver reads status and realizes FEATURES_OK has been cleared
> > > >
> > > > Is it valid to be frozen in the middle of the above?
> > No. device mode is frozen when hypervisor is sure that no more access by the
> guest will be done.
> > What can happen between #2 and #3, is device mode may change to stop.
> > And in stop mode, device context would capture #5 or #4, depending where is
> device at that point.
> >
> > > >
> > > > > And what's more, the above state machine seems to be virtio
> > > > > specific, but you don't explain the interaction with the device
> > > > > status state
> > > machine.
> > > > First, above is not a state machine.
> > >
> > > So how do readers know if a state can go to another state and when?
> > >
> > Not sure what you mean by reader. Can you please explain.
> >
> > > > Second, it is not virtio specific.
> > >
> > > It's somehow for sure, for example you said device context need to
> > > be preserved. And as far as I see the device context is all virtio specific in
> patch 3.
> > >
> > Sure, device context is virtio specific. :) Device context will
> > reflect if things changed in the stop mode.
> >
> > > > It is present in leading OS that has fundamental requirement to
> > > > support P2P
> > > devices.
> > >
> > > If it's PCI specific, instead of trying to do a workaround in
> > > virtio, why not invent a mechanism there?
> > >
> > It is not a workaround in virtio.
> > It is the way pci p2p devices work for which one needs to be receptive to
> handle the interaction.
> >
> >
> > > > > Third, it is not interacting with the _actual_ device status.
> > > >
> > > > In "SUSPEND" patch-5, you already asked this question. I assume
> > > > you asked
> > > again so that this series is complete.
> > > >
> > > > > For example,
> > > > > what happens if the driver wants to reset but the device is in
> > > > > stop mode? You told me it is addressed in your series but looks
> > > > > not. Once you try to describe that, you're actually try to
> > > > > connect states between the
> > > two state machines.
> > > > >
> > > > As listed in the definition of the stop mode, the device do not
> > > > act on the
> > > incoming writes, it only keep tracks of its internal device context
> > > change as part of this.
> > >
> > > > So only the driver notification is allowed but not config write?
> > > What's the consideration for allowing driver notification?
> > >
> > Because for most practical purposes, peer device wants to queue blk, net
> other requests and not do device configuration.
> >
> > Do you know any device configuration space which is RW?
> > For net and blk I recall it as RO?
> 
> No it isn't. Pls look at the spec if you need to check that ;)
> 
Ok, will check. But regardless, it is fine, because once STOP is done, config writes should not occur anyway.
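
To make the mode semantics discussed here concrete, a minimal sketch follows. It is not spec text: only Freeze = 0x2 is taken from the table in the patch, while the numeric values for Active and Stop and all identifier names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Member device modes. Only FREEZE = 0x2 comes from the quoted table;
 * the ACTIVE and STOP values are assumed for this sketch. */
enum dev_mode {
	DEV_MODE_ACTIVE = 0x0,	/* assumed value */
	DEV_MODE_STOP   = 0x1,	/* assumed value */
	DEV_MODE_FREEZE = 0x2,
};

/* Only an active device initiates notifications or accesses driver memory. */
static bool dev_may_initiate(enum dev_mode m)
{
	return m == DEV_MODE_ACTIVE;
}

/* Driver notifications (e.g. from a P2P peer) are still accepted in STOP. */
static bool dev_accepts_driver_notify(enum dev_mode m)
{
	return m != DEV_MODE_FREEZE;
}

/* The device context may keep changing until the device is frozen. */
static bool dev_ctx_may_change(enum dev_mode m)
{
	return m != DEV_MODE_FREEZE;
}
```

The point of the middle mode shows up in the second predicate: a stopped device is quiet itself but still tolerates incoming notifications, and only Freeze guarantees a stable context.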

> 
> > > Let me ask differently, similar to FLR, what happens if the driver
> > > wants a virtio reset but the hypervisor wants to stop or freeze?
> > >
> > The device would respond to stop/freeze request when it has internally
> started the reset, as device is the single synchronization point which knows how
> to handle both in parallel.
> >
> > > > We would enrich the device context for this, but no need to
> > > > connects the
> > > admin mode controlled by the owner device with operational state
> > > (device_status) owned by the member device.
> > > >
> > > > > > + it ignores any device configuration space writes,
> > > > >
> > > > > How about read and the device configuration changes?
> > > > >
> > > > As listed, device do not have any changes.
> > > > So device configuration change cannot occur.
> > >
> > > It's not necessarily caused by config write, it could be things like
> > > link status or geometry changes that are initiated from the device.
> > >
> > I understand it. Link status was one example, you listed other examples too.
> > The point is, when in freeze mode, the member device is frozen, hence,
> device won't initiate those changes.
> >
> > > >
> > > > The device requirements cover this content more explicitly:
> > > >
> > > > For the SR-IOV group type, regardless of the member device mode,
> > > > all the PCI transport level registers MUST be always accessible
> > > > and the member device MUST function the same way for all the PCI
> > > > transport level
> > > registers regardless of the member device mode.
> > > >
> > > > > > + the device do not have any changes in the device context.
> > > > > > + The member device is not accessed in the system through the
> > > > > > + virtio
> > > interface.
> > > > > > + \\
> > > > >
> > > > > But accessible via PCI interface?
> > > > >
> > > > Yes, as usual.
> > > >
> > > > > For example, what happens if we want to freeze during FLR? Does
> > > > > the hypervisor need to wait for the FLR to be completed?
> > > > >
> > > > Hypervisor do not need wait for the FLR to be completed.
> > >
> > > So does FLR change device context?
> > Yes.
> >
> > >
> > > >
> > > > > > +\hline
> > > > > > +\hline
> > > > > > +0x03-0xFF   & -    & reserved for future use \\
> > > > > > +\hline
> > > > > > +\end{tabularx}
> > > > > > +
> > > > > > +When the owner driver wants to stop the operation of the
> > > > > > +device, the owner driver sets the device mode to
> > > > > > +\field{Stop}. Once the device is in the \field{Stop} mode,
> > > > > > +the device does not initiate any notifications or does not
> > > > > > +access any driver memory. Since the member driver may be
> > > > > > +still active which may send further driver notifications to the device,
> the device context may be updated.
> > > > > > +When the member driver has stopped accessing the device, the
> > > > > > +owner driver sets the device to \field{Freeze} mode
> > > > > > +indicating to the device that no more driver access occurs.
> > > > > > +In the \field{Freeze} mode, no more changes occur in the device
> context.
> > > > > > +At this point, the device ensures that
> > > > > there will not be any update to the device context.
> > > > >
> > > > > What is missed here are:
> > > > >
> > > > > 1) it is a virtio specific states or not
> > > > It is not.
> > > >
> > > > > 2) if it is a virtio specific state, if or how to synchronize
> > > > > with transport specific interfaces and why
> > > > > 3) can active go directly to freeze and why
> > > > >
> > > > Yes. don’t see a reason to not allow it.
> > > > Active to freeze mode can change is useful on the destination
> > > > side, where
> > > destination hypervisor knows for sure that there is no other entity
> > > accessing the device.
> > > > And it needs to setup the device context, it received from the source side.
> > > > So setting freeze mode can be done directly.
> > > >
> > > > > > +
> > > > > > +The member device has a device context which the owner driver
> > > > > > +can either read or write. The member device context consist
> > > > > > +of any device specific data which is needed by the device to
> > > > > > +resume its operation when the device mode
> > > > >
> > > > > This is too vague. There're states that are not suitable for
> > > > > cmd/queue for
> > > sure.
> > > > > I'd split it into
> > > > >
> > > > > 1) common states: virtqueue, dirty pages
> > > > > > 2) device specific states: defined by each device
> > > > >
> > > > This is theory of operation section. So it capturing such details.
> > > > Actual device context definition is outside of theory, and precise
> > > > states of
> > > virtqueue, device specific, etc are in it.
> > >
> > > See my comment above regarding to the device context.
> > >
> > I replied above, device context link is added in the patch-3 in the theory of
> operation.
> > So reader gets the complete view.
> >
> > > >
> > > > > > +is changed from \field{Stop} to \field{Active} or from
> > > > > > +\field{Freeze} to \field{Active}.
> > > > > > +
> > > > > > +Once the device context is read, it is cleared from the device.
> > > > >
> > > > > This is horrible, it means we can't easily
> > > > >
> > > > > 1) re-try the migration
> > > > > 2) recover from migration failure
> > > > >
> > > > Can you please explain the flow?
> > >
> > > When migration fails, management can choose to resume the device(VM)
> > > on the source.
> > >
> > ok. This should be possible as the management which has the device
> > context, it can restore it on the source and move the device mode to active.
> >
> > > If the state were cleared, it means there's not simple way to resume
> > > the device but restoring the whole context.
> > >
> > Yes, as you say, by restoring the whole context will suffice this corner/rare
> case scenario.
> >
> > > What's the consideration for such clearing?
> > >
> > There are two considerations.
> > 1.  If one does not clear, till how long should it be kept on the device?
> > 2. device context returns incremental value from the previous read. So, it
> needs to clear it.
> >
> > > > And which software stack may find this useful?
> > > > Is there any existing software that can utilize it?
> > >
> > > Libvirt.
> > >
> > Does libvirt restore on migration failure?
> 
> yes
> 
Ok. The management software has access to the context to restore.
Alternatively, only the incremental context becomes unavailable once it is read; the frozen device still has its frozen context.
So it can be marked active too.
I will double check whether I captured this in the normative text.
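
As a rough illustration of the read-then-clear behaviour and the restore path, here is a small sketch. The fake device, the four-field context and every function name are hypothetical; the real context format is defined in patch 3 and is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CTX_FIELDS 4	/* hypothetical context size */

/* Hypothetical device model: context fields plus a "changed since the
 * previous read" flag per field. */
struct fake_dev {
	uint32_t field[CTX_FIELDS];
	bool dirty[CTX_FIELDS];
};

/* Device-side change to one context field. */
static void dev_update(struct fake_dev *d, unsigned int i, uint32_t v)
{
	d->field[i] = v;
	d->dirty[i] = true;
}

/* Read only what changed since the previous read into the copy kept by
 * management software, then clear the pending delta on the device.
 * Returns the number of fields reported. */
static unsigned int ctx_read_delta(struct fake_dev *d, uint32_t *saved)
{
	unsigned int i, n = 0;

	for (i = 0; i < CTX_FIELDS; i++) {
		if (d->dirty[i]) {
			saved[i] = d->field[i];
			d->dirty[i] = false;
			n++;
		}
	}
	return n;
}

/* Write the accumulated context back, e.g. on the source after a failed
 * migration, before moving the device to active again. */
static void ctx_write(struct fake_dev *d, const uint32_t *saved)
{
	memcpy(d->field, saved, sizeof(d->field));
	memset(d->dirty, 0, sizeof(d->dirty));
}
```

In this model a second read with no intervening change reports nothing, while the copy accumulated by the management software stays complete enough to be written back.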

> > > > Why that device context present with the software vanished, in
> > > > your
> > > assumption, if it is?
> > > >
> > > > > > Typically, on
> > > > > > +the source hypervisor, the owner driver reads the device
> > > > > > +context once when the device is in \field{Active} or
> > > > > > +\field{Stop} mode and later once the member device is in
> \field{Freeze} mode.
> > > > >
> > > > > Why need the read while device context could be changed? Or is
> > > > > the dirty page part of the device context?
> > > > >
> > > > It is not part of the dirty page.
> > > > It needs to read in the active/stop mode, so that it can be shared
> > > > with
> > > destination hypervisor, which will pre-setup the complex context of
> > > the device, while it is still running on the source side.
> > >
> > > Is such a method used by any hypervisor?
> > Yes. qemu which uses vfio interface uses it.
> >
> > >
> > > >
> > > > > > +
> > > > > > +Typically, the device context is read and written one time on
> > > > > > +the source and the destination hypervisor respectively once
> > > > > > +the device is in \field{Freeze} mode. On the destination
> > > > > > +hypervisor, after writing the device context, when the device
> > > > > > +mode set to \field{Active}, the device uses the most recently
> > > > > > +set device context and resumes the device
> > > > > operation.
> > > > >
> > > > > There's no context sequence, so this is obvious. It's the
> > > > > semantic of all other existing interfaces.
> > > > >
> > > > > Can you please say which existing interfaces you mean here?
> > >
> > > For any common cfg member. E.g queue_addr.
> > >
> > > The driver wrote 100 different values to queue_addr and the device
> > > used the value written last time.
> > >
> > o.k. I don’t see any problem in stating what is done, which is less
> > vague. 😊
> >
> > > >
> > > > > > +
> > > > > > +In an alternative flow, on the source hypervisor the owner
> > > > > > +driver may choose to read the device context first time while
> > > > > > +the device is in \field{Active} mode and second time once the
> > > > > > +device is in \field{Freeze}
> > > > > mode.
> > > > >
> > > > > Who is going to synchronize the device context with possible
> > > > > configuration from the driver?
> > > > >
> > > > Not sure I understand the question.
> > > > If I understand you right, do you mean that, When configuration
> > > > change is done by the guest driver, how does device context change?
> > > >
> > >
> > > Yes.
> > >
> > > > If so, device context reading will reflect the new configuration.
> > >
> > > How do you do that? For example:
> > >
> > > static inline void vp_iowrite64_twopart(u64 val,
> > >                                         __le32 __iomem *lo,
> > >                                         __le32 __iomem *hi) {
> > >         vp_iowrite32((u32)val, lo);
> > >         vp_iowrite32(val >> 32, hi); }
> > >
> > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > >
> > Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> section captures the partial value.
> >
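
A small sketch of the partial-value case above. The register struct and helper names are hypothetical, and this is not the actual layout of the VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG section; it only shows what a snapshot taken between the two 32-bit writes would hold.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit register exposed as two 32-bit halves. */
struct fake_reg64 {
	uint32_t lo;
	uint32_t hi;
};

/* First half of the two-part write: low 32 bits. */
static void reg_write_lo(struct fake_reg64 *r, uint64_t val)
{
	r->lo = (uint32_t)val;
}

/* Second half of the two-part write: high 32 bits. */
static void reg_write_hi(struct fake_reg64 *r, uint64_t val)
{
	r->hi = (uint32_t)(val >> 32);
}

/* What a context read would capture at this instant. */
static uint64_t reg_snapshot(const struct fake_reg64 *r)
{
	return ((uint64_t)r->hi << 32) | r->lo;
}
```

If the snapshot lands between the two writes, it carries the new low half and the old high half; once that value is restored on the destination, the driver's pending high-half write completes the update as if nothing had happened.
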
> > > >
> > > > > > Similarly, on the
> > > > > > +destination hypervisor writes the device context first time
> > > > > > +while the device is still running in \field{Active} mode on
> > > > > > +the source hypervisor and writes the device context second
> > > > > > +time while the device is in
> > > > > \field{Freeze} mode.
> > > > > > +This flow may result in very short setup time as the device
> > > > > > +context likely have minimal changes from the previously
> > > > > > +written device
> > > context.
> > > > >
> > > > > Is the hypervisor who is in charge of doing the comparison and
> > > > > writing only the delta?
> > > > >
> > > > The spec commands allow to do so. So possibility exists from spec wise.
> > >
> > > There are various optimizations for migration for sure, I don't
> > > think mentioning any specific one is good.
> > >
> > The text is informative text similar to,
> >
> > " However, some devices benefit from the ability to find out the
> > amount of available data in the queue without accessing the virtqueue in
> memory"
> >
> > " To help with these optimizations, when VIRTIO_F_NOTIFICATION_DATA has
> been negotiated".
> >
> > Is this the only optimization in virtio? No, but we still mention the rationale of
> why it exists.
> > As long as the rationale do not confuse the reader, and adds the value
> explaining how things work, it is fine to add.
> > Which is what above few lines did.
> > So let's keep it.
> >
> > The easiest is to cut out the whole theory of operation and just write
> commands like how RSS command did, without even writing a single line about
> RSS.
> > I think we can do better explanation than that for new things we add.
> 
> yes i find it useful. of course now we are writing it, we also need it not to be
> confusing or partial.
Ok. Thanks.


* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10 13:08             ` Parav Pandit
@ 2023-10-10 14:00               ` Michael S. Tsirkin
  2023-10-10 14:09                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-10 14:00 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Tue, Oct 10, 2023 at 01:08:22PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Tuesday, October 10, 2023 6:11 PM
> 
> > All this does beg the question of how is device undergoing flr though and that
> > has to be in a normative statement.
> > 
> Yes, will add.
> 
> > > Further patch-3 adds the device context and also add the link to it in the
> > theory of operation section so reader can read more detail about it.
> > 
> > mention this in the commit log pls
> > 
> Ack. Will add.
> 
> > > > "Any" is probably too hard for vendors to implement. And in patch 3
> > > > I only see virtio device context. Does this mean we don't need
> > > > transport
> > > > (PCI) context at all? If yes, how can it work?
> > > >
> > > Right. PCI member device is present at source and destination with its layout,
> > only the virtio device context is transferred.
> > > Which part cannot work?
> > 
> > 
> > wait don't we need to transfer pci state too? how is that migrated?
> > 
> The hypervisor driver composes the vPCI device, so there is no need to migrate the PCI state.
> The only exception is VIRTIO_PCI_CAP_PCI_CFG, which is covered in this v1.
> 

yes but what seems implicit is that the device is in some reasonable state
when this thing happens. e.g. are there no limitations at all, such as on the
order in which things happen? can you really first configure virtio
and then pci config? for sure?

> > > > >
> > > > > > >and device configuration space may change. \\
> > > > > > > +\hline
> > > > > >
> > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > >
> > > > > All pci devices which belong to a single guest VM are not stopped
> > atomically.
> > > > > Hence, one device which is in freeze mode, may still receive
> > > > > driver notifications from other pci device,
> > > >
> > > > Device may choose to ignore those notifications, no?
> > > >
> > > > > or it may experience a read from the shared memory and get garbage
> > data.
> > > >
> > > > Could you give me an example for this?
> > > >
> > > Section 2.10 Shared Memory Regions.
> > >
> > > > > And things can break.
> > > > > Hence the stop mode, ensures that all the devices get enough
> > > > > chance to stop
> > > > themselves, and later when freezed, to not change anything internally.
> > > > >
> > > > > > > +0x2   & Freeze &
> > > > > > > + In this mode, the member device does not accept any driver
> > > > > > > +notifications,
> > > > > >
> > > > > > This is too vague. Is the device allowed to be freezed in the
> > > > > > middle of any virtio or PCI operations?
> > > > > >
> > > > > > For example, in the middle of feature negotiation etc. It may
> > > > > > cause implementation specific sub-states which can't be migrated easily.
> > > > > >
> > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > It is passthrough device, hence hypervisor layer do not get to see sub-
> > state.
> > > > >
> > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > The device context already covers this sub-state.
> > > >
> > > > > 1) driver writes driver_features
> > > > > 2) driver sets FEATURES_OK
> > > > >
> > > > > 3) device receives driver_features
> > > > > 4) device validates driver_features
> > > > > 5) device clears FEATURES_OK
> > > > >
> > > > > 6) driver reads status and realizes FEATURES_OK has been cleared
> > > > >
> > > > > Is it valid to be frozen in the middle of the above?
> > > No. device mode is frozen when hypervisor is sure that no more access by the
> > guest will be done.
> > > What can happen between #2 and #3, is device mode may change to stop.
> > > And in stop mode, device context would capture #5 or #4, depending where is
> > device at that point.
> > >
> > > > >
> > > > > > And what's more, the above state machine seems to be virtio
> > > > > > specific, but you don't explain the interaction with the device
> > > > > > status state
> > > > machine.
> > > > > First, above is not a state machine.
> > > >
> > > > So how do readers know if a state can go to another state and when?
> > > >
> > > Not sure what you mean by reader. Can you please explain.
> > >
> > > > > Second, it is not virtio specific.
> > > >
> > > > It's somehow for sure, for example you said device context need to
> > > > be preserved. And as far as I see the device context is all virtio specific in
> > patch 3.
> > > >
> > > Sure, device context is virtio specific. :) Device context will
> > > reflect if things changed in the stop mode.
> > >
> > > > > It is present in leading OS that has fundamental requirement to
> > > > > support P2P
> > > > devices.
> > > >
> > > > If it's PCI specific, instead of trying to do a workaround in
> > > > virtio, why not invent a mechanism there?
> > > >
> > > It is not a workaround in virtio.
> > > It is the way pci p2p devices work for which one needs to be receptive to
> > handle the interaction.
> > >
> > >
> > > > > > Third, it is not interacting with the _actual_ device status.
> > > > >
> > > > > In "SUSPEND" patch-5, you already asked this question. I assume
> > > > > you asked
> > > > again so that this series is complete.
> > > > >
> > > > > > For example,
> > > > > > what happens if the driver wants to reset but the device is in
> > > > > > stop mode? You told me it is addressed in your series but looks
> > > > > > not. Once you try to describe that, you're actually try to
> > > > > > connect states between the
> > > > two state machines.
> > > > > >
> > > > > As listed in the definition of the stop mode, the device do not
> > > > > act on the
> > > > incoming writes, it only keep tracks of its internal device context
> > > > change as part of this.
> > > >
> > > > > So only the driver notification is allowed but not config write?
> > > > What's the consideration for allowing driver notification?
> > > >
> > > Because for most practical purposes, peer device wants to queue blk, net
> > other requests and not do device configuration.
> > >
> > > Do you know any device configuration space which is RW?
> > > For net and blk I recall it as RO?
> > 
> > No it isn't. Pls look at the spec if you need to check that ;)
> > 
> Ok, will check. But regardless, it is fine, because once STOP is done, config writes should not occur anyway.


i don't see a statement like this but maybe i missed it.

> > 
> > > > Let me ask differently, similar to FLR, what happens if the driver
> > > > wants a virtio reset but the hypervisor wants to stop or freeze?
> > > >
> > > The device would respond to stop/freeze request when it has internally
> > started the reset, as device is the single synchronization point which knows how
> > to handle both in parallel.
> > >
> > > > > We would enrich the device context for this, but no need to
> > > > > connects the
> > > > admin mode controlled by the owner device with operational state
> > > > (device_status) owned by the member device.
> > > > >
> > > > > > > + it ignores any device configuration space writes,
> > > > > >
> > > > > > How about read and the device configuration changes?
> > > > > >
> > > > > As listed, device do not have any changes.
> > > > > So device configuration change cannot occur.
> > > >
> > > > It's not necessarily caused by config write, it could be things like
> > > > link status or geometry changes that are initiated from the device.
> > > >
> > > I understand it. Link status was one example, you listed other examples too.
> > > The point is, when in freeze mode, the member device is frozen, hence,
> > device won't initiate those changes.
> > >
> > > > >
> > > > > The device requirements cover this content more explicitly:
> > > > >
> > > > > For the SR-IOV group type, regardless of the member device mode,
> > > > > all the PCI transport level registers MUST be always accessible
> > > > > and the member device MUST function the same way for all the PCI
> > > > > transport level
> > > > registers regardless of the member device mode.
> > > > >
> > > > > > > + the device do not have any changes in the device context.
> > > > > > > + The member device is not accessed in the system through the
> > > > > > > + virtio
> > > > interface.
> > > > > > > + \\
> > > > > >
> > > > > > But accessible via PCI interface?
> > > > > >
> > > > > Yes, as usual.
> > > > >
> > > > > > For example, what happens if we want to freeze during FLR? Does
> > > > > > the hypervisor need to wait for the FLR to be completed?
> > > > > >
> > > > > Hypervisor do not need wait for the FLR to be completed.
> > > >
> > > > So does FLR change device context?
> > > Yes.
> > >
> > > >
> > > > >
> > > > > > > +\hline
> > > > > > > +\hline
> > > > > > > +0x03-0xFF   & -    & reserved for future use \\
> > > > > > > +\hline
> > > > > > > +\end{tabularx}
> > > > > > > +
> > > > > > > +When the owner driver wants to stop the operation of the
> > > > > > > +device, the owner driver sets the device mode to
> > > > > > > +\field{Stop}. Once the device is in the \field{Stop} mode,
> > > > > > > +the device does not initiate any notifications or does not
> > > > > > > +access any driver memory. Since the member driver may be
> > > > > > > +still active which may send further driver notifications to the device,
> > the device context may be updated.
> > > > > > > +When the member driver has stopped accessing the device, the
> > > > > > > +owner driver sets the device to \field{Freeze} mode
> > > > > > > +indicating to the device that no more driver access occurs.
> > > > > > > +In the \field{Freeze} mode, no more changes occur in the device
> > context.
> > > > > > > +At this point, the device ensures that
> > > > > > there will not be any update to the device context.
> > > > > >
> > > > > > What is missed here are:
> > > > > >
> > > > > > 1) it is a virtio specific states or not
> > > > > It is not.
> > > > >
> > > > > > 2) if it is a virtio specific state, if or how to synchronize
> > > > > > with transport specific interfaces and why
> > > > > > 3) can active go directly to freeze and why
> > > > > >
> > > > > Yes. don’t see a reason to not allow it.
> > > > > Active to freeze mode can change is useful on the destination
> > > > > side, where
> > > > destination hypervisor knows for sure that there is no other entity
> > > > accessing the device.
> > > > > And it needs to setup the device context, it received from the source side.
> > > > > So setting freeze mode can be done directly.
> > > > >
> > > > > > > +
> > > > > > > +The member device has a device context which the owner driver
> > > > > > > +can either read or write. The member device context consist
> > > > > > > +of any device specific data which is needed by the device to
> > > > > > > +resume its operation when the device mode
> > > > > >
> > > > > > This is too vague. There're states that are not suitable for
> > > > > > cmd/queue for
> > > > sure.
> > > > > > I'd split it into
> > > > > >
> > > > > > 1) common states: virtqueue, dirty pages
> > > > > > > 2) device specific states: defined by each device
> > > > > >
> > > > > This is theory of operation section. So it capturing such details.
> > > > > Actual device context definition is outside of theory, and precise
> > > > > states of
> > > > virtqueue, device specific, etc are in it.
> > > >
> > > > See my comment above regarding to the device context.
> > > >
> > > I replied above, device context link is added in the patch-3 in the theory of
> > operation.
> > > So reader gets the complete view.
> > >
> > > > >
> > > > > > > +is changed from \field{Stop} to \field{Active} or from
> > > > > > > +\field{Freeze} to \field{Active}.
> > > > > > > +
> > > > > > > +Once the device context is read, it is cleared from the device.
> > > > > >
> > > > > > This is horrible, it means we can't easily
> > > > > >
> > > > > > 1) re-try the migration
> > > > > > 2) recover from migration failure
> > > > > >
> > > > > Can you please explain the flow?
> > > >
> > > > When migration fails, management can choose to resume the device(VM)
> > > > on the source.
> > > >
> > > ok. This should be possible as the management which has the device
> > > context, it can restore it on the source and move the device mode to active.
> > >
> > > > If the state were cleared, it means there's not simple way to resume
> > > > the device but restoring the whole context.
> > > >
> > > Yes, as you say, by restoring the whole context will suffice this corner/rare
> > case scenario.
> > >
> > > > What's the consideration for such clearing?
> > > >
> > > There are two considerations.
> > > 1.  If one does not clear, till how long should it be kept on the device?
> > > 2. device context returns incremental value from the previous read. So, it
> > needs to clear it.
> > >
> > > > > And which software stack may find this useful?
> > > > > Is there any existing software that can utilize it?
> > > >
> > > > Libvirt.
> > > >
> > > Does libvirt restore on migration failure?
> > 
> > yes
> > 
> Ok. The management software has access to the context to restore.
> Alternatively, only the incremental context becomes unavailable once it is read; the frozen device still has its frozen context.
> So it can be marked active too.
> I will double check whether I captured this in the normative text.

preferable if context is not erased on the device IMHO.
less of a chance of a failure to resume.

> > > > > Why that device context present with the software vanished, in
> > > > > your
> > > > assumption, if it is?
> > > > >
> > > > > > > Typically, on
> > > > > > > +the source hypervisor, the owner driver reads the device
> > > > > > > +context once when the device is in \field{Active} or
> > > > > > > +\field{Stop} mode and later once the member device is in
> > \field{Freeze} mode.
> > > > > >
> > > > > > Why need the read while device context could be changed? Or is
> > > > > > the dirty page part of the device context?
> > > > > >
> > > > > It is not part of the dirty page.
> > > > > It needs to read in the active/stop mode, so that it can be shared
> > > > > with
> > > > destination hypervisor, which will pre-setup the complex context of
> > > > the device, while it is still running on the source side.
> > > >
> > > > Is such a method used by any hypervisor?
> > > Yes. qemu which uses vfio interface uses it.
> > >
> > > >
> > > > >
> > > > > > > +
> > > > > > > +Typically, the device context is read and written one time on
> > > > > > > +the source and the destination hypervisor respectively once
> > > > > > > +the device is in \field{Freeze} mode. On the destination
> > > > > > > +hypervisor, after writing the device context, when the device
> > > > > > > +mode set to \field{Active}, the device uses the most recently
> > > > > > > +set device context and resumes the device
> > > > > > operation.
> > > > > >
> > > > > > There's no context sequence, so this is obvious. It's the
> > > > > > semantic of all other existing interfaces.
> > > > > >
> > > > > > Can you please explain which existing interfaces you mean here?
> > > >
> > > > For any common cfg member. E.g queue_addr.
> > > >
> > > > The driver wrote 100 different values to queue_addr and the device
> > > > used the value written last time.
> > > >
> > > o.k. I don’t see any problem in stating what is done, which is less
> > > vague. 😊
> > >
> > > > >
> > > > > > > +
> > > > > > > +In an alternative flow, on the source hypervisor the owner
> > > > > > > +driver may choose to read the device context first time while
> > > > > > > +the device is in \field{Active} mode and second time once the
> > > > > > > +device is in \field{Freeze}
> > > > > > mode.
> > > > > >
> > > > > > Who is going to synchronize the device context with possible
> > > > > > configuration from the driver?
> > > > > >
> > > > > Not sure I understand the question.
> > > > > If I understand you right, do you mean that, When configuration
> > > > > change is done by the guest driver, how does device context change?
> > > > >
> > > >
> > > > Yes.
> > > >
> > > > > If so, device context reading will reflect the new configuration.
> > > >
> > > > How do you do that? For example:
> > > >
> > > > static inline void vp_iowrite64_twopart(u64 val,
> > > >                                         __le32 __iomem *lo,
> > > >                                         __le32 __iomem *hi)
> > > > {
> > > >         vp_iowrite32((u32)val, lo);
> > > >         vp_iowrite32(val >> 32, hi);
> > > > }
> > > >
> > > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > > >
> > > Yes. The device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG section
> > > captures the partial value.
> > >
> > > > >
> > > > > > > Similarly, on the
> > > > > > > +destination hypervisor writes the device context first time
> > > > > > > +while the device is still running in \field{Active} mode on
> > > > > > > +the source hypervisor and writes the device context second
> > > > > > > +time while the device is in
> > > > > > \field{Freeze} mode.
> > > > > > > +This flow may result in very short setup time as the device
> > > > > > > +context likely have minimal changes from the previously
> > > > > > > +written device
> > > > context.
> > > > > >
> > > > > > Is the hypervisor who is in charge of doing the comparison and
> > > > > > writing only the delta?
> > > > > >
> > > > > The spec commands allow doing so. So the possibility exists as far as the spec is concerned.
> > > >
> > > > There are various optimizations for migration for sure, I don't
> > > > think mentioning any specific one is good.
> > > >
> > > The text is informative text similar to,
> > >
> > > " However, some devices benefit from the ability to find out the
> > > amount of available data in the queue without accessing the virtqueue in
> > memory"
> > >
> > > " To help with these optimizations, when VIRTIO_F_NOTIFICATION_DATA has
> > been negotiated".
> > >
> > > Is this the only optimization in virtio? No, but we still mention the rationale
> > > of why it exists.
> > > As long as the rationale does not confuse the reader, and adds value by
> > > explaining how things work, it is fine to add.
> > > Which is what the above few lines did.
> > > So let's keep it.
> > >
> > > The easiest is to cut out the whole theory of operation and just write
> > > commands like how the RSS command did, without even writing a single line
> > > about RSS.
> > > I think we can do a better explanation than that for new things we add.
> > 
> > yes i find it useful. of course now we are writing it, we also need it not to be
> > confusing or partial.
> Ok. Thanks.
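
To make the two-part write question above concrete, here is a minimal sketch in plain C. It models the two 32-bit halves of a 64-bit queue address as ordinary memory rather than MMIO. The names `struct queue_addr_regs`, `write_lo`, `write_hi` and `snapshot` are hypothetical and exist only for this illustration; they are not part of the kernel or of the proposed commands. The sketch only shows what a context read would observe if the device were stopped between the two 32-bit writes: the new low half paired with the old high half.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit register exposed as two 32-bit halves. */
struct queue_addr_regs {
	uint32_t lo;
	uint32_t hi;
};

/* First half of a two-part write: store the low 32 bits. */
static void write_lo(struct queue_addr_regs *r, uint64_t val)
{
	r->lo = (uint32_t)val;
}

/* Second half of a two-part write: store the high 32 bits. */
static void write_hi(struct queue_addr_regs *r, uint64_t val)
{
	r->hi = (uint32_t)(val >> 32);
}

/* Value a context read taken at this instant would report (assumption). */
static uint64_t snapshot(const struct queue_addr_regs *r)
{
	return ((uint64_t)r->hi << 32) | r->lo;
}
```

If `snapshot` is taken after `write_lo` but before `write_hi`, it returns a mixed value. Whether a real device reports such a partial value is exactly what the reply about VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG claims; the sketch does not verify that claim.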


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10 14:00               ` Michael S. Tsirkin
@ 2023-10-10 14:09                 ` Parav Pandit
  2023-10-10 14:55                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-10 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, October 10, 2023 7:30 PM

> > The hypervisor driver composes the vPCI device. So there isn’t a need to
> > migrate the pci state.
> > Only exception is VIRTIO_PCI_CAP_PCI_CFG, which is covered in this v1.
> >
> 
> yes but what seems implicit is that device is in some reasonable state when
> this thing happens. e.g. are there no limitations at all e.g. in which order
> things happen? can you really first configure virtio then pci config? for sure?
> 
First the PCI config is set up, e.g. bus master enable.
After that point, the device is handed over to virtio.

From the device context write perspective, I doubt the order matters.
For example, whether PCI bus master and MSI-X are enabled before or after the device context restore would not matter much,
as long as they are enabled before setting the device mode to active.
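
A minimal sketch of the one ordering constraint stated above, assuming it is the only one: PCI setup and the device context write may happen in either order, but both must be done before the mode is set to active. `struct dst_member_dev`, `pci_setup`, `write_context` and `set_mode_active` are hypothetical names for illustration only and do not correspond to any command in this series.

```c
#include <stdbool.h>

/* Hypothetical destination-side bookkeeping; not a spec structure. */
struct dst_member_dev {
	bool pci_configured;  /* bus master, MSI-X etc. have been set up */
	bool context_written; /* device context write has completed */
	bool active;          /* device mode has been set to active */
};

/* Models the PCI config setup step. */
static void pci_setup(struct dst_member_dev *d)
{
	d->pci_configured = true;
}

/* Models the device context write step. */
static void write_context(struct dst_member_dev *d)
{
	d->context_written = true;
}

/* Returns 0 on success, -1 if a prerequisite step is still missing. */
static int set_mode_active(struct dst_member_dev *d)
{
	if (!d->pci_configured || !d->context_written)
		return -1;
	d->active = true;
	return 0;
}
```

Calling `pci_setup` then `write_context`, or the reverse, both allow `set_mode_active` to succeed; skipping either makes it fail.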

> > > > > >
> > > > > > > >and device configuration space may change. \\
> > > > > > > > +\hline
> > > > > > >
> > > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > > >
> > > > > > All pci devices which belong to a single guest VM are not stopped
> > > > > > atomically.
> > > > > > Hence, one device which is in freeze mode, may still receive
> > > > > > driver notifications from other pci device,
> > > > >
> > > > > Device may choose to ignore those notifications, no?
> > > > >
> > > > > > or it may experience a read from the shared memory and get
> > > > > > garbage
> > > data.
> > > > >
> > > > > Could you give me an example for this?
> > > > >
> > > > Section 2.10 Shared Memory Regions.
> > > >
> > > > > > And things can break.
> > > > > > Hence the stop mode, ensures that all the devices get enough
> > > > > > chance to stop
> > > > > themselves, and later when freezed, to not change anything internally.
> > > > > >
> > > > > > > > +0x2   & Freeze &
> > > > > > > > + In this mode, the member device does not accept any
> > > > > > > > +driver notifications,
> > > > > > >
> > > > > > > This is too vague. Is the device allowed to be freezed in
> > > > > > > the middle of any virtio or PCI operations?
> > > > > > >
> > > > > > > For example, in the middle of feature negotiation etc. It
> > > > > > > may cause implementation specific sub-states which can't be
> migrated easily.
> > > > > > >
> > > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > > It is passthrough device, hence hypervisor layer do not get to
> > > > > > see sub-
> > > state.
> > > > > >
> > > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > > The device context already covers this sub-state.
> > > > >
> > > > > 1) driver writes driver_features
> > > > > 2) driver sets FEATURES_OK
> > > > >
> > > > > 3) device receives driver_features
> > > > > 4) device validates driver_features
> > > > > 5) device clears FEATURES_OK
> > > > >
> > > > > 6) driver reads status and realizes FEATURES_OK has been cleared
> > > > >
> > > > > Is it valid to be frozen in the middle of the above?
> > > > No. The device mode is set to freeze only when the hypervisor is sure that no
> > > > more access by the guest will be done.
> > > > What can happen between #2 and #3 is that the device mode may change to stop.
> > > > And in stop mode, the device context would capture #5 or #4, depending on
> > > > where the device is at that point.
> > > >
> > > > > >
> > > > > > > And what's more, the above state machine seems to be virtio
> > > > > > > specific, but you don't explain the interaction with the
> > > > > > > device status state
> > > > > machine.
> > > > > > First, above is not a state machine.
> > > > >
> > > > > So how do readers know if a state can go to another state and when?
> > > > >
> > > > Not sure what you mean by reader. Can you please explain.
> > > >
> > > > > > Second, it is not virtio specific.
> > > > >
> > > > > It's somehow for sure, for example you said device context need
> > > > > to be preserved. And as far as I see the device context is all
> > > > > virtio specific in
> > > patch 3.
> > > > >
> > > > Sure, device context is virtio specific. :) Device context will
> > > > reflect if things changed in the stop mode.
> > > >
> > > > > > It is present in leading OS that has fundamental requirement
> > > > > > to support P2P
> > > > > devices.
> > > > >
> > > > > If it's PCI specific, instead of trying to do a workaround in
> > > > > virtio, why not invent a mechanism there?
> > > > >
> > > > It is not a workaround in virtio.
> > > > It is the way pci p2p devices work for which one needs to be
> > > > receptive to
> > > handle the interaction.
> > > >
> > > >
> > > > > > Third, it is not interacting with the _actual_ device status.
> > > > > >
> > > > > > In "SUSPEND" patch-5, you already asked this question. I
> > > > > > assume you asked
> > > > > again so that this series is complete.
> > > > > >
> > > > > > > For example,
> > > > > > > what happens if the driver wants to reset but the device is
> > > > > > > in stop mode? You told me it is addressed in your series but
> > > > > > > looks not. Once you try to describe that, you're actually
> > > > > > > try to connect states between the
> > > > > two state machines.
> > > > > > >
> > > > > > As listed in the definition of the stop mode, the device does not act
> > > > > > on the incoming writes; it only keeps track of its internal device
> > > > > > context change as part of this.
> > > > >
> > > > > So only the driver notification is allowed but not config write?
> > > > > What's the consideration for allowing driver notification?
> > > > >
> > > > Because for most practical purposes, a peer device wants to queue blk, net
> > > > or other requests and not do device configuration.
> > > >
> > > > Do you know any device configuration space which is RW?
> > > > For net and blk I recall it as RO?
> > >
> > > No it isn't. Pls look at the spec if you need to check that ;)
> > >
> > Ok. will check. But regardless, it is fine, because when STOP is done, config
> > writes should not occur anyway.
> 
> 
> i don't see a statement like this but maybe i missed it.
>
I am missing it, will add.
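
For reference, the three migration modes discussed in this thread can be sketched as a C enum with two predicates. The values (0x0 Active, 0x1 Stop, 0x2 Freeze) and the per-mode behaviour for Stop and for notifications in Freeze are taken from the patch text quoted in the thread. The identifiers `enum member_dev_mode`, `may_receive_driver_notifications` and `may_access_driver_memory` are hypothetical, and treating Freeze as also not accessing driver memory is an assumption based on the statement that a frozen device must not change anything internally.

```c
#include <stdbool.h>

/* Mode values as listed in the quoted patch table; names are hypothetical. */
enum member_dev_mode {
	MEMBER_DEV_MODE_ACTIVE = 0x0, /* default mode after instantiation */
	MEMBER_DEV_MODE_STOP   = 0x1, /* sends no notifications, no driver memory access */
	MEMBER_DEV_MODE_FREEZE = 0x2, /* accepts no driver notifications */
};

/* Stop may still receive driver notifications; Freeze does not accept them. */
static bool may_receive_driver_notifications(enum member_dev_mode m)
{
	return m == MEMBER_DEV_MODE_ACTIVE || m == MEMBER_DEV_MODE_STOP;
}

/*
 * Stop does not access driver memory per the quoted text.
 * Freeze is assumed to be at least as strict (assumption, see lead-in).
 */
static bool may_access_driver_memory(enum member_dev_mode m)
{
	return m == MEMBER_DEV_MODE_ACTIVE;
}
```

This is not a state machine: it says nothing about which mode changes are allowed, which is one of the open questions in the review.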

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10 14:09                 ` Parav Pandit
@ 2023-10-10 14:55                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-10 14:55 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Tue, Oct 10, 2023 at 02:09:27PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Tuesday, October 10, 2023 7:30 PM
> 
> > > The hypervisor driver composes the vPCI device. So there isn’t a need to
> > > migrate the pci state.
> > > Only exception is VIRTIO_PCI_CAP_PCI_CFG, which is covered in this v1.
> > >
> > 
> > yes but what seems implicit is that device is in some reasonable state when
> > this thing happens. e.g. are there no limitations at all e.g. in which order
> > things happen? can you really first configure virtio then pci config? for sure?
> > 
> First the PCI config is set up, e.g. bus master enable.
> After that point, the device is handed over to virtio.
>
> From the device context write perspective, I doubt the order matters.
> For example, whether PCI bus master and MSI-X are enabled before or after the device context restore would not matter much,
> as long as they are enabled before setting the device mode to active.

whatever the requirements, document them.

> > > > > > >
> > > > > > > > >and device configuration space may change. \\
> > > > > > > > > +\hline
> > > > > > > >
> > > > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > > > >
> > > > > > > All pci devices which belong to a single guest VM are not
> > > > > > > stopped
> > > > atomically.
> > > > > > > Hence, one device which is in freeze mode, may still receive
> > > > > > > driver notifications from other pci device,
> > > > > >
> > > > > > Device may choose to ignore those notifications, no?
> > > > > >
> > > > > > > or it may experience a read from the shared memory and get
> > > > > > > garbage
> > > > data.
> > > > > >
> > > > > > Could you give me an example for this?
> > > > > >
> > > > > Section 2.10 Shared Memory Regions.
> > > > >
> > > > > > > And things can break.
> > > > > > > Hence the stop mode, ensures that all the devices get enough
> > > > > > > chance to stop
> > > > > > themselves, and later when freezed, to not change anything internally.
> > > > > > >
> > > > > > > > > +0x2   & Freeze &
> > > > > > > > > + In this mode, the member device does not accept any
> > > > > > > > > +driver notifications,
> > > > > > > >
> > > > > > > > This is too vague. Is the device allowed to be freezed in
> > > > > > > > the middle of any virtio or PCI operations?
> > > > > > > >
> > > > > > > > For example, in the middle of feature negotiation etc. It
> > > > > > > > may cause implementation specific sub-states which can't be
> > migrated easily.
> > > > > > > >
> > > > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > > > It is passthrough device, hence hypervisor layer do not get to
> > > > > > > see sub-
> > > > state.
> > > > > > >
> > > > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > > > The device context already covers this sub-state.
> > > > > >
> > > > > > 1) driver writes driver_features
> > > > > > 2) driver sets FEATURES_OK
> > > > > >
> > > > > > 3) device receives driver_features
> > > > > > 4) device validates driver_features
> > > > > > 5) device clears FEATURES_OK
> > > > > >
> > > > > > 6) driver reads status and realizes FEATURES_OK has been cleared
> > > > > >
> > > > > > Is it valid to be frozen in the middle of the above?
> > > > > No. device mode is frozen when hypervisor is sure that no more
> > > > > access by the
> > > > guest will be done.
> > > > > What can happen between #2 and #3, is device mode may change to stop.
> > > > > And in stop mode, device context would capture #5 or #4, depending
> > > > > where is
> > > > device at that point.
> > > > >
> > > > > > >
> > > > > > > > And what's more, the above state machine seems to be virtio
> > > > > > > > specific, but you don't explain the interaction with the
> > > > > > > > device status state
> > > > > > machine.
> > > > > > > First, above is not a state machine.
> > > > > >
> > > > > > So how do readers know if a state can go to another state and when?
> > > > > >
> > > > > Not sure what you mean by reader. Can you please explain.
> > > > >
> > > > > > > Second, it is not virtio specific.
> > > > > >
> > > > > > It's somehow for sure, for example you said device context need
> > > > > > to be preserved. And as far as I see the device context is all
> > > > > > virtio specific in
> > > > patch 3.
> > > > > >
> > > > > Sure, device context is virtio specific. :) Device context will
> > > > > reflect if things changed in the stop mode.
> > > > >
> > > > > > > It is present in leading OS that has fundamental requirement
> > > > > > > to support P2P
> > > > > > devices.
> > > > > >
> > > > > > If it's PCI specific, instead of trying to do a workaround in
> > > > > > virtio, why not invent a mechanism there?
> > > > > >
> > > > > It is not a workaround in virtio.
> > > > > It is the way pci p2p devices work for which one needs to be
> > > > > receptive to
> > > > handle the interaction.
> > > > >
> > > > >
> > > > > > > Third, it is not interacting with the _actual_ device status.
> > > > > > >
> > > > > > > In "SUSPEND" patch-5, you already asked this question. I
> > > > > > > assume you asked
> > > > > > again so that this series is complete.
> > > > > > >
> > > > > > > > For example,
> > > > > > > > what happens if the driver wants to reset but the device is
> > > > > > > > in stop mode? You told me it is addressed in your series but
> > > > > > > > looks not. Once you try to describe that, you're actually
> > > > > > > > try to connect states between the
> > > > > > two state machines.
> > > > > > > >
> > > > > > > As listed in the definition of the stop mode, the device do
> > > > > > > not act on the
> > > > > > incoming writes, it only keep tracks of its internal device
> > > > > > context change as part of this.
> > > > > >
> > > > > > So only the driver notification is allowed but not config write?
> > > > > > What's the consideration for allowing driver notification?
> > > > > >
> > > > > Because for most practical purposes, peer device wants to queue
> > > > > blk, net
> > > > other requests and not do device configuration.
> > > > >
> > > > > Do you know any device configuration space which is RW?
> > > > > For net and blk I recall it as RO?
> > > >
> > > > No it isn't. Pls look at the spec if you need to check that ;)
> > > >
> > > Ok. will check. But regardless, it is fine, because when STOP is done, config
> > writes should not occur anyway.
> > 
> > 
> > i don't see a statement like this but maybe i missed it.
> >
> I am missing it, will add.



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10  7:19         ` Parav Pandit
  2023-10-10 12:41           ` Michael S. Tsirkin
@ 2023-10-11  3:14           ` Jason Wang
  2023-10-11  6:02             ` Michael S. Tsirkin
  2023-10-11 10:47             ` Parav Pandit
  1 sibling, 2 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-11  3:14 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Tue, Oct 10, 2023 at 3:19 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 10, 2023 11:21 AM
> >
> > On Mon, Oct 9, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, October 9, 2023 2:19 PM
> > > >
> > > > Adding LingShan.
> > > >
> > > Thanks for adding him.
> > >
> > > > Parav, if you want any specific people to comment, please do cc them.
> > > >
> > > Sure, will cc them in v2 as now I see there is interest in the review.
> > >
> > > > On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > > One or more passthrough PCI VF devices are ubiquitous for virtual
> > > > > machines usage using generic kernel framework such as vfio [1].
> > > >
> > > > Mentioning a specific subsystem in a specific OS may mislead the
> > > > user to think it can only work in that setup. Let's not do that,
> > > > virtio is not only used for Linux and VFIO.
> > > >
> > > Not really. it is an example in the cover letter.
> > > It is not the only use case.
> > > A use case gives a crisp clarity of what UAPI it needs to fulfil.
> > > So I will keep it. It is anyway written as one use case.
> > >
> > > > >
> > > > > A passthrough PCI VF device is fully owned by the virtual machine
> > > > > device driver.
> > > >
> > > > Is this true? Even VFIO needs to mediate PCI stuff. Or how do you
> > > > define "passthrough" here?
> > > >
> > > Other than PCI config registers and due to some legacy, msix.
> > > The "device interface" side is not mediated.
> > > The definition of passthrough here is: to not mediate device type specific
> > > and virtio specific interfaces for modern and future devices.
> >
> > Ok, but what's the difference between "device type specific" and "virtio specific
> > interfaces". Maybe an example for this?
> >
> Virtio device specific means: cvq of crypto device, cvq of net device, flow filter vqs of net device etc.
> Virtio specific interface: virtio driver notifications, virtio virtqueue and configuration mediation etc.
>
> > >
> > > > > This passthrough device controls its own device reset flow, basic
> > > > > functionality as PCI VF function level reset
> > > >
> > > > How about other PCI stuff? Or Why is FLR special?
> > > FLR is special so that readers get the clarity that FLR is also done by the
> > > guest driver; hence, the device migration commands do not interact with or
> > > depend on the FLR flow.
> >
> > It's still not clear to me how this is done.
> >
> > 1) guest starts FLR
> > 2) adminq freeze the VF
> > 3) FLR is done
> >
> > If the freezing doesn't wait for the FLR, does it mean we need to migrate to a
> > state like FLR is pending? If yes, do we need to migrate the other sub states like
> > this? If not, why?
> >
> In most practical cases #2 followed by #1 should not happen, as on the source side the expected mode change is from active to stop.

How does the hypervisor know what a guest is doing without trapping?

> But ok, since the active to freeze mode change is allowed, let's discuss the above.
>
> A device is the single synchronization point for any device reset, FLR or admin command operation.

So you agree we need synchronization? And I'm not sure I get the
meaning of synchronization point, do you mean the synchronization
between freeze/stop and virtio facilities?

> So, the migration driver do not need to wait for FLR to complete.

I'm confused, you said below that device context could be changed by FLR.

If FLR needs to clear device context, we can have a race where device
context is cleared when we are trying to read it?

> When admin cmd freeze the VF it can expect FLR_completed VF.

We need to explain why. And how about the resume? For example, is
resuming required to wait for the completion of FLR? If not, why?

> Secondly since the FLR is local to the source, intermediate sub state does not migrate.
>
> But I agree, it is worth to have the text capturing this.
>
> > >
> > > >
> > > > > and rest of the virtio device functionality such as control vq,
> > > >
> > > > What do you mean by "rest of"?
> > > >
> > > As given in the example cvq.
> > >
> > > > Which part is not controlled and why?
> > > Not controlled because, as stated, it is a passthrough device.
> > >
> > > > > config space access, data path descriptors handling.
> > > > >
> > > > > Additionally, VM live migration using a precopy method is also widely
> > used.
> > > >
> > > > Why is this mentioned here?
> > > >
> > > Huh. You should be positive for bringing clarity to the readers on
> > understanding the use case.
> > > And you seem opposite, but ok.
> > >
> > > As stated, it is for the reader to understand the use case and see how the
> > > proposed commands address the use case.
> >
> > The problem is that the hardware features should be designed for a general
> > purpose instead of a specific technology if it can. The only missing part for post
> > copy is the page fault.
> >
> Ok. The use case and requirement of member device passthrough is clear to most reviewers now.

In another thread you are saying that the PCI composition is done by
hypervisor, so passthrough is really confusing at least for me.

> So I will remove it from commit log.
>
> > >
> > > > >
> > > > > To support a VM live migration for such passthrough virtio
> > > > > devices, the owner PCI PF device administers the device migration flow.
> > > >
> > > > Well, if this is specific only to PCI SR-IOV, I'd move it to the PCI transport
> > part.
> > > > But I guess not.
> > > We took the decision to not do so, for other group commands as well.
> > > After Michael's suggestion we moved it to group commands.
> > > So I will not debate this further.
> > >
> > > >
> > > > >
> > > > > This patch introduces the basic theory of operation which
> > > > > describes the flow and supporting administration commands.
> > > > >
> > > > > [1]
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/uapi/linux/vfio.h?h=v6.1.47
> > > > >
> > > > > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > > > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > > > ---
> > > > >  admin-cmds-device-migration.tex | 94 +++++++++++++++++++++++++++++++++
> > > > >  admin.tex                       |  1 +
> > > > >  2 files changed, 95 insertions(+)
> > > > >  create mode 100644 admin-cmds-device-migration.tex
> > > > >
> > > > > diff --git a/admin-cmds-device-migration.tex b/admin-cmds-device-migration.tex
> > > > > new file mode 100644
> > > > > index 0000000..f839af4
> > > > > --- /dev/null
> > > > > +++ b/admin-cmds-device-migration.tex
> > > > > @@ -0,0 +1,94 @@
> > > > > +\subsubsection{Device Migration}\label{sec:Basic Facilities of a
> > > > > +Virtio Device / Device groups / Group administration commands /
> > > > > +Device Migration}
> > > > > +
> > > > > +In some systems, there is a need to migrate a running virtual
> > > > > +machine from one to another system. A running virtual machine has
> > > > > +one or more passthrough virtio member devices attached to it. A
> > > > > +passthrough device is entirely operated by the guest virtual
> > > > > +machine. For example, with the SR-IOV group type, group member
> > > > > +(VF) may undergo virtio device initialization and reset flow
> > > >
> > > > What do you mean by "reset flow"? It looks not like a terminology
> > > > defined in the PCI spec. And Google gives me nothing about this.
> > > >
> > > "reset flow" = virtio specification section 2.4 Device Reset flow.
> >
> > My git repo show it's still called "device reset" and I see you use "FLR flow"
> > which is also not very clear to me.
> >
> Ok. I assume "reset flow" is clear to you now that it points to section 2.4.
> This section is not normative section, so using an extra word like "flow" does not confuse anyone.
> I will link to the section anyway.

Probably, but you mention FLR flow as well.

>
> > >
> > > > > and may also undergo PCI function level
> > > > > +reset(FLR) flow.
> > > >
> > > > Why is only FLR special here? I've asked FRS but you ignore the question.
> > > >
> > > FLR is special to bring clarity that guest owns the VF doing FLR, hence
> > hypervisor cannot mediate any registers of the VF.
> >
> > It's not about mediation at all, it's about how the device can implement what
> > you want here correctly.
> >
> > See my above question.
> >
> Ok. it is clear that live migration commands cannot stay on the member device because the member device can undergo device reset and FLR flows owned by the guest.

I disagree, hypervisors can emulate FLR and never send FLR to real devices.

> (and hypervisor is not involved in these two flows, hence the admin command interface is designed such that it can fulfil above requirements).
>
> Theory of operation brings out this clarity. Please notice that it is in introductory section with an example.
> Not normative line.
>
> > >
> > > > > Such flows must comply to the PCI standard and also
> > > > > +virtio specification;
> > > >
> > > > This seems unnecessary and obvious as it applies to all other PCI
> > > > and virtio functionality.
> > > >
> > > Great. But your comment is contradictory.
> > >
> > > > What's more, for the things that need to be synchronized, I don't
> > > > see any descriptions in this patch. And if it doesn't need, why?
> > > With which operation should it be synchronized and why?
> > > Can you please be specific?
> >
> > See my above question regarding FLR. And it may have others which I haven't
> > had time to audit.
> >
> Ok. When you get a chance to audit, let's discuss it at that time.

Well, I'm not the author of this series, it should be your job
otherwise it would be too late.

For example, how does power management interact with freeze/stop?

>
> > >
> > > It is not written in this series, because we believe it must not be synchronized
> > as it is fully controlled by the guest.
> > >
> > > >
> > > > > at the same time such flows must not obstruct
> > > > > +the device migration flow. In such a scenario, a group owner
> > > > > +device can provide the administration command interface to
> > > > > +facilitate the device migration related operations.
> > > > > +
> > > > > +When a virtual machine migrates from one hypervisor to another
> > > > > +hypervisor, these hypervisors are named as source and destination
> > > > hypervisor respectively.
> > > > > +In such a scenario, a source hypervisor administers the member
> > > > > +device to suspend the device and preserves the device context.
> > > > > +Subsequently, a destination hypervisor administers the member
> > > > > +device to setup a device context and resumes the member device.
> > > > > +The source hypervisor reads the member device context and the
> > > > > +destination hypervisor writes the member device context. The
> > > > > +method to transfer the member device context from the source to
> > > > > +the destination hypervisor is
> > > > outside the scope of this specification.
> > > > > +
> > > > > +The member device can be in any of the three migration modes. The
> > > > > +owner driver sets the member device in one of the following modes
> > > > > +during
> > > > device migration flow.
> > > > > +
> > > > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name &
> > > > > +Description \\ \hline \hline
> > > > > +0x0   & Active &
> > > > > +  It is the default mode after instantiation of the member
> > > > > +device. \\
> > > >
> > > > I don't think we ever define "instantiation" anywhere.
> > > >
> > > Well a transport has implicit definition of the instantiation already.
> > > May be a text can be added, but don’t see a value in duplicating PCI spec
> > here.
> >
> > Ok, maybe something like "transport specific instantiation"
> >
> Ok. that’s a good text. I will change to it.
>
> > >
> > > > > +\hline
> > > > > +0x1   & Stop &
> > > > > + In this mode, the member device does not send any notifications,
> > > > > +and it does not access any driver memory.
> > > >
> > > > What's the meaning of "driver memory"?
> > > >
> > > May be guest memory? Or do you suggest a better naming for the memory
> > allocated by the guest driver?
> >
> > Virtqueue?
> >
> Virtqueue and any memory referred by the virtqueue.
>
> This is good text, I will change to it.
>
> > >
> > > > And stop seems to be a source of inflight buffers.
> > > >
> > > I didn’t follow it.
> > > If you mean without stop there are no inflight buffer, then I don’t agree.
> > > We don’t want to violate the spec by having descriptors with zero size
> > returned.
> > > Stop is not the source of inflight descriptors.
> >
> > I think not since you forbid access to the used ring here. So even if the buffer
> > were processed by the device it can't be added back to the used ring thus
> > became inflight ones.
> >
> > >
> > > There are inflight descriptors with the device that are not yet returned to the
> > driver, and device wont return them as zero size wrong completions.
> > >
> > > > > + The member device may receive driver notifications in this mode,
> > > >
> > > > What's the meaning of "receive"? For example if the device can still
> > > > process buffers, "stop" is not accurate.
> > > >
> > > Receive means, driver can send the notification as PCIe TLP that device may
> > receive as incoming PCIe TLP.
> >
> > Ok, so this is the transport level. But the device can keep processing the queue?
> >
> Device cannot process the queue because it does not initiate any read/write towards the virtqueue.

Reads and writes only result in driver-noticeable behaviour; their absence
doesn't mean the device can't process the buffers. For example, devices can
keep processing available buffers and turn them into inflight ones.

>
> > >
> > > In "stop" mode, the device wont process descriptors.
> >
> > If the device won't process descriptors, why still allow it to receive notifications?
> Because notification may still arrive and if the device may update any counters as part of

Which counters did you mean here?

> it which needs to be migrated or store the received notification.
>
> > Or does it really matter if the device can receive or not here?
> >
> From device point of view, the device is given the chance to update its device context as part of notifications or access to it.

This is in conflict with what you said above " Device cannot process
the queue ..."

Maybe you can give a concrete example.

>
> > >
> > > > > + the member device context
> > > >
> > > > I don't think we define "device context" anywhere.
> > > >
> > > It is defined further in the description.
> >
> > Like this?
> >
> > """
> >  +The member device has a device context which the owner driver can  +either
> > read or write. The member device context consist of any device  +specific data
> > which is needed by the device to resume its operation  +when the device mode
> > """
> >
> Yes.
> Further patch-3 adds the device context and also add the link to it in the theory of operation section so reader can read more detail about it.
>
> > "Any" is probably too hard for vendors to implement. And in patch 3 I only see
> > virtio device context. Does this mean we don't need transport
> > (PCI) context at all? If yes, how can it work?
> >
> Right. PCI member device is present at source and destination with its layout, only the virtio device context is transferred.
> Which part cannot work?

It is explained in another thread where you are saying the PCI
requires mediation. I think any author should not ignore such
important assumptions in both the change log and the patch.

And again, the more I review, the more I see how narrowly this series can be used:

1) Only works for SR-IOV member device like VF
2) Mediate PCI but not virtio which is tricky
3) Can only work for a specific BAR/capability register layout

Only 1) is described in the change log.

The other important assumptions like 2) and 3) are not documented
anywhere. And this patch never explains why 2) and 3) are needed or why
it can be used for subsystems other than VFIO/Linux.

>
> > >
> > > > >and device configuration space may change. \\
> > > > > +\hline
> > > >
> > > > I still don't get why we need a "stop" state in the middle.
> > > >
> > > All pci devices which belong to a single guest VM are not stopped atomically.
> > > Hence, one device which is in freeze mode, may still receive driver
> > > notifications from other pci device,
> >
> > Device may choose to ignore those notifications, no?
> >
> > > or it may experience a read from the shared memory and get garbage data.
> >
> > Could you give me an example for this?
> >
> Section 2.10 Shared Memory Regions.

How can it experience a read in this case?

Btw, shared regions are tricky for hardware.

>
> > > And things can break.
> > > Hence the stop mode, ensures that all the devices get enough chance to stop
> > themselves, and later when freezed, to not change anything internally.
> > >
> > > > > +0x2   & Freeze &
> > > > > + In this mode, the member device does not accept any driver
> > > > > +notifications,
> > > >
> > > > This is too vague. Is the device allowed to be freezed in the middle
> > > > of any virtio or PCI operations?
> > > >
> > > > For example, in the middle of feature negotiation etc. It may cause
> > > > implementation specific sub-states which can't be migrated easily.
> > > >
> > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > It is passthrough device, hence hypervisor layer do not get to see sub-state.
> > >
> > > Not sure why you comment, why it cannot be migrated easily.
> > > The device context already covers this sub-state.
> >
> > 1) driver writes driver_features
> > 2) driver sets FEATURES_OK
> >
> > 3) device receives driver_features
> > 4) device validates driver_features
> > 5) device clears FEATURES_OK
> >
> > 6) driver reads status and realizes FEATURES_OK has been cleared
> >
> > Is it valid to be frozen in the middle of the above?
> No. device mode is frozen when hypervisor is sure that no more access by the guest will be done.

How? You don't trap, so 1) and 2) are posted. How can the hypervisor know
if there are inflight transactions to any registers?

> What can happen between #2 and #3, is device mode may change to stop.

Why can't it be frozen in this case? It's really hard to deduce why it
can't just from your above descriptions.

Even if it had, is it even possible to list all the places where
freezing is prohibited? We don't want to end up with a spec that is
hard to implement or leave the vendor to figure out those tricky
parts.

> And in stop mode, device context would capture #5 or #4, depending where is device at that point.
>
> > >
> > > > And what's more, the above state machine seems to be virtio
> > > > specific, but you don't explain the interaction with the device status state
> > machine.
> > > First, above is not a state machine.
> >
> > So how do readers know if a state can go to another state and when?
> >
> Not sure what you mean by reader. Can you please explain.

The people who read virtio spec.

>
> > > Second, it is not virtio specific.
> >
> > It's somehow for sure, for example you said device context need to be
> > preserved. And as far as I see the device context is all virtio specific in patch 3.
> >
> Sure, device context is virtio specific. :)
> Device context will reflect if things changed in the stop mode.
>
> > > It is present in leading OS that has fundamental requirement to support P2P
> > devices.
> >
> > If it's PCI specific, instead of trying to do a workaround in virtio, why not invent
> > a mechanism there?
> >
> It is not a workaround in virtio.
> It is the way pci p2p devices work for which one needs to be receptive to handle the interaction.
>
>
> > > Third, it is not interacting with the _actual_ device status.
> > >
> > > In "SUSPEND" patch-5, you already asked this question. I assume you asked
> > again so that this series is complete.
> > >
> > > > For example,
> > > > what happens if the driver wants to reset but the device is in stop
> > > > mode? You told me it is addressed in your series but looks not. Once
> > > > you try to describe that, you're actually try to connect states between the
> > two state machines.
> > > >
> > > As listed in the definition of the stop mode, the device do not act on the
> > incoming writes, it only keep tracks of its internal device context change as part
> > of this.
> >
> > So only the driver notification is allowed but not config write? What's the
> > consideration for allowing driver notification?
> >
> Because for most practical purposes, peer device wants to queue blk, net other requests and not do device configuration.

You forbid the device to process the queue but only allow the
notification. How can the device queue those requests? The device can
just do the available buffer check after resume, then it's all fine.

>
> Do you know any device configuration space which is RW?
> For net and blk I recall it as RO?

For example, WCE. What's more important, the spec allows config space
to be RW, so even if there's no examples before, it doesn't mean we
won't have a RW in the future.

>
> > Let me ask differently, similar to FLR, what happens if the driver wants a virtio
> > reset but the hypervisor wants to stop or freeze?
> >
> The device would respond to stop/freeze request when it has internally started the reset, as device is the single synchronization point which knows how to handle both in parallel.

Let's define the synchronization point first. And it demonstrates that, at
least, devices need to synchronize between freeze/stop and the virtio
device status machine, which is not as easy as what is done in this
patch.

>
> > > We would enrich the device context for this, but no need to connects the
> > admin mode controlled by the owner device with operational state
> > (device_status) owned by the member device.
> > >
> > > > > + it ignores any device configuration space writes,
> > > >
> > > > How about read and the device configuration changes?
> > > >
> > > As listed, device do not have any changes.
> > > So device configuration change cannot occur.
> >
> > It's not necessarily caused by config write, it could be things like link status or
> > geometry changes that are initiated from the device.
> >
> I understand it. Link status was one example, you listed other examples too.
> The point is, when in freeze mode, the member device is frozen, hence, device won't initiate those changes.
>
> > >
> > > The device requirements cover this content more explicitly:
> > >
> > > For the SR-IOV group type, regardless of the member device mode, all
> > > the PCI transport level registers MUST be always accessible and the
> > > member device MUST function the same way for all the PCI transport level
> > registers regardless of the member device mode.
> > >
> > > > > + the device do not have any changes in the device context. The
> > > > > + member device is not accessed in the system through the virtio
> > interface.
> > > > > + \\
> > > >
> > > > But accessible via PCI interface?
> > > >
> > > Yes, as usual.
> > >
> > > > For example, what happens if we want to freeze during FLR? Does the
> > > > hypervisor need to wait for the FLR to be completed?
> > > >
> > > Hypervisor do not need wait for the FLR to be completed.
> >
> > So does FLR change device context?
> Yes.

So this implies the freeze needs to wait for FLR otherwise device
context may change.

>
> >
> > >
> > > > > +\hline
> > > > > +\hline
> > > > > +0x03-0xFF   & -    & reserved for future use \\
> > > > > +\hline
> > > > > +\end{tabularx}
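For readers following the table, here is a minimal sketch of the three mode
values and of the mode changes that the patch text and the replies in this
thread mention (Active to Stop, Stop to Freeze, Active to Freeze, and Stop or
Freeze back to Active). The identifier names are hypothetical, and rejecting
every other change (for example Freeze to Stop) is an assumption of this
sketch, not something the series states:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mode values as listed in the quoted table; 0x03-0xFF are reserved. */
enum member_dev_mode {
    MEMBER_DEV_MODE_ACTIVE = 0x0,
    MEMBER_DEV_MODE_STOP   = 0x1,
    MEMBER_DEV_MODE_FREEZE = 0x2,
};

/* True if the value is one of the three defined (non-reserved) modes. */
static bool member_dev_mode_valid(uint8_t mode)
{
    return mode <= MEMBER_DEV_MODE_FREEZE;
}

/*
 * True if the thread mentions a change from 'from' to 'to'.
 * Assumption: anything not mentioned is rejected here.
 */
static bool member_dev_mode_change_ok(uint8_t from, uint8_t to)
{
    if (!member_dev_mode_valid(from) || !member_dev_mode_valid(to))
        return false;

    switch (from) {
    case MEMBER_DEV_MODE_ACTIVE:
        return to == MEMBER_DEV_MODE_STOP || to == MEMBER_DEV_MODE_FREEZE;
    case MEMBER_DEV_MODE_STOP:
        return to == MEMBER_DEV_MODE_FREEZE || to == MEMBER_DEV_MODE_ACTIVE;
    case MEMBER_DEV_MODE_FREEZE:
        return to == MEMBER_DEV_MODE_ACTIVE;
    default:
        return false;
    }
}
```

Whether the specification should spell out such a transition list at all is
exactly what is being debated in this thread.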
> > > > > +
> > > > > +When the owner driver wants to stop the operation of the device,
> > > > > +the owner driver sets the device mode to \field{Stop}. Once the
> > > > > +device is in the \field{Stop} mode, the device does not initiate
> > > > > +any notifications or does not access any driver memory. Since the
> > > > > +member driver may be still active which may send further driver
> > > > > +notifications to the device, the device context may be updated.
> > > > > +When the member driver has stopped accessing the device, the
> > > > > +owner driver sets the device to \field{Freeze} mode indicating to
> > > > > +the device that no more driver access occurs. In the
> > > > > +\field{Freeze} mode, no more changes occur in the device context.
> > > > > +At this point, the device ensures that
> > > > there will not be any update to the device context.
> > > >
> > > > What is missed here are:
> > > >
> > > > 1) it is a virtio specific states or not
> > > It is not.
> > >
> > > > 2) if it is a virtio specific state, if or how to synchronize with
> > > > transport specific interfaces and why
> > > > 3) can active go directly to freeze and why
> > > >
> > > Yes. don’t see a reason to not allow it.
> > > Active to freeze mode can change is useful on the destination side, where
> > destination hypervisor knows for sure that there is no other entity accessing the
> > device.
> > > And it needs to setup the device context, it received from the source side.
> > > So setting freeze mode can be done directly.
> > >
> > > > > +
> > > > > +The member device has a device context which the owner driver can
> > > > > +either read or write. The member device context consist of any
> > > > > +device specific data which is needed by the device to resume its
> > > > > +operation when the device mode
> > > >
> > > > This is too vague. There're states that are not suitable for cmd/queue for
> > sure.
> > > > I'd split it into
> > > >
> > > > 1) common states: virtqueue, dirty pages
> > > > 2) device specific states: defined be each device
> > > >
> > > This is theory of operation section. So it capturing such details.
> > > Actual device context definition is outside of theory, and precise states of
> > virtqueue, device specific, etc are in it.
> >
> > See my comment above regarding to the device context.
> >
> I replied above, device context link is added in the patch-3 in the theory of operation.
> So reader gets the complete view.
>
> > >
> > > > > +is changed from \field{Stop} to \field{Active} or from
> > > > > +\field{Freeze} to \field{Active}.
> > > > > +
> > > > > +Once the device context is read, it is cleared from the device.
> > > >
> > > > This is horrible, it means we can't easily
> > > >
> > > > 1) re-try the migration
> > > > 2) recover from migration failure
> > > >
> > > Can you please explain the flow?
> >
> > When migration fails, management can choose to resume the device(VM) on
> > the source.
> >
> ok. This should be possible as the management which has the device context, it can restore it on the source
> and move the device mode to active.
>
> > If the state were cleared, it means there's not simple way to resume the device
> > but restoring the whole context.
> >
> Yes, as you say, by restoring the whole context will suffice this corner/rare case scenario.
>
> > What's the consideration for such clearing?
> >
> There are two considerations.
> 1.  If one does not clear, till how long should it be kept on the device?

Until virtio reset, this is how virtio works now. I've pointed out
that clearing may cause extra trouble when trying to resume, but you
haven't told me what's wrong with keeping it.

> 2. device context returns incremental value from the previous read. So, it needs to clear it.

I don't understand here. This is not the case for most of the devices.

>
> > > And which software stack may find this useful?
> > > Is there any existing software that can utilize it?
> >
> > Libvirt.
> >
> Does libvirt restore on migration failure?

Yes.

>
> > > Why that device context present with the software vanished, in your
> > assumption, if it is?
> > >
> > > > > Typically, on
> > > > > +the source hypervisor, the owner driver reads the device context
> > > > > +once when the device is in \field{Active} or \field{Stop} mode
> > > > > +and later once the member device is in \field{Freeze} mode.
> > > >
> > > > Why need the read while device context could be changed? Or is the
> > > > dirty page part of the device context?
> > > >
> > > It is not part of the dirty page.
> > > It needs to read in the active/stop mode, so that it can be shared with
> > destination hypervisor, which will pre-setup the complex context of the device,
> > while it is still running on the source side.
> >
> > Is such a method used by any hypervisor?
> Yes. qemu which uses vfio interface uses it.

Ok, such software technology could be used for all types of devices, I
don't see any advantages to mention it here unless it's unique to
virtio.

>
> >
> > >
> > > > > +
> > > > > +Typically, the device context is read and written one time on the
> > > > > +source and the destination hypervisor respectively once the
> > > > > +device is in \field{Freeze} mode. On the destination hypervisor,
> > > > > +after writing the device context, when the device mode set to
> > > > > +\field{Active}, the device uses the most recently set device
> > > > > +context and resumes the device
> > > > operation.
> > > >
> > > > There's no context sequence, so this is obvious. It's the semantic
> > > > of all other existing interfaces.
> > > >
> > > Can you please say which existing interfaces you mean here?
> >
> > For any common cfg member. E.g queue_addr.
> >
> > The driver wrote 100 different values to queue_addr and the device used the
> > value written last time.
> >
> o.k. I don’t see any problem in stating what is done, which is less vague. 😊
>
> > >
> > > > > +
> > > > > +In an alternative flow, on the source hypervisor the owner driver
> > > > > +may choose to read the device context first time while the device
> > > > > +is in \field{Active} mode and second time once the device is in
> > > > > +\field{Freeze}
> > > > mode.
> > > >
> > > > Who is going to synchronize the device context with possible
> > > > configuration from the driver?
> > > >
> > > Not sure I understand the question.
> > > If I understand you right, do you mean that, When configuration change
> > > is done by the guest driver, how does device context change?
> > >
> >
> > Yes.
> >
> > > If so, device context reading will reflect the new configuration.
> >
> > How do you do that? For example:
> >
> > static inline void vp_iowrite64_twopart(u64 val,
> >                                         __le32 __iomem *lo,
> >                                         __le32 __iomem *hi) {
> >         vp_iowrite32((u32)val, lo);
> >         vp_iowrite32(val >> 32, hi);
> > }
> >
> > Is it ok to be freezed in the middle of two vp_iowrite()?
> >
> Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG section captures the partial value.

There's no way for the device to know whether or not it's a partial
value, is there?
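
To make the concern concrete, here is a minimal sketch, with hypothetical
names, that models a 64-bit field written as two 32-bit halves the way
vp_iowrite64_twopart() does, and shows what a context snapshot taken between
the two writes would contain. It is an illustration only and not code from
this series:

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit field exposed as two 32-bit registers. */
struct split_reg64 {
    uint32_t lo;
    uint32_t hi;
};

/* First half of the two-part write: low 32 bits only. */
static void split_reg64_write_lo(struct split_reg64 *r, uint64_t val)
{
    r->lo = (uint32_t)val;
}

/* Second half of the two-part write: high 32 bits only. */
static void split_reg64_write_hi(struct split_reg64 *r, uint64_t val)
{
    r->hi = (uint32_t)(val >> 32);
}

/* What a snapshot of the register pair would record at this instant. */
static uint64_t split_reg64_snapshot(const struct split_reg64 *r)
{
    return ((uint64_t)r->hi << 32) | r->lo;
}
```

Starting from an old value of 0x2222222211111111 and writing
0xAAAAAAAABBBBBBBB, a snapshot taken after only the low write reads
0x22222222BBBBBBBB, which is neither the old nor the new value, and nothing
in the snapshot itself marks it as partial.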

>
> > >
> > > > > Similarly, on the
> > > > > +destination hypervisor writes the device context first time while
> > > > > +the device is still running in \field{Active} mode on the source
> > > > > +hypervisor and writes the device context second time while the
> > > > > +device is in
> > > > \field{Freeze} mode.
> > > > > +This flow may result in very short setup time as the device
> > > > > +context likely have minimal changes from the previously written device
> > context.
> > > >
> > > > Is the hypervisor who is in charge of doing the comparison and
> > > > writing only the delta?
> > > >
> > > The spec commands allow to do so. So possibility exists from spec wise.
> >
> > There are various optimizations for migration for sure, I don't think mentioning
> > any specific one is good.
> >
> The text is informative text similar to,
>
> " However, some devices benefit from the ability to find out the amount of available data in the queue without
> accessing the virtqueue in memory"
>
> " To help with these optimizations, when VIRTIO_F_NOTIFICATION_DATA has been negotiated".
>
> Is this the only optimization in virtio? No, but we still mention the rationale of why it exists.

The above is a good example as it explains that VIRTIO_F_NOTIFICATION_DATA
is the only way to do this without accessing the virtqueue. But this is not
the case for migration. You said it's just a possibility and not a must,
which is not the case for VIRTIO_F_NOTIFICATION_DATA.

Thanks


> As long as the rationale do not confuse the reader, and adds the value explaining how things work, it is fine to add.
> Which is what above few lines did.
> So let's keep it.
>
> The easiest is to cut out the whole theory of operation and just write commands like how RSS command did, without even writing a single line about RSS.
> I think we can do better explanation than that for new things we add.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-11  3:14           ` Jason Wang
@ 2023-10-11  6:02             ` Michael S. Tsirkin
  2023-10-11 10:47             ` Parav Pandit
  1 sibling, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-11  6:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Wed, Oct 11, 2023 at 11:14:14AM +0800, Jason Wang wrote:
> > > > > What's more, for the things that need to be synchronized, I don't
> > > > > see any descriptions in this patch. And if it doesn't need, why?
> > > > With which operation should it be synchronized and why?
> > > > Can you please be specific?
> > >
> > > See my above question regarding FLR. And it may have others which I haven't
> > > had time to audit.
> > >
> > Ok. when you get chance to audit, lets discuss that time.
> 
> Well, I'm not the author of this series, it should be your job
> otherwise it would be too late.
> 
> For example, how is the power management interaction with the freeze/stop?

Right. I think in the same way this allows passthrough BAR access
it should allow passthrough config access, that is, document what
exactly the requirement is. Is just atomicity enough? And should
we require something from the driver too?

-- 
MST



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-10  9:58           ` Parav Pandit
@ 2023-10-11 10:07             ` Zhu, Lingshan
  2023-10-11 10:54               ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-11 10:07 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/10/2023 5:58 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Tuesday, October 10, 2023 2:22 PM
>>
>> On 10/9/2023 10:30 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, October 9, 2023 4:04 PM
>>>>
>>>> On 10/8/2023 7:41 PM, Michael S. Tsirkin wrote:
>>>>> On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
>>>>>> Define the device context and its fields for purpose of device
>>>>>> migration. The device context is read and written by the owner
>>>>>> driver on source and destination hypervisor respectively.
>>>>>>
>>>>>> Device context fields will experience a rapid growth post this
>>>>>> initial version to cover many details of the device.
>>>>>>
>>>>>> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
>>>>>> Signed-off-by: Parav Pandit <parav@nvidia.com>
>>>>>> Signed-off-by: Satananda Burla <sburla@marvell.com>
>>>>>> ---
>>>>>> changelog:
>>>>>> v0->v1:
>>>>>> - enrich device context to cover feature bits, device configuration
>>>>>>      fields
>>>>>> - corrected alignment of device context fields
>>>>>> ---
>>>>>>     content.tex        |   1 +
>>>>>>     device-context.tex | 142
>>>> +++++++++++++++++++++++++++++++++++++++++++++
>>>>>>     2 files changed, 143 insertions(+)
>>>>>>     create mode 100644 device-context.tex
>>>>>>
>>>>>> diff --git a/content.tex b/content.tex index 0a62dce..2698931
>>>>>> 100644
>>>>>> --- a/content.tex
>>>>>> +++ b/content.tex
>>>>>> @@ -503,6 +503,7 @@ \section{Exporting Objects}\label{sec:Basic
>>>>>> Facilities
>>>> of a Virtio Device / Expo
>>>>>>     UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>>>>>>
>>>>>>     \input{admin.tex}
>>>>>> +\input{device-context.tex}
>>>>>>
>>>>>>     \chapter{General Initialization And Device
>>>>>> Operation}\label{sec:General Initialization And Device Operation}
>>>>>>
>>>>>> diff --git a/device-context.tex b/device-context.tex new file mode
>>>>>> 100644 index 0000000..5611382
>>>>>> --- /dev/null
>>>>>> +++ b/device-context.tex
>>>>>> @@ -0,0 +1,142 @@
>>>>>> +\section{Device Context}\label{sec:Basic Facilities of a Virtio
>>>>>> +Device / Device Context}
>>>>>> +
>>>>>> +The device context holds the information that a owner driver can
>>>>>> +use to setup a member device and resume its operation. The device
>>>>>> +context of a member device is read or written by the owner driver
>>>>>> +using administration commands.
>>>>>> +
>>>>>> +\begin{lstlisting}
>>>>>> +struct virtio_dev_ctx_field_tlv {
>>>>>> +        le32 type;
>>>>>> +        le32 reserved;
>>>>>> +        le64 length;
>>>>>> +        u8 value[];
>>>>>> +};
>>>>>> +
>>>>>> +struct virtio_dev_ctx {
>>>>>> +        le32 field_count;
>>>>>> +        struct virtio_dev_ctx_field_tlv fields[]; };
>>>>>> +
>>>>>> +\end{lstlisting}
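For reference, a minimal sketch of how a reader of the device context might
look up one field in the TLV layout quoted above. It assumes the context is a
packed little-endian byte stream in which field_count is followed directly by
the fields, each being a 16-byte header (type, reserved, length) plus length
bytes of value with no padding. The quoted listing does not state packing or
padding rules, so that layout, like the function and macro names, is an
assumption of this sketch:

```c
#include <stddef.h>
#include <stdint.h>

#define DEV_CTX_COUNT_LEN   4u   /* le32 field_count */
#define DEV_CTX_TLV_HDR_LEN 16u  /* le32 type + le32 reserved + le64 length */

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a little-endian 64-bit value from a byte buffer. */
static uint64_t get_le64(const uint8_t *p)
{
    return (uint64_t)get_le32(p) | ((uint64_t)get_le32(p + 4) << 32);
}

/*
 * Find the first field of type 'want'.
 * Returns 0 and sets *val / *vlen if found, 1 if not found,
 * and -1 if the buffer is truncated or a length runs past its end.
 */
static int dev_ctx_find(const uint8_t *buf, size_t len, uint32_t want,
                        const uint8_t **val, uint64_t *vlen)
{
    uint32_t count, i;
    size_t off = DEV_CTX_COUNT_LEN;

    if (len < DEV_CTX_COUNT_LEN)
        return -1;
    count = get_le32(buf);

    for (i = 0; i < count; i++) {
        uint32_t type;
        uint64_t length;

        if (len - off < DEV_CTX_TLV_HDR_LEN)
            return -1;
        type = get_le32(buf + off);
        length = get_le64(buf + off + 8);  /* skip type and reserved */
        off += DEV_CTX_TLV_HDR_LEN;

        if (length > len - off)
            return -1;
        if (type == want) {
            *val = buf + off;
            *vlen = length;
            return 0;
        }
        off += (size_t)length;
    }
    return 1;
}
```

Every length is checked against the remaining buffer before it is used, so a
malformed context is rejected instead of being read past its end.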
>>>> so this still doesn't work for nested
>>> In one use case of nesting, that we came across is:
>>> there is large host_VM which is hosting another guest_VMs.
>>> In such case, the owner PF is passthrough to this host_VM and current
>> proposed scheme continue to function for nesting as well for nested
>> guest_VMs.
>> The system admin can choose only passthrough some of the devices for nested
>> guests, so passthrough the PF to L1 guest is not a good idea, because there can
>> be many devices still work for the host or L1.
> Possible. One size does not fit all.
> What I expressed is most common scenarios that user care about.
don't block existing usecases, don't break the userspace, nested is common.
>
>>> In second use case, where one want to bind only one member device to
>>> one VM, I think same plumbing can be extended to have another VF, to take
>> the role of migration device instead of owner device.
>>> I don’t see a good way to passthrough and also do in-band migration without
>> lot of device specific trap and emulation.
>>> I also don’t know the cpu performance numbers with 3 levels of nested page
>> table translation which to my understanding cannot be accelerated by the
>> current cpu.
>> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA and so
>> on, there can be performance overhead, but can be done.
>>
>> So admin vq migration still don't work for nested, this is surely a blocker.
> In specific case of member devices are located at different nest level, it does not.
so you got the point, so this series should not be merged.
>
> What prevents you from having a peer VF take the role of migration driver?
> Basically, what I am proposing is, connect two VFs to the L1 guest. One VF is migration driver, one VF is passthrough to L2 guest.
> And same scheme works.
A peer VF? A management VF? That still breaks the existing use case. And how
do you transfer ownership of the L2 VF from the PF to the L1 VF?
>
> On the other hand,
> Many parts of the cpu subsystem such as PML, page tables do not have N level nesting support either.
Page tables could be emulated, as shown to you before: just PA to VA,
nested PA to nested VA.
> They all work on top of emulation and pay the price for emulation when nesting is done.
> May be that is the first version for virtio too.
There is performance overhead, but it can be done.
>
> I frankly feel that nesting support requires industry level eco system support not just in virtio.
> Virtio attempting to focus on nested and having nearly same level performance as bare metal seems farfetched.
> Maybe I am wrong, as we have not seen such high perf nested env even with sw based device.
>
> What can be possibly done is,
> 1. What admin commands are useful from this series that can be useful for nesting?
> 2. What admin commands from current series needs extension for nesting?
> 3. What admin commands do not work at all for nesting, and hence, need to have new commands.
>
> If we can focus on those, maybe we can find common approach that cater to both commands.
virtio supports nesting now; don't let your admin vq LM break this.
>
>>> Do you know how does it work for Intel x86_64?
>>> Can it do > 2 level of nested page tables? If no, what is the perf characteristics
>> to expect?
>> of course that can be done, Page table is not a problem, there are soft mmu
>> emulation and viommu, through performance overhead.
> Due to the performance overheads, I really doubt any cloud operator would use passthrough virtio device for any sensible workload.
> But you may know already how nested performance looks like that may be acceptable to users.
Many tenants run their nested cluster. Don't break this.



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10  9:40             ` Parav Pandit
@ 2023-10-11 10:25               ` Zhu, Lingshan
  2023-10-11 11:43                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-11 10:25 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/10/2023 5:40 PM, Parav Pandit wrote:
> Hi Lingshan,
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Tuesday, October 10, 2023 2:28 PM
>>
>> On 10/10/2023 1:21 AM, Parav Pandit wrote:
>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>> Sent: Monday, October 9, 2023 9:50 PM
>>>>>>> One or more passthrough PCI VF devices are ubiquitous for virtual
>>>>>>> machines usage using generic kernel framework such as vfio [1].
>>>>>> Mentioning a specific subsystem in a specific OS may mislead the
>>>>>> user to think it can only work in that setup. Let's not do that,
>>>>>> virtio is not only used for Linux and VFIO.
>>>>> This is just one example on how these commands are useful.
>>>>> It can be useful in more ways too in more OSes too.
>>>>> I will drop from the patch commit log and keep as information
>>>>> purpose in
>>>> cover letter.
>>>>> Would that work for you?
>>>>>
>>>>> I don’t have any strong opinion to keep it or remove it as most
>>>>> stakeholders
>>>> has the clear view of requirements now.
>>>>> Let me know.
>>>> So some people use VFs with VFIO. Hence the module name.  This
>>>> sentence by itself seems to have zero value for the spec. Just drop it.
>>> Ok. Will drop.
>> So why not build your admin vq live migration on our config space solution, get
>> out of the troubles, to make your life easier?
>>
> This question of yours is completely unrelated to this reply, or you misunderstood what dropping the commit log means.
If you can rebase admin vq LM on our basic facilities, I think you don't
need to talk about vfio in the first place,
so I ask you to reconsider Jason's proposal.
>
> Dropping link to vfio does not drop the requirement.
> I am ok to drop because requirements are clear of passthrough of member device.
> Vfio is not a trouble at all.
> Admin command is not a trouble either.
>
> The pure technical reason is: all the functionalities proposed cannot be done in any other existing way.
> Why? For below reasons.
> 1. device context, and write records (aka dirty page addresses) is huge which cannot be shared using config registers at scale of 4000 member devices
Dirty page tracking will be implemented in V2; I actually have the
patch right now.
Inflight descriptor tracking will be implemented by Eugenio in V2.
There is no scale problem, as I have repeated many times: these are
per-device basic facilities. Just migrate each VF by its own facility,
so there is no 4000-member-device problem; this is not per PF.

The device context can be read from config space or trapped, like the shadow
control vq, which is already done; that is basic virtualization.
If you want to migrate device context, you need to specify the device
context for every type of device. Net may be easy, but how do you see virtio-fs?
And are we migrating stateless devices or not? How do you migrate
virtio-fs?
> 2. sharing such large context and write addresses in parallel for multiple devices cannot be done using single register file
see above
> 3. These registers cannot be residing in the VF because VF can undergo FLR, and device reset which must clear these registers

Do you mean you want to audit all PCI features? On FLR the device is reset; do you expect a device to remember anything after FLR?
Do you want to trap FLR? Why?

Why would FLR block or conflict with live migration?

> 4. When VF does the DMA, all dma occurs in the guest address space, not in hypervisor space; any flr and device reset must stop such dma.
> And device reset and flr are controlled by the guest (not mediated by hypervisor).
If the guest resets the device, that is a totally reasonable operation, and
the guest owns the risk, right?
And still, do you want to audit every PCI feature? At least you didn't
do that in your series.
For migration, you know the hypervisor takes ownership of the device
in the stop_window.
> 5. Any PASID to separate out admin vq on the VF does not work for three reasons.
> R_1: device flr and device reset must stop all the dmas.
> R_2: PASID by most leading vendors is still not mature enough
> R_3: One also needs to do inversion to not expose PASID capability of the member PCI device to not expose
See above. And what if the guest shuts down? The same answer, right?
>
>> Actually you don't see any technical problems in our config space proposal,
>> right?
> In config registers method, for passthrough I clearly see the technical problems (functional and scale) listed above.
> Due to which config registers cannot reside on the VF and cannot scale either.
So see the answers above.



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-11  3:14           ` Jason Wang
  2023-10-11  6:02             ` Michael S. Tsirkin
@ 2023-10-11 10:47             ` Parav Pandit
  2023-10-11 20:14               ` Michael S. Tsirkin
  2023-10-13  1:15               ` Jason Wang
  1 sibling, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-11 10:47 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, October 11, 2023 8:44 AM
> 
> On Tue, Oct 10, 2023 at 3:19 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, October 10, 2023 11:21 AM
> > >
> > > On Mon, Oct 9, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Monday, October 9, 2023 2:19 PM
> > > > >
> > > > > Adding LingShan.
> > > > >
> > > > Thanks for adding him.
> > > >
> > > > > Parav, if you want any specific people to comment, please do cc them.
> > > > >
> > > > Sure, will cc them in v2 as now I see there is interest in the review.
> > > >
> > > > > On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > > >
> > > > > > One or more passthrough PCI VF devices are ubiquitous for
> > > > > > virtual machines usage using generic kernel framework such as vfio [1].
> > > > >
> > > > > Mentioning a specific subsystem in a specific OS may mislead the
> > > > > user to think it can only work in that setup. Let's not do that,
> > > > > virtio is not only used for Linux and VFIO.
> > > > >
> > > > Not really. it is an example in the cover letter.
> > > > It is not the only use case.
> > > > A use case gives a crisp clarity of what UAPI it needs to fulfil.
> > > > So I will keep it. It is anyway written as one use case.
> > > >
> > > > > >
> > > > > > A passthrough PCI VF device is fully owned by the virtual
> > > > > > machine device driver.
> > > > >
> > > > > Is this true? Even VFIO needs to mediate PCI stuff. Or how do
> > > > > you define "passthrough" here?
> > > > >
> > > > Other than PCI config registers and due to some legacy, msix.
> > > > The "device interface" side is not mediated.
> > > > The definition of passthrough here is: To not mediate a device
> > > > type specific
> > > and virtio specific interfaces for modern and future devices.
> > >
> > > Ok, but what's the difference between "device type specific" and
> > > "virtio specific interfaces". Maybe an example for this?
> > >
> > Virtio device specific means: cvq of crypto device, cvq of net device, flow filter vqs of net device etc.
> > Virtio specific interface: virtio driver notifications, virtio virtqueue and configuration mediation etc.
> >
> > > >
> > > > > > This passthrough device controls its own device reset flow,
> > > > > > basic functionality as PCI VF function level reset
> > > > >
> > > > > How about other PCI stuff? Or Why is FLR special?
> > > > FLR is special for the readers to get the clarity that FLR is also
> > > > done by the
> > > guest driver hence, the device migration commands do not
> > > interact/depend with FLR flow.
> > >
> > > It's still not clear to me how this is done.
> > >
> > > 1) guest starts FLR
> > > 2) adminq freeze the VF
> > > 3) FLR is done
> > >
> > > If the freezing doesn't wait for the FLR, does it mean we need to
> > > migrate to a state like FLR is pending? If yes, do we need to
> > > migrate the other sub states like this? If not, why?
> > >
> > In most practical cases #2 followed by #1 should not happen, as on the source side the expected mode change is from active to stop.
> 
> How does the hypervisor know if a guest is doing what without trapping?
>
The hypervisor does not know. The device knows, being the recipient of both #1 and #2.
 
> > But ok, since the active to freeze mode change is allowed, let's discuss the above.
> >
> > A device is the single synchronization point for any device reset, FLR or admin command operation.
> 
> So you agree we need synchronization? And I'm not sure I get the meaning of
> synchronization point, do you mean the synchronization between freeze/stop
> and virtio facilities?
>
Synchronization means handling two events that arrive in parallel, such as FLR and another operation.
 
> > So, the migration driver do not need to wait for FLR to complete.
> 
> I'm confused, you said below that device context could be changed by FLR.
> 
Yes.
> If FLR needs to clear device context, we can have a race where device context is
> cleared when we are trying to read it?
> 
I didn’t say FLR clears the context.
FLR updates the device context.
The device is serving the device context read/write commands, serving FLR, and answering the mode change command,
so the device knows best how to avoid any race.
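
To make the "single synchronization point" idea concrete, here is a minimal sketch in Python (chosen only for brevity). It is a toy model under stated assumptions, not the proposed device behavior: the class name, the context fields and the use of one lock are all hypothetical. It only illustrates that when the device itself serializes FLR, mode change and context read, a context read observes the state either before or after an FLR, never a mix.

```python
import threading


class MemberDevice:
    """Toy model: one lock serializes FLR, mode change and context read."""

    def __init__(self):
        self._lock = threading.Lock()
        self.mode = "active"
        # Hypothetical device context fields, for illustration only.
        self.context = {"device_status": 0x0F, "flr_count": 0}

    def flr(self):
        # Guest-initiated FLR updates the context instead of racing with reads.
        with self._lock:
            self.context["device_status"] = 0
            self.context["flr_count"] += 1

    def set_mode(self, mode):
        # Owner-driver mode change goes through the same serialization.
        with self._lock:
            self.mode = mode

    def read_context(self):
        # Returns a snapshot taken either before or after any concurrent FLR.
        with self._lock:
            return dict(self.context)
```

The owner driver in this model never waits for FLR explicitly; it simply issues the mode change or context read and the device orders it against whatever else is in progress.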

> > When admin cmd freeze the VF it can expect FLR_completed VF.
> 
> We need to explain why and how about the resume? For example, is resuming
> required to wait for the completion of FLR, if not, why?
> 
> > Secondly since the FLR is local to the source, intermediate sub state does not migrate.
> >
> > But I agree, it is worth to have the text capturing this.
> >
> > > >
> > > > >
> > > > > > and rest of the virtio device functionality such as control
> > > > > > vq,
> > > > >
> > > > > What do you mean by "rest of"?
> > > > >
> > > > As given in the example cvq.
> > > >
> > > > > Which part is not controlled and why?
> > > > Not controlled because as states, it is passthrough device.
> > > >
> > > > > > config space access, data path descriptors handling.
> > > > > >
> > > > > > Additionally, VM live migration using a precopy method is also
> > > > > > widely
> > > used.
> > > > >
> > > > > Why is this mentioned here?
> > > > >
> > > > Huh. You should be positive for bringing clarity to the readers on
> > > understanding the use case.
> > > > And you seem opposite, but ok.
> > > >
> > > > As stated, it for the reader to understand the use case and see
> > > > how proposed
> > > commands addresses the use case.
> > >
> > > The problem is that the hardware features should be designed for a
> > > general purpose instead of a specific technology if it can. The only
> > > missing part for post copy is the page fault.
> > >
> > Ok. The use case and requirement of member device passthrough is clear to most reviewers now.
> 
> In another thread you are saying that the PCI composition is done by hypervisor,
> so passthrough is really confusing at least for me.
>
I explained in that thread which vPCI composition is done.
The PCI config space and MSI-X side of the composition is done.
The virtio interface as a whole is not composed.
 
> > Ok. I assume "reset flow" is clear to you now that it points to section 2.4.
> > This section is not a normative section, so using an extra word like "flow" does not confuse anyone.
> > I will link to the section anyway.
> 
> Probably, but you mention FLR flow as well.
As I said, I am not repeating the PCIe spec here. The reader knows what FLR of the PCIe transport is.

> 
> >
> > > >
> > > > > > and may also undergo PCI function level
> > > > > > +reset(FLR) flow.
> > > > >
> > > > > Why is only FLR special here? I've asked FRS but you ignore the question.
> > > > >
> > > > FLR is special to bring clarity that guest owns the VF doing FLR,
> > > > hence
> > > hypervisor cannot mediate any registers of the VF.
> > >
> > > It's not about mediation at all, it's about how the device can
> > > implement what you want here correctly.
> > >
> > > See my above question.
> > >
> > Ok. it is clear that live migration commands cannot stay on the member device because the member device can undergo device reset and FLR flows owned by the guest.
> 
> I disagree, hypervisors can emulate FLR and never send FLR to real devices.
> 
That would be a trap-based alternative that needs to dissect the device; building infrastructure for such dissection is not desired in the listed use case.
Here we are addressing the requirement of passing through the device.

So your disagreement is fine for non-passthrough devices.

> > (and hypervisor is not involved in these two flows, hence the admin command interface is designed such that it can fulfil above requirements).
> >
> > Theory of operation brings out this clarity. Please notice that it is in introductory section with an example.
> > Not normative line.
> >
> > > >
> > > > > > Such flows must comply to the PCI standard and also
> > > > > > +virtio specification;
> > > > >
> > > > > This seems unnecessary and obvious as it applies to all other
> > > > > PCI and virtio functionality.
> > > > >
> > > > Great. But your comment is contradicts.
> > > >
> > > > > What's more, for the things that need to be synchronized, I
> > > > > don't see any descriptions in this patch. And if it doesn't need, why?
> > > > With which operation should it be synchronized and why?
> > > > Can you please be specific?
> > >
> > > See my above question regarding FLR. And it may have others which I
> > > haven't had time to audit.
> > >
> > Ok. when you get chance to audit, lets discuss that time.
> 
> Well, I'm not the author of this series, it should be your job otherwise it would
> be too late.
> 
As the author, I will cover what we think is needed. If you have specific points that add value, please share them and I will look into them.

> For example, how is the power management interaction with the freeze/stop?
>
Power management is owned by the guest, like any other virtio interface.
So freeze/stop do not interfere with it.
 
> >
> > > >
> > > > It is not written in this series, because we believe it must not
> > > > be synchronized
> > > as it is fully controlled by the guest.
> > > >
> > > > >
> > > > > > at the same time such flows must not obstruct
> > > > > > +the device migration flow. In such a scenario, a group owner
> > > > > > +device can provide the administration command interface to
> > > > > > +facilitate the device migration related operations.
> > > > > > +
> > > > > > +When a virtual machine migrates from one hypervisor to
> > > > > > +another hypervisor, these hypervisors are named as source and
> > > > > > +destination
> > > > > hypervisor respectively.
> > > > > > +In such a scenario, a source hypervisor administers the
> > > > > > +member device to suspend the device and preserves the device
> context.
> > > > > > +Subsequently, a destination hypervisor administers the member
> > > > > > +device to setup a device context and resumes the member device.
> > > > > > +The source hypervisor reads the member device context and the
> > > > > > +destination hypervisor writes the member device context. The
> > > > > > +method to transfer the member device context from the source
> > > > > > +to the destination hypervisor is
> > > > > outside the scope of this specification.
> > > > > > +
> > > > > > +The member device can be in any of the three migration modes.
> > > > > > +The owner driver sets the member device in one of the
> > > > > > +following modes during
> > > > > device migration flow.
> > > > > > +
> > > > > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name
> > > > > > +& Description \\ \hline \hline
> > > > > > +0x0   & Active &
> > > > > > +  It is the default mode after instantiation of the member
> > > > > > +device. \\
> > > > >
> > > > > I don't think we ever define "instantiation" anywhere.
> > > > >
> > > > Well a transport has implicit definition of the instantiation already.
> > > > May be a text can be added, but don’t see a value in duplicating
> > > > PCI spec
> > > here.
> > >
> > > Ok, maybe something like "transport specific instantiation"
> > >
> > Ok. that’s a good text. I will change to it.
> >
> > > >
> > > > > > +\hline
> > > > > > +0x1   & Stop &
> > > > > > + In this mode, the member device does not send any
> > > > > > +notifications, and it does not access any driver memory.
> > > > >
> > > > > What's the meaning of "driver memory"?
> > > > >
> > > > May be guest memory? Or do you suggest a better naming for the
> > > > memory
> > > allocated by the guest driver?
> > >
> > > Virtqueue?
> > >
> > Virtqueue and any memory referred by the virtqueue.
> >
> > This is good text, I will change to it.
> >
> > > >
> > > > > And stop seems to be a source of inflight buffers.
> > > > >
> > > > I didn’t follow it.
> > > > If you mean without stop there are no inflight buffer, then I don’t agree.
> > > > We don’t want to violate the spec by having descriptors with zero
> > > > size
> > > returned.
> > > > Stop is not the source of inflight descriptors.
> > >
> > > I think not since you forbid access to the used ring here. So even
> > > if the buffer were processed by the device it can't be added back to
> > > the used ring thus became inflight ones.
> > >
> > > >
> > > > There are inflight descriptors with the device that are not yet
> > > > returned to the
> > > driver, and device wont return them as zero size wrong completions.
> > > >
> > > > > > + The member device may receive driver notifications in this
> > > > > > + mode,
> > > > >
> > > > > What's the meaning of "receive"? For example if the device can
> > > > > still process buffers, "stop" is not accurate.
> > > > >
> > > > Receive means, driver can send the notification as PCIe TLP that
> > > > device may
> > > receive as incoming PCIe TLP.
> > >
> > > Ok, so this is the transport level. But the device can keep processing the
> queue?
> > >
> > Device cannot process the queue because it does not initiate any read/write towards the virtqueue.
> 
> Read/Write only results in a driver noticeable behaviour, it doesn't mean the
> device can't process the buffers.  For example, devices can keep processing
> available buffers and make them as inflight ones.
> 
The idea is to stop the device and prepare for the migration, hence the command to do so.
Otherwise, just keep the device in active mode and avoid the complications.

> >
> > > >
> > > > In "stop" mode, the device wont process descriptors.
> > >
> > > If the device won't process descriptors, why still allow it to receive
> notifications?
> > Because notification may still arrive and if the device may update any
> > counters as part of
> 
> Which counters did you mean here?
>
The counter that Xuan is adding, and any other state that the device may have to update as a result of a driver notification.
For example, caching the avail index posted in the notification.
 
> > it which needs to be migrated or store the received notification.
> >
> > > Or does it really matter if the device can receive or not here?
> > >
> > From device point of view, the device is given the chance to update its device context as part of notifications or access to it.
> 
> This is in conflict with what you said above " Device cannot process the queue
> ..."
> 
No, it does not.
Device context is updated within the device without accessing the queue memory of the guest.

> Maybe you can give a concrete example.
> 
The above one.
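
As a rough illustration of that example, the sketch below (Python, for brevity) models a device in Stop mode that receives a driver notification. It assumes a notification that carries the available index; the class and field names are hypothetical and are not taken from the patches. The point it shows is narrow: the device context changes, but no guest memory is read and no descriptor is processed.

```python
class StoppedQueueModel:
    """Toy model of a device in Stop mode handling driver notifications."""

    def __init__(self):
        # Hypothetical per-queue device context, kept inside the device.
        self.context = {"cached_avail_idx": None, "notifications_seen": 0}
        # Counts reads/writes of guest memory; never incremented in Stop mode.
        self.guest_memory_accesses = 0

    def on_driver_notification(self, avail_idx):
        # Record only what the notification itself carried. The virtqueue in
        # guest memory is not read and no descriptor is processed.
        self.context["cached_avail_idx"] = avail_idx
        self.context["notifications_seen"] += 1
```

A device could equally ignore the notification and check the available ring after resume; this model only covers the case where it chooses to remember it.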

> >
> > > >
> > > > > > + the member device context
> > > > >
> > > > > I don't think we define "device context" anywhere.
> > > > >
> > > > It is defined further in the description.
> > >
> > > Like this?
> > >
> > > """
> > >  +The member device has a device context which the owner driver can
> > > +either read or write. The member device context consist of any
> > > device  +specific data which is needed by the device to resume its
> > > operation  +when the device mode """
> > >
> > Yes.
> > Further patch-3 adds the device context and also add the link to it in the theory of operation section so reader can read more detail about it.
> >
> > > "Any" is probably too hard for vendors to implement. And in patch 3
> > > I only see virtio device context. Does this mean we don't need
> > > transport
> > > (PCI) context at all? If yes, how can it work?
> > >
> > Right. PCI member device is present at source and destination with its layout, only the virtio device context is transferred.
> > Which part cannot work?
> 
> It is explained in another thread where you are saying the PCI requires
> mediation. I think any author should not ignore such important assumptions in
> both the change log and the patch.
> 
> And again, the more I review the more I see how narrow this series can be used:
>
I explained this before and also covered it in the cover letter.
 
> 1) Only works for SR-IOV member device like VF
It can be extended to SIOV member devices in the future.
Today this is the only type of member device virtio has.

> 2) Mediate PCI but not virtio which is tricky
> 3) Can only work for a specific BAR/capability register layout
> 
> Only 1) is described in the change log.
> 
> The other important assumptions like 2) and 3) are not documented anywhere.
> And this patch never explains why 2) and 3) is needed or why it can be used for
> subsystems other than VFIO/Linux.
>
Since I am not mentioning vfio now, I will refrain from mentioning others as well. :)
 
> >
> > > >
> > > > > >and device configuration space may change. \\
> > > > > > +\hline
> > > > >
> > > > > I still don't get why we need a "stop" state in the middle.
> > > > >
> > > > All pci devices which belong to a single guest VM are not stopped
> atomically.
> > > > Hence, one device which is in freeze mode, may still receive
> > > > driver notifications from other pci device,
> > >
> > > Device may choose to ignore those notifications, no?
> > >
> > > > or it may experience a read from the shared memory and get garbage
> data.
> > >
> > > Could you give me an example for this?
> > >
> > Section 2.10 Shared Memory Regions.
> 
> How can it experience a read in this case?
>
MMIO read/write can be initiated by the peer device while the device is in stopped state.
 
> Btw, shared regions are tricky for hardware.
> 
> >
> > > > And things can break.
> > > > Hence the stop mode, ensures that all the devices get enough
> > > > chance to stop
> > > themselves, and later when freezed, to not change anything internally.
> > > >
> > > > > > +0x2   & Freeze &
> > > > > > + In this mode, the member device does not accept any driver
> > > > > > +notifications,
> > > > >
> > > > > This is too vague. Is the device allowed to be freezed in the
> > > > > middle of any virtio or PCI operations?
> > > > >
> > > > > For example, in the middle of feature negotiation etc. It may
> > > > > cause implementation specific sub-states which can't be migrated easily.
> > > > >
> > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > It is passthrough device, hence hypervisor layer do not get to see sub-
> state.
> > > >
> > > > Not sure why you comment, why it cannot be migrated easily.
> > > > The device context already covers this sub-state.
> > >
> > > 1) driver writes driver_features
> > > 2) driver sets FEATURES_OK
> > >
> > > 3) device receive driver_features
> > > 4) device validating driver_features
> > > 5) device clears FEATURES_OK
> > >
> > > 6) driver reads status and realizes FEATURES_OK is being cleared
> > >
> > > Is it valid to be frozen during any of the above?
> > No. device mode is frozen when hypervisor is sure that no more access by the guest will be done.
> 
> How, you don't trap so 1) and 2) are posted, how can hypervisor know if there's
> inflight transactions to any registers?
> 
Because the hypervisor has stopped the vcpus that issue them.

> > What can happen between #2 and #3, is device mode may change to stop.
> 
> Why can't be freezed in this case? It's really hard to deduce why it can't just
> from your above descriptions.
>
On the source hypervisor, the mode changes are active->stop->freeze.
Hence, by the time freeze is done, the hypervisor knows that all inflight accesses have been stopped.
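
The mode changes discussed in this thread can be sketched as below, in Python for brevity. The mode values are the ones listed in the table of the patch (0x0 Active, 0x1 Stop, 0x2 Freeze). Everything else is an assumption for illustration: the `ALLOWED` set contains only the transitions mentioned in this discussion, and `set_mode` and `source_side_suspend` are hypothetical helper names, not proposed commands.

```python
from enum import IntEnum


class Mode(IntEnum):
    # Values as listed in the mode table of the patch under review.
    ACTIVE = 0x0
    STOP = 0x1
    FREEZE = 0x2


# Assumption: only transitions mentioned in this thread are modelled.
ALLOWED = {
    (Mode.ACTIVE, Mode.STOP),    # source: device stops initiating activity
    (Mode.STOP, Mode.FREEZE),    # source: vcpus stopped, no more driver access
    (Mode.ACTIVE, Mode.FREEZE),  # destination: no other entity uses the device
    (Mode.STOP, Mode.ACTIVE),    # resume
    (Mode.FREEZE, Mode.ACTIVE),  # resume
}


def set_mode(current: Mode, new: Mode) -> Mode:
    """Return the new mode, or raise if this sketch does not model the change."""
    if (current, new) not in ALLOWED:
        raise ValueError(f"mode change {current.name} -> {new.name} not modelled")
    return new


def source_side_suspend(mode: Mode) -> Mode:
    """Hypothetical helper: the source-side order active -> stop -> freeze."""
    mode = set_mode(mode, Mode.STOP)
    return set_mode(mode, Mode.FREEZE)
```

Whether other changes, such as freeze to stop, are permitted is not settled in this discussion, so the sketch rejects them rather than guessing.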
 
> Even if it had, is it even possible to list all the places where freezing is
> prohibited? We don't want to end up with a spec that is hard to implement or
> leave the vendor to figure out those tricky parts.
>
The general idea is to not prohibit the freeze/stop mode change.
If the device needs more time, let the device take the time to do it.

 
> > And in stop mode, device context would capture #5 or #4, depending where is device at that point.
> >
> > > >
> > > > > And what's more, the above state machine seems to be virtio
> > > > > specific, but you don't explain the interaction with the device
> > > > > status state
> > > machine.
> > > > First, above is not a state machine.
> > >
> > > So how do readers know if a state can go to another state and when?
> > >
> > Not sure what you mean by reader. Can you please explain.
> 
> The people who read virtio spec.
> 
So the question is "how do readers know if a state can go to another state and when"?
When a mode can change is described and listed in the table.

> > > So only the driver notification is allowed by not config write?
> > > What's the consideration for allowing driver notification?
> > >
> > Because for most practical purposes, peer device wants to queue blk, net other requests and not do device configuration.
> 
> You forbid the device to process the queue but only allow the notification. How
> can the device queue those requests? The device can just do the available
> buffer check after resume, then it's all fine.
>
The device can always decide to not queue the request and do the available buffer check later.
The peer device may also read from MMIO space.

So the intermediate step covers this aspect where device_type specific plumbing is not done.
It's generic. A device may also choose to omit such doorbells, as long as it knows it can resume.
 
> >
> > Do you know any device configuration space which is RW?
> > For net and blk I recall it as RO?
> 
> For example, WCE. What's more important, the spec allows config space to be
> RW, so even if there's no examples before, it doesn't mean we won't have a RW
> in the future.
> 
Ok.

> >
> > > Let me ask differently, similar to FLR, what happens if the driver
> > > wants a virtio reset but the hypervisor wants to stop or freeze?
> > >
> > The device would respond to stop/freeze request when it has internally started the reset, as device is the single synchronization point which knows how to handle both in parallel.
> 
> Let's define the synchronization point first. And it demonstrates at least devices
> need to synchronize between the freeze/stop and virtio device status machine
> which is not as easy as what is done in this patch.
>
Synchronization point = device.

> >
> > > > We would enrich the device context for this, but no need to
> > > > connects the
> > > admin mode controlled by the owner device with operational state
> > > (device_status) owned by the member device.
> > > >
> > > > > > + it ignores any device configuration space writes,
> > > > >
> > > > > How about read and the device configuration changes?
> > > > >
> > > > As listed, device do not have any changes.
> > > > So device configuration change cannot occur.
> > >
> > > It's not necessarily caused by config write, it could be things like
> > > link status or geometry changes that are initiated from the device.
> > >
> > I understand it. Link status was one example, you listed other examples too.
> > The point is, when in freeze mode, the member device is frozen, hence, device won't initiate those changes.
> >
> > > >
> > > > The device requirements cover this content more explicitly:
> > > >
> > > > For the SR-IOV group type, regardless of the member device mode,
> > > > all the PCI transport level registers MUST be always accessible
> > > > and the member device MUST function the same way for all the PCI
> > > > transport level
> > > registers regardless of the member device mode.
> > > >
> > > > > > + the device do not have any changes in the device context.
> > > > > > + The member device is not accessed in the system through the
> > > > > > + virtio
> > > interface.
> > > > > > + \\
> > > > >
> > > > > But accessible via PCI interface?
> > > > >
> > > > Yes, as usual.
> > > >
> > > > > For example, what happens if we want to freeze during FLR? Does
> > > > > the hypervisor need to wait for the FLR to be completed?
> > > > >
> > > > Hypervisor do not need wait for the FLR to be completed.
> > >
> > > So does FLR change device context?
> > Yes.
> 
> So this implies the freeze needs to wait for FLR otherwise device context may
> change.
> 
The device context can change at any time and reflects the latest state.
I will update the patches to state that the device is the single synchronization point serving FLR and mode changes.

> >
> > >
> > > >
> > > > > > +\hline
> > > > > > +\hline
> > > > > > +0x03-0xFF   & -    & reserved for future use \\
> > > > > > +\hline
> > > > > > +\end{tabularx}
> > > > > > +
> > > > > > +When the owner driver wants to stop the operation of the
> > > > > > +device, the owner driver sets the device mode to
> > > > > > +\field{Stop}. Once the device is in the \field{Stop} mode,
> > > > > > +the device does not initiate any notifications or does not
> > > > > > +access any driver memory. Since the member driver may be
> > > > > > +still active which may send further driver notifications to the device,
> the device context may be updated.
> > > > > > +When the member driver has stopped accessing the device, the
> > > > > > +owner driver sets the device to \field{Freeze} mode
> > > > > > +indicating to the device that no more driver access occurs.
> > > > > > +In the \field{Freeze} mode, no more changes occur in the device
> context.
> > > > > > +At this point, the device ensures that
> > > > > there will not be any update to the device context.
> > > > >
> > > > > What is missed here are:
> > > > >
> > > > > 1) it is a virtio specific states or not
> > > > It is not.
> > > >
> > > > > 2) if it is a virtio specific state, if or how to synchronize
> > > > > with transport specific interfaces and why
> > > > > 3) can active go directly to freeze and why
> > > > >
> > > > Yes. don’t see a reason to not allow it.
> > > > Active to freeze mode can change is useful on the destination side, where destination hypervisor knows for sure that there is no other entity accessing the device.
> > > > And it needs to setup the device context, it received from the source side.
> > > > So setting freeze mode can be done directly.
> > > >
> > > > > > +
> > > > > > +The member device has a device context which the owner driver
> > > > > > +can either read or write. The member device context consist
> > > > > > +of any device specific data which is needed by the device to
> > > > > > +resume its operation when the device mode
> > > > >
> > > > > > This is too vague. There're states that are not suitable for
> > > > > > cmd/queue for sure.
> > > > > I'd split it into
> > > > >
> > > > > 1) common states: virtqueue, dirty pages
> > > > > 2) device specific states: defined be each device
> > > > >
> > > > This is theory of operation section. So it capturing such details.
> > > > > Actual device context definition is outside of theory, and precise states of virtqueue, device specific, etc are in it.
> > >
> > > See my comment above regarding to the device context.
> > >
> > I replied above, device context link is added in the patch-3 in the theory of operation.
> > So reader gets the complete view.
> >
> > > >
> > > > > > +is changed from \field{Stop} to \field{Active} or from
> > > > > > +\field{Freeze} to \field{Active}.
> > > > > > +
> > > > > > +Once the device context is read, it is cleared from the device.
> > > > >
> > > > > This is horrible, it means we can't easily
> > > > >
> > > > > 1) re-try the migration
> > > > > 2) recover from migration failure
> > > > >
> > > > Can you please explain the flow?
> > >
> > > When migration fails, management can choose to resume the device(VM)
> > > on the source.
> > >
> > ok. This should be possible as the management which has the device
> > context, it can restore it on the source and move the device mode to active.
> >
> > > If the state were cleared, it means there's not simple way to resume
> > > the device but restoring the whole context.
> > >
> > Yes, as you say, by restoring the whole context will suffice this corner/rare case scenario.
> >
> > > What's the consideration for such clearing?
> > >
> > There are two considerations.
> > 1.  If one does not clear, till how long should it be kept on the device?
> 
> Until virtio reset, this is how virtio works now. I've pointed out that it may cause
> extra troubles when trying to resume, but you don't tell me what's wrong to
> keep that?
> 
If it is kept, the hypervisor may not be able to decide when to change the mode from active->stop.
We could opt for a mode where the full device context is read in each mode without clearing it.
But then it can be very specific to a version of QEMU, which we are avoiding here.

> > 2. device context returns incremental value from the previous read. So, it needs to clear it.
> 
> I don't understand here. This is not the case for most of the devices.
>
I am not sure which devices you mean by "most of the devices".
The device context functions like write records (aka dirty pages).
Whatever has already been returned is not, and should not be, repeated in subsequent reads, though the device can choose to do so.
 
> >
> > > > And which software stack may find this useful?
> > > > Is there any existing software that can utilize it?
> > >
> > > Libvirt.
> > >
> > Does libvirt restore on migration failure?
> 
> Yes.
> 
Ok. The device will be able to resume when it is marked active.
The device context returned is the incremental delta, as explained above.
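
To illustrate the incremental read semantics, here is a minimal sketch. It is
not the proposed command format: struct dev_ctx, CTX_SECTIONS, ctx_update()
and ctx_read_delta() are hypothetical names, and a real device context section
is much richer than one 32-bit value. It only shows that a read returns what
changed since the previous read and then clears the change markers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CTX_SECTIONS 4 /* hypothetical number of context sections */

struct dev_ctx {
	uint32_t value[CTX_SECTIONS];   /* current content of each section */
	bool     changed[CTX_SECTIONS]; /* modified since the last read? */
};

/* Device side: record a change to one section. */
void ctx_update(struct dev_ctx *c, size_t idx, uint32_t v)
{
	assert(idx < CTX_SECTIONS);
	c->value[idx] = v;
	c->changed[idx] = true;
}

/*
 * Owner driver side: copy out only the sections that changed since the
 * previous read and clear their markers. Returns the number of sections
 * reported; both output arrays must hold CTX_SECTIONS entries.
 */
size_t ctx_read_delta(struct dev_ctx *c, size_t idx_out[], uint32_t val_out[])
{
	size_t n = 0;

	for (size_t i = 0; i < CTX_SECTIONS; i++) {
		if (!c->changed[i])
			continue;
		idx_out[n] = i;
		val_out[n] = c->value[i];
		c->changed[i] = false;
		n++;
	}
	return n;
}
```

With this model, a second read issued right after the first one returns
nothing, and the management software that already holds the earlier deltas can
rebuild the full context from them if it has to restore the source.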

> >
> > > > Why that device context present with the software vanished, in
> > > > your
> > > assumption, if it is?
> > > >
> > > > > > Typically, on
> > > > > > +the source hypervisor, the owner driver reads the device
> > > > > > +context once when the device is in \field{Active} or
> > > > > > +\field{Stop} mode and later once the member device is in
> \field{Freeze} mode.
> > > > >
> > > > > Why need the read while device context could be changed? Or is
> > > > > the dirty page part of the device context?
> > > > >
> > > > It is not part of the dirty page.
> > > > It needs to read in the active/stop mode, so that it can be shared
> > > > with
> > > destination hypervisor, which will pre-setup the complex context of
> > > the device, while it is still running on the source side.
> > >
> > > Is such a method used by any hypervisor?
> > Yes. qemu which uses vfio interface uses it.
> 
> Ok, such software technology could be used for all types of devices, I don't see
> any advantages to mention it here unless it's unique to virtio.
> 
It is the theory of operation that brings clarity and rationale.
So I will keep it.

> >
> > >
> > > >
> > > > > > +
> > > > > > +Typically, the device context is read and written one time on
> > > > > > +the source and the destination hypervisor respectively once
> > > > > > +the device is in \field{Freeze} mode. On the destination
> > > > > > +hypervisor, after writing the device context, when the device
> > > > > > +mode set to \field{Active}, the device uses the most recently
> > > > > > +set device context and resumes the device
> > > > > operation.
> > > > >
> > > > > There's no context sequence, so this is obvious. It's the
> > > > > semantic of all other existing interfaces.
> > > > >
> > > > Can you please what which existing interfaces do you mean here?
> > >
> > > For any common cfg member. E.g queue_addr.
> > >
> > > The driver wrote 100 different values to queue_addr and the device
> > > used the value written last time.
> > >
> > o.k. I don’t see any problem in stating what is done, which is less
> > vague. 😊
> >
> > > >
> > > > > > +
> > > > > > +In an alternative flow, on the source hypervisor the owner
> > > > > > +driver may choose to read the device context first time while
> > > > > > +the device is in \field{Active} mode and second time once the
> > > > > > +device is in \field{Freeze}
> > > > > mode.
> > > > >
> > > > > Who is going to synchronize the device context with possible
> > > > > configuration from the driver?
> > > > >
> > > > Not sure I understand the question.
> > > > If I understand you right, do you mean that, When configuration
> > > > change is done by the guest driver, how does device context change?
> > > >
> > >
> > > Yes.
> > >
> > > > If so, device context reading will reflect the new configuration.
> > >
> > > How do you do that? For example:
> > >
> > > static inline void vp_iowrite64_twopart(u64 val,
> > >                                         __le32 __iomem *lo,
> > >                                         __le32 __iomem *hi)
> > > {
> > >         vp_iowrite32((u32)val, lo);
> > >         vp_iowrite32(val >> 32, hi);
> > > }
> > >
> > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > >
> > Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG section captures the partial value.
> 
> There's no way for the device to know whether or not it's a partial value or not.
> No?
> 
The device does not need to know, because when the guest VM and the device are resumed on the destination, the guest VM will continue by writing the second part.
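
A minimal sketch of that argument, under simplifying assumptions: struct
cfg_field64, guest_write_lo(), guest_write_hi() and migrate() are hypothetical
stand-ins for the 64-bit common config field, the two vp_iowrite32() calls and
the device context transfer. The register content is copied as is, whether or
not the second half has been written yet.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit config field exposed as two 32-bit halves. */
struct cfg_field64 {
	uint32_t lo;
	uint32_t hi;
};

/* Guest driver, step 1: write the low half (first vp_iowrite32()). */
void guest_write_lo(struct cfg_field64 *f, uint64_t val)
{
	f->lo = (uint32_t)val;
}

/* Guest driver, step 2: write the high half (second vp_iowrite32()). */
void guest_write_hi(struct cfg_field64 *f, uint64_t val)
{
	f->hi = (uint32_t)(val >> 32);
}

/* Migration: the field is transferred verbatim, complete or partial. */
struct cfg_field64 migrate(const struct cfg_field64 *src)
{
	return *src;
}

/* Value as the device would see it once it consumes the field. */
uint64_t cfg_value(const struct cfg_field64 *f)
{
	return ((uint64_t)f->hi << 32) | f->lo;
}
```

If the freeze happens after guest_write_lo() only, the destination starts with
the same half-written field; once the guest resumes and calls guest_write_hi(),
the field holds the full value, exactly as it would have without migration.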

> >
> > > >
> > > > > > Similarly, on the
> > > > > > +destination hypervisor writes the device context first time
> > > > > > +while the device is still running in \field{Active} mode on
> > > > > > +the source hypervisor and writes the device context second
> > > > > > +time while the device is in
> > > > > \field{Freeze} mode.
> > > > > > +This flow may result in very short setup time as the device
> > > > > > +context likely have minimal changes from the previously
> > > > > > +written device
> > > context.
> > > > >
> > > > > Is the hypervisor who is in charge of doing the comparison and
> > > > > writing only the delta?
> > > > >
> > > > The spec commands allow to do so. So possibility exists from spec wise.
> > >
> > > There are various optimizations for migration for sure, I don't
> > > think mentioning any specific one is good.
> > >
> > The text is informative text similar to,
> >
> > " However, some devices benefit from the ability to find out the
> > amount of available data in the queue without accessing the virtqueue in
> memory"
> >
> > " To help with these optimizations, when VIRTIO_F_NOTIFICATION_DATA has
> been negotiated".
> >
> > Is this the only optimization in virtio? No, but we still mention the rationale of
> why it exists.
> 
> The above is a good example as it explain VIRTIO_F_NOTIFICATION_DATA is the
> only way without accessing the virtqueue. But this is not the case of migration.
> You said it's just a possibility but not a must which is not the case for
> VIRTIO_F_NOTIFICATION_DATA.
> 
It is one of the optimizations. The comparison is about whether it is one example among several or not.



* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-11 10:07             ` Zhu, Lingshan
@ 2023-10-11 10:54               ` Parav Pandit
  2023-10-11 19:54                 ` Michael S. Tsirkin
  2023-10-12 10:00                 ` Zhu, Lingshan
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-11 10:54 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, October 11, 2023 3:38 PM
> 

> >> The system admin can choose only passthrough some of the devices for
> >> nested guests, so passthrough the PF to L1 guest is not a good idea,
> >> because there can be many devices still work for the host or L1.
> > Possible. One size does not fit all.
> > What I expressed is most common scenarios that user care about.
> don't block existing usecases, don't break the userspace, nested is common.
Nothing is broken, as the virtio spec does not have a single construct to support migration.
If nested is common, can you share performance numbers for a real virtio device with and without 2-level nesting?
I frankly don’t know what they look like.

> >
> >>> In second use case, where one want to bind only one member device to
> >>> one VM, I think same plumbing can be extended to have another VF, to
> >>> take
> >> the role of migration device instead of owner device.
> >>> I don’t see a good way to passthrough and also do in-band migration
> >>> without
> >> lot of device specific trap and emulation.
> >>> I also don’t know the cpu performance numbers with 3 levels of
> >>> nested page
> >> table translation which to my understanding cannot be accelerated by
> >> the current cpu.
> >> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA and so
> on,
> >> there can be performance overhead, but can be done.
> >>
> >> So admin vq migration still don't work for nested, this is surely a blocker.
> > In specific case of member devices are located at different nest level, it does
> not.
> so you got the point, so this series should not be merged.
> >
> > Why prevents you have a peer VF do the role of migration driver?
> > Basically, what I am proposing is, connect two VFs to the L1 guest. One VF is
> migration driver, one VF is passthrough to L2 guest.
> > And same scheme works.
> A peer VF? A management VF? still break the existing usecase. and how do you
> transfer ownership of L2 VF from PF to L1 VF?

A peer management VF which services admin commands (like the PF).
Ownership of the admin commands is delegated to the management VF.

> >
> > On the other hand,
> > Many parts of the cpu subsystem such as PML, page tables do not have N
> level nesting support either.
> page tables could be emulated, as showed to you before, just PA to VA, nested
> PA to nested VA
> > They all work on top of emulation and pay the price for emulation when
> nesting is done.
> > May be that is the first version for virtio too.
> there are performance overhead, but can be done.
> >
> > I frankly feel that nesting support requires industry level eco system support
> not just in virtio.
> > Virtio attempting to focus on nested and having nearly same level
> performance as bare metal seems farfetched.
> > Maybe I am wrong, as we have not seen such high perf nested env even with
> sw based device.
> >
> > What can be possibly done is,
> > 1. What admin commands are useful from this series that can be useful for
> nesting?
> > 2. What admin commands from current series needs extension for nesting?
> > 3. What admin commands do not work at all for nesting, and hence, need to
> have new commands.
> >
> > If we can focus on those, maybe we can find common approach that cater to
> both commands.
> virtio support nested now, dont let your admin vq LM break this.
A new spec addition does not break existing virtio implementations in software.
The new spec additions for owner and member devices do not apply to non-member and non-owner devices.

> >
> >>> Do you know how does it work for Intel x86_64?
> >>> Can it do > 2 level of nested page tables? If no, what is the perf
> >>> characteristics
> >> to expect?
> >> of course that can be done, Page table is not a problem, there are
> >> soft mmu emulation and viommu, through performance overhead.
> > Due to the performance overheads, I really doubt any cloud operator would
> use passthrough virtio device for any sensible workload.
> > But you may know already how nested performance looks like that may be
> acceptable to users.
> Many tenants run their nested cluster. Don't break this.
How did a new spec addition such as the crypto device break the net device?
Or how does net vq interrupt moderation break existing software?
It does not.
They are driven through their own feature bits and admin command capabilities.
This series does not break any existing deployments either.


* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-11 10:25               ` Zhu, Lingshan
@ 2023-10-11 11:43                 ` Parav Pandit
  2023-10-12 10:21                   ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-11 11:43 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, October 11, 2023 3:55 PM

> >>>>> I don’t have any strong opinion to keep it or remove it as most
> >>>>> stakeholders
> >>>> has the clear view of requirements now.
> >>>>> Let me know.
> >>>> So some people use VFs with VFIO. Hence the module name.  This
> >>>> sentence by itself seems to have zero value for the spec. Just drop it.
> >>> Ok. Will drop.
> >> So why not build your admin vq live migration on our config space
> >> solution, get out of the troubles, to make your life easier?
> >>
> > Your this question is completely unrelated to this reply or you misunderstood
> what dropping commit log means.
> if you can rebase admin vq LM on our basic facilities, I think you dont need to
> talk about vfio in the first place, so I ask you to re-consider Jason's proposal.
I don’t really know why you are upset with the vfio term.
It is the cloud operator's use case, and it is listed to indicate how the proposal fits such a use case.
If for some reason you don’t like vfio, fine. Ignore it and move on.

I already answered that I will remove it from the commit log, because the requirements are now well understood by the committee.

Your comment is again unrelated to your past two questions, and repeats an earlier one.

I explained to you the technical problem: the admin commands (not admin VQ) for the basic facilities cannot be done using config registers without a mediation layer.

> >
> > Dropping link to vfio does not drop the requirement.
> > I am ok to drop because requirements are clear of passthrough of member
> device.
> > Vfio is not a trouble at all.
> > Admin command is not a trouble either.
> >
> > The pure technical reason is: all the functionalities proposed cannot be done
> in any other existing way.
> > Why? For below reasons.
> > 1. device context, and write records (aka dirty page addresses) is
> > huge which cannot be shared using config registers at scale of 4000
> > member devices
> dirty page tracking will be implmemented in V2, actually I have the patch right
> now.
That is yet again an invitation to non_collaboration mode.
Without reviewing v0 and v1, you want to show dirty page tracking in some other way.

But ok, that is your non_cooperative mode of working. I cannot help further.

> inflight descriptor tracking will be implemented by Eugenio in V2.
When we have a nearly complete proposal from two device vendors, you want to push something to an unknown future without reviewing the work; that does not make sense.

You are still in the mode of _take_ what we did, with near zero explanation.
You asked why the passthrough proposal cannot take advantage of in_band config registers.
I explained the technical reasons listed here.

So please don’t jump to conclusions before finishing the discussion on how both sides can take advantage of each other.

Let's please do that.

> There are no scale problem as I repeated for many time, they are per-device
> basic facilities, just migrate the VF by its own facility, so there are no 40000
> member devices, this is not per PF.
> 
I explained that the device reset, FLR, etc. flows cannot work in passthrough mode when the controlling and controlled functions are a single entity.
The scale problem is that one needs to duplicate the registers on each VF.
The industry is moving away from the register interface in many _real_ hw device implementations.
Some examples are IMS, SIOV, NVMe and more.

> The device context can be read from config space or trapped, like shadow
There are 1 million flows of the net device flow filters in progress.
Each flow is 64B in size.
Total size is 64MB.
I don’t see how one can read that amount of memory using config registers.
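
The arithmetic, as a small sketch. The flow count and the per-flow size are the
numbers given above; the 4-byte register access width is only an assumption for
illustration, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Total bytes of flow filter state for a flow count and a per-flow size. */
uint64_t flow_ctx_bytes(uint64_t flows, uint64_t bytes_per_flow)
{
	return flows * bytes_per_flow;
}

/*
 * Number of register accesses needed to move 'bytes' when each access
 * transfers 'access_width' bytes, rounded up.
 */
uint64_t reg_accesses(uint64_t bytes, uint64_t access_width)
{
	return (bytes + access_width - 1) / access_width;
}
```

flow_ctx_bytes(1000000, 64) gives 64,000,000 bytes, i.e. 64MB, and with the
assumed 4-byte accesses reg_accesses() gives 16 million register reads for one
device, before multiplying by the number of member devices.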

> control vq which is already done, that is basic virtualization.

There is nothing like "basic virtualization".
What is proposed here is fulfilling the requirement of passthrough mode.

Your comment is implying, "I don’t care for passthrough requirements, do non_passthrough".

The discussion should be,
How can we leverage common framework for passthrough and mediated mode?
Can we? If so, which are the pieces?

For me it is frankly very weird to take a native virtio member device, convert it into a mediated device using a giant software stack, and after that convolution get a virtio device.
But for the nested case you have the use case.
So if we focus positively on how the two use cases can use some common functionality, that will be great.

> If you want to migrate device context, you need to specify device context for
> every type of device, net maybe easy, how do you see virtio-fs?
Virtio-fs will have its own device context too.
Every device has some sort of backend to a varying degree.
Net is a widely used and moderately complex device.
Fs is slightly stateful but less complex than net, as it has far fewer control operations.
In fact, virtio-fs already discusses migrating the device-side state, as listed in the device context.
So the virtio-fs device will have its own device context defined.

The infrastructure and basic facilities are set up in this series, so that one can easily extend them for all current and new device types.

> And we are migrating stateless devices, or no? How do you migrate virtio-fs?
> > 2. sharing such large context and write addresses in parallel for
> > multiple devices cannot be done using single register file
> see above
> > 3. These registers cannot be residing in the VF because VF can undergo
> > FLR, and device reset which must clear these registers
> 
> do you mean you want to audit all PCI features? When FLR, the device is rested,

> do you expect a device remember anything after FLR?
Not at all. VF member device will not remember anything after FLR.
> Do you want to trap FLR? Why?
This proposal does _not_ want to trap the FLR in the hypervisor virtio driver.

When one does the mediation-based design, it must trap/emulate/fake the FLR.
It helps to address the case of nested as you mentioned.
> 
> Why FLR block or conflict with live migration?
It does not block or conflict.

The whole point is, when you put the live migration functionality on the VF itself, you just cannot FLR this device.
One must trap the FLR, do a fake FLR, and build the whole infrastructure to not FLR the device.
That is not a passthrough device.

> 
> > 4. When VF does the DMA, all dma occurs in the guest address space, not in
> hypervisor space; any flr and device reset must stop such dma.
> > And device reset and flr are controlled by the guest (not mediated by
> hypervisor).
> if the guest reset the device, it is totally reasonable operation, and the guest
> own the risk, right?
Sure, but the guest still expects its dirty pages and device context to be migrated across device_reset.
Device_reset will lose all this information within the device if done without mediation and special care.

So, to avoid that now one needs to have fake device reset too and build that infrastructure to not reset.

The passthrough proposal fundamental concept is: 

all the native virtio functionalities are between guest driver and the actual device.

> and still, do you want to audit every PCI features? at least you didn't do that in
> your series.
Can you please list which PCI features audit you are talking about?

Keep in mind that, with all the mediation, one must now equally audit all this giant software stack too.
So maybe it is fine for those who are ok with it.

> For migration, you know the hypervisor takes the ownership of the device in the
> stop_window.
I do not know what stop_window means.
Do you mean stop_copy of vfio, or is it a QEMU term?

> > 5. Any PASID to separate out admin vq on the VF does not work for two
> reasons.
> > R_1: device flr and device reset must stop all the dmas.
> > R_2: PASID by most leading vendors is still not mature enough
> > R_3: One also needs to do inversion to not expose PASID capability of
> > the member PCI device to not expose
> see above and what if guest shutdown? the same answer, right?
Not sure I follow.
If the guest shuts down, the guest-specific shutdown APIs are called.

With passthrough device, R_1 just works as is.
R_3 is not needed as they are directly given to the guest.
R_2 platform dependency is not needed either.

> >
> >> Actually you don't see any technical problems in our config space
> >> proposal, right?
> > In config registers method, for passthrough I clearly see the technical
> problems (functional and scale) listed above.
> > Due to which config registers cannot reside on the VF and cannot scale either.
> so see above answers.



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-10  8:57           ` Zhu, Lingshan
  2023-10-10  9:40             ` Parav Pandit
@ 2023-10-11 19:51             ` Michael S. Tsirkin
  2023-10-12 10:23               ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-11 19:51 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Tue, Oct 10, 2023 at 04:57:52PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 10/10/2023 1:21 AM, Parav Pandit wrote:
> > 
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Monday, October 9, 2023 9:50 PM
> > > > > > One or more passthrough PCI VF devices are ubiquitous for virtual
> > > > > > machines usage using generic kernel framework such as vfio [1].
> > > > > Mentioning a specific subsystem in a specific OS may mislead the
> > > > > user to think it can only work in that setup. Let's not do that,
> > > > > virtio is not only used for Linux and VFIO.
> > > > This is just one example on how these commands are useful.
> > > > It can be useful in more ways too in more OSes too.
> > > > I will drop from the patch commit log and keep as information purpose in
> > > cover letter.
> > > > Would that work for you?
> > > > 
> > > > I don’t have any strong opinion to keep it or remove it as most stakeholders
> > > has the clear view of requirements now.
> > > > Let me know.
> > > So some people use VFs with VFIO. Hence the module name.  This sentence by
> > > itself seems to have zero value for the spec. Just drop it.
> > Ok. Will drop.
> So

This is apropos what?

> why not build your admin vq live migration on our config space solution,
> get out of the troubles, to make your life easier?
> Actually you don't see any technical problems in our config space proposal,
> right?

The status bit one? You enumerated some reasons yourself did you not?
And I sent some right when you asked for help seeing more... or did it go right past?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-11 10:54               ` Parav Pandit
@ 2023-10-11 19:54                 ` Michael S. Tsirkin
  2023-10-12 10:00                 ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-11 19:54 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 11, 2023 at 10:54:39AM +0000, Parav Pandit wrote:
> 
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Wednesday, October 11, 2023 3:38 PM
> > 
> 
> > >> The system admin can choose only passthrough some of the devices for
> > >> nested guests, so passthrough the PF to L1 guest is not a good idea,
> > >> because there can be many devices still work for the host or L1.
> > > Possible. One size does not fit all.
> > > What I expressed is most common scenarios that user care about.
> > don't block existing usecases, don't break the userspace, nested is common.
> Nothing is broken as virtio spec do not have any single construct to support migration.
> If nested is common, can you share the performance number with real virtio device with/without 2 level nesting?

Nested is a use case relevant for virtualization. Performance numbers on
current hardware and current market analysis are really beside the
point.

> I frankly don’t know how they look like.
> 
> > >
> > >>> In second use case, where one want to bind only one member device to
> > >>> one VM, I think same plumbing can be extended to have another VF, to
> > >>> take
> > >> the role of migration device instead of owner device.
> > >>> I don’t see a good way to passthrough and also do in-band migration
> > >>> without
> > >> lot of device specific trap and emulation.
> > >>> I also don’t know the cpu performance numbers with 3 levels of
> > >>> nested page
> > >> table translation which to my understanding cannot be accelerated by
> > >> the current cpu.
> > >> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA and so
> > on,
> > >> there can be performance overhead, but can be done.
> > >>
> > >> So admin vq migration still don't work for nested, this is surely a blocker.
> > > In specific case of member devices are located at different nest level, it does
> > not.
> > so you got the point, so this series should not be merged.
> > >
> > > Why prevents you have a peer VF do the role of migration driver?
> > > Basically, what I am proposing is, connect two VFs to the L1 guest. One VF is
> > migration driver, one VF is passthrough to L2 guest.
> > > And same scheme works.
> > A peer VF? A management VF? still break the existing usecase. and how do you
> > transfer ownership of L2 VF from PF to L1 VF?
> 
> A peer management VF which services admin command (like PF).
> Ownership of admin command is delegated to the management VF.

That sounds really awkward.

> > >
> > > On the other hand,
> > > Many parts of the cpu subsystem such as PML, page tables do not have N
> > level nesting support either.
> > page tables could be emulated, as showed to you before, just PA to VA, nested
> > PA to nested VA
> > > They all work on top of emulation and pay the price for emulation when
> > nesting is done.
> > > May be that is the first version for virtio too.
> > there are performance overhead, but can be done.
> > >
> > > I frankly feel that nesting support requires industry level eco system support
> > not just in virtio.
> > > Virtio attempting to focus on nested and having nearly same level
> > performance as bare metal seems farfetched.
> > > Maybe I am wrong, as we have not seen such high perf nested env even with
> > sw based device.
> > >
> > > What can be possibly done is,
> > > 1. What admin commands are useful from this series that can be useful for
> > nesting?
> > > 2. What admin commands from current series needs extension for nesting?
> > > 3. What admin commands do not work at all for nesting, and hence, need to
> > have new commands.
> > >
> > > If we can focus on those, maybe we can find common approach that cater to
> > both commands.
> > virtio support nested now, dont let your admin vq LM break this.
> New spec addition is not breaking existing virtio implementation in sw.
> New spec additions of owner and member devices do not apply to non member and non owner devices.
> 
> > >
> > >>> Do you know how does it work for Intel x86_64?
> > >>> Can it do > 2 level of nested page tables? If no, what is the perf
> > >>> characteristics
> > >> to expect?
> > >> of course that can be done, Page table is not a problem, there are
> > >> soft mmu emulation and viommu, through performance overhead.
> > > Due to the performance overheads, I really doubt any cloud operator would
> > use passthrough virtio device for any sensible workload.
> > > But you may know already how nested performance looks like that may be
> > acceptable to users.
> > Many tenants run their nested cluster. Don't break this.
> How new spec addition such as crypto device addition broke net device?
> Or how net vq interrupt moderation breaks existing sw?
> It does not.
> They are driven through their own feature bits and admin command capabilities.
> It does not break any existing deployments.




* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-11 10:47             ` Parav Pandit
@ 2023-10-11 20:14               ` Michael S. Tsirkin
  2023-10-12 10:21                 ` Parav Pandit
  2023-10-13  1:15               ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-11 20:14 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Wed, Oct 11, 2023 at 10:47:23AM +0000, Parav Pandit wrote:
> 
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, October 11, 2023 8:44 AM
> > 
> > On Tue, Oct 10, 2023 at 3:19 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, October 10, 2023 11:21 AM
> > > >
> > > > On Mon, Oct 9, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, October 9, 2023 2:19 PM
> > > > > >
> > > > > > Adding LingShan.
> > > > > >
> > > > > Thanks for adding him.
> > > > >
> > > > > > Parav, if you want any specific people to comment, please do cc them.
> > > > > >
> > > > > Sure, will cc them in v2 as now I see there is interest in the review.
> > > > >
> > > > > > On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > >
> > > > > > > One or more passthrough PCI VF devices are ubiquitous for
> > > > > > > virtual machines usage using generic kernel framework such as vfio [1].
> > > > > >
> > > > > > Mentioning a specific subsystem in a specific OS may mislead the
> > > > > > user to think it can only work in that setup. Let's not do that,
> > > > > > virtio is not only used for Linux and VFIO.
> > > > > >
> > > > > Not really. it is an example in the cover letter.
> > > > > It is not the only use case.
> > > > > A use case gives a crisp clarity of what UAPI it needs to fulfil.
> > > > > So I will keep it. It is anyway written as one use case.
> > > > >
> > > > > > >
> > > > > > > A passthrough PCI VF device is fully owned by the virtual
> > > > > > > machine device driver.
> > > > > >
> > > > > > Is this true? Even VFIO needs to mediate PCI stuff. Or how do
> > > > > > you define "passthrough" here?
> > > > > >
> > > > > Other than PCI config registers and due to some legacy, msix.
> > > > > The "device interface" side is not mediated.
> > > > > > The definition of passthrough here is: To not mediate a device
> > > > > > type specific and virtio specific interfaces for modern and
> > > > > > future devices.
> > > >
> > > > Ok, but what's the difference between "device type specific" and
> > > > "virtio specific interfaces". Maybe an example for this?
> > > >
> > > Virtio device specific means: cvq of crypto device, cvq of net device, flow filter
> > > vqs of net device etc.
> > > Virtio specific interface: virtio driver notifications, virtio virtqueue and
> > > configuration mediation etc.
> > >
> > > > >
> > > > > > > This passthrough device controls its own device reset flow,
> > > > > > > basic functionality as PCI VF function level reset
> > > > > >
> > > > > > How about other PCI stuff? Or Why is FLR special?
> > > > > > FLR is special so that readers get the clarity that FLR is also
> > > > > > done by the guest driver; hence, the device migration commands do not
> > > > > > interact with or depend on the FLR flow.
> > > >
> > > > It's still not clear to me how this is done.
> > > >
> > > > 1) guest starts FLR
> > > > 2) adminq freeze the VF
> > > > 3) FLR is done
> > > >
> > > > If the freezing doesn't wait for the FLR, does it mean we need to
> > > > migrate to a state like FLR is pending? If yes, do we need to
> > > > migrate the other sub states like this? If not, why?
> > > >
> > > In most practical cases #2 followed by #1 should not happen, as on the source
> > > side the expected mode change is from active to stop.
> > 
> > How does the hypervisor know if a guest is doing what without trapping?
> >
> Hypervisor does not know. The device knows being the recipient of #1 and #2.
>  
> > > But ok, since the active to freeze mode change is allowed, let's discuss the above.
> > >
> > > A device is the single synchronization point for any device reset, FLR or admin
> > > command operation.
> > 
> > So you agree we need synchronization? And I'm not sure I get the meaning of
> > synchronization point, do you mean the synchronization between freeze/stop
> > and virtio facilities?
> >
> Synchronization means, handling two events in parallel such as FLR and other.
>  
> > > So, the migration driver does not need to wait for FLR to complete.
> > 
> > I'm confused, you said below that device context could be changed by FLR.
> > 
> Yes.
> > If FLR needs to clear device context, we can have a race where device context is
> > cleared when we are trying to read it?
> > 
> I didn’t say clear the context.
> FLR updates the device context.
> The device is serving the device context read/write commands, serving FLR, and answering the mode change command,
> so the device knows best how to avoid any race.

Heh well but if drivers depend on specific behaviour then we really
need to document that in the spec.

> > > When admin cmd freeze the VF it can expect FLR_completed VF.
> > 
> > We need to explain why and how about the resume? For example, is resuming
> > required to wait for the completion of FLR, if not, why?
> > 
> > > Secondly since the FLR is local to the source, intermediate sub state does not
> > > migrate.
> > >
> > > But I agree, it is worth to have the text capturing this.
> > >
> > > > >
> > > > > >
> > > > > > > and rest of the virtio device functionality such as control
> > > > > > > vq,
> > > > > >
> > > > > > What do you mean by "rest of"?
> > > > > >
> > > > > As given in the example cvq.
> > > > >
> > > > > > Which part is not controlled and why?
> > > > > Not controlled because, as stated, it is a passthrough device.
> > > > >
> > > > > > > config space access, data path descriptors handling.
> > > > > > >
> > > > > > > Additionally, VM live migration using a precopy method is also
> > > > > > > widely used.
> > > > > >
> > > > > > Why is this mentioned here?
> > > > > >
> > > > > Huh. You should be positive about bringing clarity to the readers on
> > > > > understanding the use case.
> > > > > And you seem opposite, but ok.
> > > > >
> > > > > As stated, it is for the reader to understand the use case and see
> > > > > how the proposed commands address the use case.
> > > >
> > > > The problem is that the hardware features should be designed for a
> > > > general purpose instead of a specific technology if it can. The only
> > > > missing part for post copy is the page fault.
> > > >
> > > Ok. The use case and requirement of member device passthrough is clear to
> > > most reviewers now.
> > 
> > In another thread you are saying that the PCI composition is done by hypervisor,
> > so passthrough is really confusing at least for me.
> >
> I explained there what vPCI composition is done there.
> PCI config space and msix side of composition is done.
> The whole virtio interface is not composed.
>  
> > > Ok. I assume "reset flow" is clear to you now that it points to section 2.4.
> > > This section is not a normative section, so using an extra word like "flow" does
> > > not confuse anyone.
> > > I will link to the section anyway.
> > 
> > Probably, but you mention FLR flow as well.
> As I said, not repeating the PCIe spec here. The reader knows what FLR of the PCIe transport is.

What I worry about however, is what happens for example if FLR
is triggered while an admin command is in progress.
This applies to things like legacy admin commands by the way.


> > 
> > >
> > > > >
> > > > > > > and may also undergo PCI function level
> > > > > > > +reset(FLR) flow.
> > > > > >
> > > > > > Why is only FLR special here? I've asked FRS but you ignore the question.
> > > > > >
> > > > > FLR is special to bring clarity that the guest owns the VF doing FLR,
> > > > > hence the hypervisor cannot mediate any registers of the VF.
> > > >
> > > > It's not about mediation at all, it's about how the device can
> > > > implement what you want here correctly.
> > > >
> > > > See my above question.
> > > >
> > > Ok. It is clear that live migration commands cannot stay on the member device
> > > because the member device can undergo device reset and FLR flows owned by
> > > the guest.
> > 
> > I disagree, hypervisors can emulate FLR and never send FLR to real devices.
> > 
> That would be some other trap-based alternative that needs to dissect the device; building infrastructure for such dissection is not desired in the listed use case.
> Here we are addressing the requirement of passthrough the device.
> 
> So your disagreement is fine for non-passthrough devices.
> 
> > > (and the hypervisor is not involved in these two flows, hence the admin command
> > > interface is designed such that it can fulfil the above requirements).
> > >
> > > Theory of operation brings out this clarity. Please notice that it is in
> > > an introductory section with an example.
> > > Not a normative line.
> > >
> > > > >
> > > > > > > Such flows must comply to the PCI standard and also
> > > > > > > +virtio specification;
> > > > > >
> > > > > > This seems unnecessary and obvious as it applies to all other
> > > > > > PCI and virtio functionality.
> > > > > >
> > > > > Great. But your comment is contradictory.
> > > > >
> > > > > > What's more, for the things that need to be synchronized, I
> > > > > > don't see any descriptions in this patch. And if it doesn't need, why?
> > > > > With which operation should it be synchronized and why?
> > > > > Can you please be specific?
> > > >
> > > > See my above question regarding FLR. And it may have others which I
> > > > haven't had time to audit.
> > > >
> > > Ok. When you get a chance to audit, let's discuss at that time.
> > 
> > Well, I'm not the author of this series, it should be your job otherwise it would
> > be too late.
> > 
> As the author, I will cover what we think is needed. If you have specific points that add value, please share and I will look into them.
> 
> > For example, how is the power management interaction with the freeze/stop?
> >
> Power management is owned by the guest, like any other virtio interface.
> So freeze/stop do not interfere with it.

I am not sure what exactly all this means though.
Should be clarified, in some way.

> > >
> > > > >
> > > > > It is not written in this series, because we believe it must not
> > > > > be synchronized as it is fully controlled by the guest.
> > > > >
> > > > > >
> > > > > > > at the same time such flows must not obstruct
> > > > > > > +the device migration flow. In such a scenario, a group owner
> > > > > > > +device can provide the administration command interface to
> > > > > > > +facilitate the device migration related operations.
> > > > > > > +
> > > > > > > +When a virtual machine migrates from one hypervisor to
> > > > > > > +another hypervisor, these hypervisors are named as source and
> > > > > > > +destination
> > > > > > hypervisor respectively.
> > > > > > > +In such a scenario, a source hypervisor administers the
> > > > > > > +member device to suspend the device and preserves the device
> > context.
> > > > > > > +Subsequently, a destination hypervisor administers the member
> > > > > > > +device to setup a device context and resumes the member device.
> > > > > > > +The source hypervisor reads the member device context and the
> > > > > > > +destination hypervisor writes the member device context. The
> > > > > > > +method to transfer the member device context from the source
> > > > > > > +to the destination hypervisor is
> > > > > > outside the scope of this specification.
> > > > > > > +
> > > > > > > +The member device can be in any of the three migration modes.
> > > > > > > +The owner driver sets the member device in one of the
> > > > > > > +following modes during
> > > > > > device migration flow.
> > > > > > > +
> > > > > > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name
> > > > > > > +& Description \\ \hline \hline
> > > > > > > +0x0   & Active &
> > > > > > > +  It is the default mode after instantiation of the member
> > > > > > > +device. \\
> > > > > >
> > > > > > I don't think we ever define "instantiation" anywhere.
> > > > > >
> > > > > Well a transport has implicit definition of the instantiation already.
> > > > > May be a text can be added, but don’t see a value in duplicating
> > > > > PCI spec
> > > > here.
> > > >
> > > > Ok, maybe something like "transport specific instantiation"
> > > >
> > > Ok. that’s a good text. I will change to it.
> > >
> > > > >
> > > > > > > +\hline
> > > > > > > +0x1   & Stop &
> > > > > > > + In this mode, the member device does not send any
> > > > > > > +notifications, and it does not access any driver memory.
> > > > > >
> > > > > > What's the meaning of "driver memory"?
> > > > > >
> > > > > Maybe guest memory? Or do you suggest a better name for the
> > > > > memory allocated by the guest driver?
> > > >
> > > > Virtqueue?
> > > >
> > > Virtqueue and any memory referred by the virtqueue.
> > >
> > > This is good text, I will change to it.
> > >
> > > > >
> > > > > > And stop seems to be a source of inflight buffers.
> > > > > >
> > > > > I didn’t follow it.
> > > > > If you mean without stop there are no inflight buffer, then I don’t agree.
> > > > > We don’t want to violate the spec by having descriptors with zero
> > > > > size returned.
> > > > > Stop is not the source of inflight descriptors.
> > > >
> > > > I think not since you forbid access to the used ring here. So even
> > > > if the buffer were processed by the device it can't be added back to
> > > > the used ring thus became inflight ones.
> > > >
> > > > >
> > > > > There are inflight descriptors with the device that are not yet
> > > > > returned to the driver, and the device won't return them as
> > > > > zero-size wrong completions.
> > > > >
> > > > > > > + The member device may receive driver notifications in this
> > > > > > > + mode,
> > > > > >
> > > > > > What's the meaning of "receive"? For example if the device can
> > > > > > still process buffers, "stop" is not accurate.
> > > > > >
> > > > > Receive means the driver can send the notification as a PCIe TLP that
> > > > > the device may receive as an incoming PCIe TLP.
> > > >
> > > > Ok, so this is the transport level. But the device can keep processing the
> > queue?
> > > >
> > > Device cannot process the queue because it does not initiate any read/write
> > towards the virtqueue.
> > 
> > Read/Write only results in a driver noticeable behaviour, it doesn't mean the
> > device can't process the buffers.  For example, devices can keep processing
> > available buffers and make them as inflight ones.
> > 
> The idea is to stop the device and prepare for the migration, hence the command to do so.
> Otherwise just keep the device in active mode and avoid the complications.
> 
> > >
> > > > >
> > > > > In "stop" mode, the device wont process descriptors.
> > > >
> > > > If the device won't process descriptors, why still allow it to receive
> > notifications?
> > > Because notification may still arrive and if the device may update any
> > > counters as part of
> > 
> > Which counters did you mean here?
> >
> The counter that Xuan is adding and any other state that device may have to update as result of driver notification.
> For example caching the posted avail index in the notification.
>  
> > > it which needs to be migrated or store the received notification.
> > >
> > > > Or does it really matter if the device can receive or not here?
> > > >
> > > From device point of view, the device is given the chance to update its device
> > context as part of notifications or access to it.
> > 
> > This is in conflict with what you said above " Device cannot process the queue
> > ..."
> > 
> No, it does not.
> Device context is updated within the device without accessing the queue memory of the guest.
> 
> > Maybe you can give a concrete example.
> > 
> The above one.
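
To make the example above concrete, here is a minimal sketch (Python, purely illustrative) of a device that, while in Stop mode, still accepts a driver notification and caches the posted avail index inside the device, without reading the virtqueue memory of the guest. It assumes the notification itself carries the avail index; the class and field names are hypothetical and nothing here is taken from the patches.

```python
from enum import IntEnum


class Mode(IntEnum):
    ACTIVE = 0x0
    STOP = 0x1
    FREEZE = 0x2


class QueueModel:
    """Toy model of one virtqueue of a member device (hypothetical names)."""

    def __init__(self):
        self.mode = Mode.ACTIVE
        self.cached_avail_idx = None  # device-internal, part of the device context
        self.guest_memory_reads = 0   # counts accesses to virtqueue memory

    def driver_notify(self, avail_idx):
        """Handle a driver notification that carries the posted avail index."""
        if self.mode == Mode.FREEZE:
            # Freeze: notifications are not accepted, the context must not change.
            return False
        # Active and Stop: remember what the driver posted.
        self.cached_avail_idx = avail_idx
        if self.mode == Mode.ACTIVE:
            # Only in Active does the device go and read the virtqueue.
            self.guest_memory_reads += 1
        return True
```

In this model the device context changes in Stop mode (the cached index) while the count of guest memory accesses stays at zero, which is the distinction being argued above.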
> 
> > >
> > > > >
> > > > > > > + the member device context
> > > > > >
> > > > > > I don't think we define "device context" anywhere.
> > > > > >
> > > > > It is defined further in the description.
> > > >
> > > > Like this?
> > > >
> > > > """
> > > > +The member device has a device context which the owner driver can
> > > > +either read or write. The member device context consist of any
> > > > +device specific data which is needed by the device to resume its
> > > > +operation when the device mode
> > > > """
> > > >
> > > Yes.
> > > Further, patch-3 adds the device context and also adds the link to it in the
> > > theory of operation section so the reader can read more detail about it.
> > >
> > > > "Any" is probably too hard for vendors to implement. And in patch 3
> > > > I only see virtio device context. Does this mean we don't need
> > > > transport
> > > > (PCI) context at all? If yes, how can it work?
> > > >
> > > Right. PCI member device is present at source and destination with its layout,
> > > only the virtio device context is transferred.
> > > Which part cannot work?
> > 
> > It is explained in another thread where you are saying the PCI requires
> > mediation. I think any author should not ignore such important assumptions in
> > both the change log and the patch.
> > 
> > And again, the more I review the more I see how narrow this series can be used:
> >
> I explained this before and also covered in the cover letter.
>  
> > 1) Only works for SR-IOV member device like VF
> It can be extended to SIOV member device in future.
> Today these are the only type of member device virtio has.
> 
> > 2) Mediate PCI but not virtio which is tricky
> > 3) Can only work for a specific BAR/capability register layout
> > 
> > Only 1) is described in the change log.
> > 
> > The other important assumptions like 2) and 3) are not documented anywhere.
> > And this patch never explains why 2) and 3) is needed or why it can be used for
> > subsystems other than VFIO/Linux.
> >
> Since I am not mentioning vfio now, I will refrain from mentioning others as well. :)
>  
> > >
> > > > >
> > > > > > >and device configuration space may change. \\
> > > > > > > +\hline
> > > > > >
> > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > >
> > > > > All pci devices which belong to a single guest VM are not stopped
> > atomically.
> > > > > Hence, one device which is in freeze mode, may still receive
> > > > > driver notifications from other pci device,
> > > >
> > > > Device may choose to ignore those notifications, no?
> > > >
> > > > > or it may experience a read from the shared memory and get garbage
> > data.
> > > >
> > > > Could you give me an example for this?
> > > >
> > > Section 2.10 Shared Memory Regions.
> > 
> > How can it experience a read in this case?
> >
> MMIO read/write can be initiated by the peer device while the device is in stopped state.

worth mentioning

> > Btw, shared regions are tricky for hardware.
> > 
> > >
> > > > > And things can break.
> > > > > Hence the stop mode ensures that all the devices get enough
> > > > > chance to stop themselves and later, when frozen, to not change
> > > > > anything internally.
> > > > >
> > > > > > > +0x2   & Freeze &
> > > > > > > + In this mode, the member device does not accept any driver
> > > > > > > +notifications,
> > > > > >
> > > > > > This is too vague. Is the device allowed to be freezed in the
> > > > > > middle of any virtio or PCI operations?
> > > > > >
> > > > > > For example, in the middle of feature negotiation etc. It may
> > > > > > cause implementation specific sub-states which can't be migrated easily.
> > > > > >
> > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > It is a passthrough device, hence the hypervisor layer does not get to see
> > > > > the sub-state.
> > > > >
> > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > The device context already covers this sub-state.
> > > >
> > > > 1) driver writes driver_features
> > > > 2) driver sets FEATURES_OK
> > > >
> > > > 3) device receive driver_features
> > > > 4) device validating driver_features
> > > > 5) device clears FEATURES_OK
> > > >
> > > > 6) driver reads status and realizes FEATURES_OK is being cleared
> > > >
> > > > Is it valid to be frozen in the middle of the above?
> > > No. The device mode is set to freeze when the hypervisor is sure that no more
> > > accesses will be made by the guest.
> > 
> > How, you don't trap so 1) and 2) are posted, how can hypervisor know if there's
> > inflight transactions to any registers?
> > 
> Because hypervisor has stopped the vcpus which are issuing them.
> 
> > > What can happen between #2 and #3, is device mode may change to stop.
> > 
> > Why can't it be frozen in this case? It's really hard to deduce why it can't just
> > from your above descriptions.
> >
> On the source hypervisor, the mode changes are active->stop->freeze.
> Hence when freeze is done, the hypervisor knows that all inflight has been stopped by now.
>  
> > Even if it had, is it even possible to list all the places where freezing is
> > prohibited? We don't want to end up with a spec that is hard to implement or
> > leave the vendor to figure out those tricky parts.
> >
> The general idea is not prohibiting the freeze/stop mode.
> If the device needs more time, let device take time to do it.
> 
>  
> > > And in stop mode, the device context would capture #5 or #4, depending on where the
> > > device is at that point.
> > >
> > > > >
> > > > > > And what's more, the above state machine seems to be virtio
> > > > > > specific, but you don't explain the interaction with the device
> > > > > > status state
> > > > machine.
> > > > > First, above is not a state machine.
> > > >
> > > > So how do readers know if a state can go to another state and when?
> > > >
> > > Not sure what you mean by reader. Can you please explain.
> > 
> > The people who read virtio spec.
> > 
> So question is "how reader knows if a state can go to another state and when"?
> It is described and listed in the table, when a mode can change.
> 
> > > > So only the driver notification is allowed by not config write?
> > > > What's the consideration for allowing driver notification?
> > > >
> > > Because for most practical purposes, peer device wants to queue blk, net
> > other requests and not do device configuration.
> > 
> > You forbid the device to process the queue but only allow the notification. How
> > can the device queue those requests? The device can just do the available
> > buffer check after resume, then it's all fine.
> >
> Device can always decide to not queue the request and do the available buffer check later.
> The peer device may read also from MMIO space.
> 
> So the intermediate step covers this aspect where device_type specific plumbing is not done.
> It's generic. A device may choose to omit such doorbells as well, as long as it knows it can resume.

all this is kind of vague ... should be in the spec.

> > >
> > > Do you know any device configuration space which is RW?
> > > For net and blk I recall it as RO?
> > 
> > For example, WCE. What's more important, the spec allows config space to be
> > RW, so even if there's no examples before, it doesn't mean we won't have a RW
> > in the future.
> > 
> Ok.
> 
> > >
> > > > Let me ask differently, similar to FLR, what happens if the driver
> > > > wants a virtio reset but the hypervisor wants to stop or freeze?
> > > >
> > > The device would respond to stop/freeze request when it has internally
> > > started the reset, as device is the single synchronization point which knows how
> > > to handle both in parallel.
> > 
> > Let's define the synchronization point first. And it demonstrates at least devices
> > need to synchronize between the freeze/stop and virtio device status machine
> > which is not as easy as what is done in this patch.
> >
> Synchronization point = device.

Then we need to spec device behaviour.
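
To illustrate the kind of behaviour that would need spelling out, here is a minimal sketch (Python, not a proposal for spec text) of "the device is the single synchronization point": FLR, mode changes and device context reads all go through one queue, so a context read returns either the state before the FLR or the state after it, never a mix. The event names, the context contents and the assumption that FLR leaves the migration mode untouched are all hypothetical.

```python
from collections import deque
from enum import IntEnum


class Mode(IntEnum):
    ACTIVE = 0x0
    STOP = 0x1
    FREEZE = 0x2


class MemberDeviceModel:
    """Toy model: one queue serializes FLR and admin commands."""

    def __init__(self):
        self.mode = Mode.ACTIVE
        self.context = {"device_status": 0x0F}  # hypothetical context contents
        self.pending = deque()
        self.reads = []  # snapshots returned to the owner driver

    def post(self, event, arg=None):
        """Queue an event; the guest and the owner driver may both post."""
        self.pending.append((event, arg))

    def run(self):
        """Process events strictly in arrival order."""
        while self.pending:
            event, arg = self.pending.popleft()
            if event == "flr":
                # Assumption: FLR updates the context, not the migration mode.
                self.context["device_status"] = 0
            elif event == "set_mode":
                self.mode = arg
            elif event == "read_context":
                self.reads.append(dict(self.context))
```

With this ordering rule an owner driver that posts a freeze after a guest FLR reads the post-FLR context, and one that reads before the FLR gets the pre-FLR context. Whether a real device is required to behave this way is exactly what the spec would have to say.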

> > >
> > > > > We would enrich the device context for this, but there is no need to
> > > > > connect the admin mode controlled by the owner device with the
> > > > > operational state (device_status) owned by the member device.
> > > > >
> > > > > > > + it ignores any device configuration space writes,
> > > > > >
> > > > > > How about read and the device configuration changes?
> > > > > >
> > > > > As listed, device do not have any changes.
> > > > > So device configuration change cannot occur.
> > > >
> > > > It's not necessarily caused by config write, it could be things like
> > > > link status or geometry changes that are initiated from the device.
> > > >
> > > I understand it. Link status was one example, you listed other examples too.
> > > The point is, when in freeze mode, the member device is frozen, hence,
> > > device won't initiate those changes.
> > >
> > > > >
> > > > > The device requirements cover this content more explicitly:
> > > > >
> > > > > For the SR-IOV group type, regardless of the member device mode,
> > > > > all the PCI transport level registers MUST be always accessible
> > > > > and the member device MUST function the same way for all the PCI
> > > > > transport level registers regardless of the member device mode.
> > > > >
> > > > > > > + the device do not have any changes in the device context.
> > > > > > > + The member device is not accessed in the system through the
> > > > > > > + virtio
> > > > interface.
> > > > > > > + \\
> > > > > >
> > > > > > But accessible via PCI interface?
> > > > > >
> > > > > Yes, as usual.
> > > > >
> > > > > > For example, what happens if we want to freeze during FLR? Does
> > > > > > the hypervisor need to wait for the FLR to be completed?
> > > > > >
> > > > > The hypervisor does not need to wait for the FLR to be completed.
> > > >
> > > > So does FLR change device context?
> > > Yes.
> > 
> > So this implies the freeze needs to wait for FLR otherwise device context may
> > change.
> > 
> Device context can change anytime and reflect what is latest.
> I will update the patches to reflect that device is the single synchronization point serving flr, mode changes.
> 
> > >
> > > >
> > > > >
> > > > > > > +\hline
> > > > > > > +\hline
> > > > > > > +0x03-0xFF   & -    & reserved for future use \\
> > > > > > > +\hline
> > > > > > > +\end{tabularx}
> > > > > > > +
> > > > > > > +When the owner driver wants to stop the operation of the
> > > > > > > +device, the owner driver sets the device mode to
> > > > > > > +\field{Stop}. Once the device is in the \field{Stop} mode,
> > > > > > > +the device does not initiate any notifications or does not
> > > > > > > > +access any driver memory. Since the member driver may be
> > > > > > > > +still active which may send further driver notifications to the device,
> > > > > > > > +the device context may be updated.
> > > > > > > > +When the member driver has stopped accessing the device, the
> > > > > > > > +owner driver sets the device to \field{Freeze} mode
> > > > > > > > +indicating to the device that no more driver access occurs.
> > > > > > > > +In the \field{Freeze} mode, no more changes occur in the device
> > > > > > > > +context.
> > > > > > > > +At this point, the device ensures that
> > > > > > > > +there will not be any update to the device context.
> > > > > >
> > > > > > What is missed here are:
> > > > > >
> > > > > > 1) it is a virtio specific states or not
> > > > > It is not.
> > > > >
> > > > > > 2) if it is a virtio specific state, if or how to synchronize
> > > > > > with transport specific interfaces and why
> > > > > > 3) can active go directly to freeze and why
> > > > > >
> > > > > > Yes. I don’t see a reason to not allow it.
> > > > > > The active to freeze mode change is useful on the destination
> > > > > > side, where the destination hypervisor knows for sure that there
> > > > > > is no other entity accessing the device.
> > > > > > And it needs to set up the device context it received from the source side.
> > > > > > So setting freeze mode can be done directly.
> > > > >
> > > > > > > +
> > > > > > > +The member device has a device context which the owner driver
> > > > > > > +can either read or write. The member device context consist
> > > > > > > +of any device specific data which is needed by the device to
> > > > > > > +resume its operation when the device mode
> > > > > >
> > > > > > This is too vague. There're states that are not suitable for
> > > > > > cmd/queue for
> > > > sure.
> > > > > > I'd split it into
> > > > > >
> > > > > > 1) common states: virtqueue, dirty pages
> > > > > > 2) device specific states: defined be each device
> > > > > >
> > > > > This is theory of operation section. So it capturing such details.
> > > > > Actual device context definition is outside of theory, and precise
> > > > > states of virtqueue, device specific, etc are in it.
> > > >
> > > > See my comment above regarding to the device context.
> > > >
> > > I replied above, device context link is added in the patch-3 in the theory of
> > > operation.
> > > So reader gets the complete view.
> > >
> > > > >
> > > > > > > +is changed from \field{Stop} to \field{Active} or from
> > > > > > > +\field{Freeze} to \field{Active}.
> > > > > > > +
> > > > > > > +Once the device context is read, it is cleared from the device.
> > > > > >
> > > > > > This is horrible, it means we can't easily
> > > > > >
> > > > > > 1) re-try the migration
> > > > > > 2) recover from migration failure
> > > > > >
> > > > > Can you please explain the flow?
> > > >
> > > > When migration fails, management can choose to resume the device(VM)
> > > > on the source.
> > > >
> > > ok. This should be possible as the management which has the device
> > > context, it can restore it on the source and move the device mode to active.
> > >
> > > > If the state were cleared, it means there's not simple way to resume
> > > > the device but restoring the whole context.
> > > >
> > > Yes, as you say, restoring the whole context will suffice for this corner/rare
> > > case scenario.
> > >
> > > > What's the consideration for such clearing?
> > > >
> > > There are two considerations.
> > > 1.  If one does not clear, till how long should it be kept on the device?
> > 
> > Until virtio reset, this is how virtio works now. I've pointed out that it may cause
> > extra troubles when trying to resume, but you don't tell me what's wrong to
> > keep that?
> > 
> If kept, the hypervisor may not be able to decide when to change the mode from active->stop.
> We can opt for a mode where the full device context is read in each mode without clearing it.
> But then it can be very specific to a version of qemu, which we are avoiding here.
> 
> > > 2. device context returns incremental value from the previous read. So, it
> > > needs to clear it.
> > 
> > I don't understand here. This is not the case for most of the devices.
> >
> Not sure which devices you mean here with "most of the devices".
> The device context functions like the write record of pages (aka dirty pages).
> Whatever is already returned is not, and should not be, repeated in subsequent reads, though the device can choose to do so.
>  
> > >
> > > > > And which software stack may find this useful?
> > > > > Is there any existing software that can utilize it?
> > > >
> > > > Libvirt.
> > > >
> > > Does libvirt restore on migration failure?
> > 
> > Yes.
> > 
> Ok. the device will be able to resume when it is marked active.
> The device context returned  is the incremental delta as explained above.
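
A minimal sketch (Python) of the read-and-clear semantics being described, under the assumption that the device context can be modelled as named fields: each read returns only what changed since the previous read, and the software side rebuilds the full context by applying the deltas in order. The field names and helper names are hypothetical.

```python
class DeviceContextModel:
    """Toy model of a device context that is cleared once read."""

    def __init__(self):
        self._fields = {}
        self._changed = set()  # fields modified since the last read

    def update(self, name, value):
        """Device-side change, e.g. caused by a driver notification."""
        self._fields[name] = value
        self._changed.add(name)

    def read(self):
        """Return the delta since the previous read, then clear the record."""
        delta = {name: self._fields[name] for name in sorted(self._changed)}
        self._changed.clear()
        return delta


def rebuild(deltas):
    """Owner-driver side: later deltas override earlier ones."""
    full = {}
    for delta in deltas:
        full.update(delta)
    return full
```

This also shows the cost raised earlier in the thread: after a failed migration the device no longer holds a readable full copy, so the management software has to keep every delta it has read in order to restore the whole context on the source.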
> 
> > >
> > > > > Why, in your assumption, would the device context held by the
> > > > > software have vanished?
> > > > >
> > > > > > > Typically, on
> > > > > > > +the source hypervisor, the owner driver reads the device
> > > > > > > +context once when the device is in \field{Active} or
> > > > > > > +\field{Stop} mode and later once the member device is in
> > \field{Freeze} mode.
> > > > > >
> > > > > > Why need the read while device context could be changed? Or is
> > > > > > the dirty page part of the device context?
> > > > > >
> > > > > It is not part of the dirty page.
> > > > > It needs to read in the active/stop mode, so that it can be shared
> > > > > with
> > > > destination hypervisor, which will pre-setup the complex context of
> > > > the device, while it is still running on the source side.
> > > >
> > > > Is such a method used by any hypervisor?
> > > Yes. qemu which uses vfio interface uses it.
> > 
> > Ok, such software technology could be used for all types of devices, I don't see
> > any advantages to mention it here unless it's unique to virtio.
> > 
> It is theory of operation that brings the clarity and rationale.
> So I will keep it.
> 
> > >
> > > >
> > > > >
> > > > > > > +
> > > > > > > +Typically, the device context is read and written one time on
> > > > > > > +the source and the destination hypervisor respectively once
> > > > > > > +the device is in \field{Freeze} mode. On the destination
> > > > > > > +hypervisor, after writing the device context, when the device
> > > > > > > +mode set to \field{Active}, the device uses the most recently
> > > > > > > +set device context and resumes the device
> > > > > > operation.
> > > > > >
> > > > > > There's no context sequence, so this is obvious. It's the
> > > > > > semantic of all other existing interfaces.
> > > > > >
> > > > > > Can you please clarify which existing interfaces you mean here?
> > > >
> > > > For any common cfg member. E.g queue_addr.
> > > >
> > > > The driver wrote 100 different values to queue_addr and the device
> > > > used the value written last time.
> > > >
> > > o.k. I don’t see any problem in stating what is done, which is less
> > > vague. 😊
> > >
> > > > >
> > > > > > > +
> > > > > > > +In an alternative flow, on the source hypervisor the owner
> > > > > > > +driver may choose to read the device context first time while
> > > > > > > +the device is in \field{Active} mode and second time once the
> > > > > > > +device is in \field{Freeze}
> > > > > > mode.
> > > > > >
> > > > > > Who is going to synchronize the device context with possible
> > > > > > configuration from the driver?
> > > > > >
> > > > > Not sure I understand the question.
> > > > > If I understand you right, do you mean that, When configuration
> > > > > change is done by the guest driver, how does device context change?
> > > > >
> > > >
> > > > Yes.
> > > >
> > > > > If so, device context reading will reflect the new configuration.
> > > >
> > > > How do you do that? For example:
> > > >
> > > > static inline void vp_iowrite64_twopart(u64 val,
> > > >                                         __le32 __iomem *lo,
> > > >                                         __le32 __iomem *hi) {
> > > >         vp_iowrite32((u32)val, lo);
> > > >         vp_iowrite32(val >> 32, hi); }
> > > >
> > > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > > >
> > > Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> > section captures the partial value.
> > 
> > There's no way for the device to know whether or not it's a partial value or not.
> > No?
> > 
> Device does not need to know, because when the guest vm and the device are resumed on the destination, the guest vm will continue with writing the 2nd part.
> 
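The claim in the quoted reply above (a freeze between the two 32-bit writes is harmless because the half-written value travels with the device context) can be modelled in a few lines. This is a sketch under assumptions: `struct cfg_model` and the helper functions are hypothetical, the field is assumed to start at zero, and the save/restore helpers simply copy the registers verbatim rather than following the VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit common cfg field that is exposed as
 * two 32-bit registers, for example a queue address. */
struct cfg_model {
    uint32_t lo;
    uint32_t hi;
};

/* The two halves of a split 64-bit write, as in vp_iowrite64_twopart(). */
static void write_lo(struct cfg_model *c, uint64_t val)
{
    c->lo = (uint32_t)val;
}

static void write_hi(struct cfg_model *c, uint64_t val)
{
    c->hi = (uint32_t)(val >> 32);
}

/* Saving and restoring copy the registers as they are; neither side
 * tries to decide whether the 64-bit value is complete. */
static struct cfg_model ctx_save(const struct cfg_model *src)
{
    return *src;
}

static void ctx_restore(struct cfg_model *dst, const struct cfg_model *saved)
{
    *dst = *saved;
}

static uint64_t cfg_value(const struct cfg_model *c)
{
    return ((uint64_t)c->hi << 32) | c->lo;
}
```

If the source is frozen after `write_lo()`, the saved context holds only the low half; once it is restored on the destination and the guest resumes with `write_hi()`, the field holds the full value. The model does not address whether a device may act on the half-written value in the meantime, which is the concern raised in the question.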
> > >
> > > > >
> > > > > > > Similarly, on the
> > > > > > > +destination hypervisor writes the device context first time
> > > > > > > +while the device is still running in \field{Active} mode on
> > > > > > > +the source hypervisor and writes the device context second
> > > > > > > +time while the device is in
> > > > > > \field{Freeze} mode.
> > > > > > > +This flow may result in very short setup time as the device
> > > > > > > +context likely have minimal changes from the previously
> > > > > > > +written device
> > > > context.
> > > > > >
> > > > > > Is the hypervisor who is in charge of doing the comparison and
> > > > > > writing only the delta?
> > > > > >
> > > > > The spec commands allow to do so. So possibility exists from spec wise.
> > > >
> > > > There are various optimizations for migration for sure, I don't
> > > > think mentioning any specific one is good.
> > > >
> > > The text is informative text similar to,
> > >
> > > " However, some devices benefit from the ability to find out the
> > > amount of available data in the queue without accessing the virtqueue in
> > memory"
> > >
> > > " To help with these optimizations, when VIRTIO_F_NOTIFICATION_DATA has
> > been negotiated".
> > >
> > > Is this the only optimization in virtio? No, but we still mention the rationale of
> > why it exists.
> > 
> > The above is a good example as it explain VIRTIO_F_NOTIFICATION_DATA is the
> > only way without accessing the virtqueue. But this is not the case of migration.
> > You said it's just a possibility but not a must which is not the case for
> > VIRTIO_F_NOTIFICATION_DATA.
> > 
> It is one of the optimization apart. The comparison is of one_of_example or not.
> 



This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-11 10:54               ` Parav Pandit
  2023-10-11 19:54                 ` Michael S. Tsirkin
@ 2023-10-12 10:00                 ` Zhu, Lingshan
  2023-10-12 10:06                   ` Michael S. Tsirkin
  2023-10-12 10:09                   ` Parav Pandit
  1 sibling, 2 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-12 10:00 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/11/2023 6:54 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Wednesday, October 11, 2023 3:38 PM
>>
>>>> The system admin can choose only passthrough some of the devices for
>>>> nested guests, so passthrough the PF to L1 guest is not a good idea,
>>>> because there can be many devices still work for the host or L1.
>>> Possible. One size does not fit all.
>>> What I expressed is most common scenarios that user care about.
>> don't block existing usecases, don't break the userspace, nested is common.
> Nothing is broken as virtio spec do not have any single construct to support migration.
> If nested is common, can you share the performance number with real virtio device with/without 2 level nesting?
> I frankly don’t know how they look like.
virtio devices support nested; I mean, don't break this usecase.
And end users accept the performance overhead in nested; this is not related
to this topic.

>
>>>>> In second use case, where one want to bind only one member device to
>>>>> one VM, I think same plumbing can be extended to have another VF, to
>>>>> take
>>>> the role of migration device instead of owner device.
>>>>> I don’t see a good way to passthrough and also do in-band migration
>>>>> without
>>>> lot of device specific trap and emulation.
>>>>> I also don’t know the cpu performance numbers with 3 levels of
>>>>> nested page
>>>> table translation which to my understanding cannot be accelerated by
>>>> the current cpu.
>>>> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA and so
>> on,
>>>> there can be performance overhead, but can be done.
>>>>
>>>> So admin vq migration still don't work for nested, this is surely a blocker.
>>> In specific case of member devices are located at different nest level, it does
>> not.
>> so you got the point, so this series should not be merged.
>>> Why prevents you have a peer VF do the role of migration driver?
>>> Basically, what I am proposing is, connect two VFs to the L1 guest. One VF is
>> migration driver, one VF is passthrough to L2 guest.
>>> And same scheme works.
>> A peer VF? A management VF? still break the existing usecase. and how do you
>> transfer ownership of L2 VF from PF to L1 VF?
> A peer management VF which services admin command (like PF).
> Ownership of admin command is delegated to the management VF.
Interesting, do you plan to cook a patch implementing this?
Does it really make sense?

How do you transfer the ownership?
How do you maintain a different group?
How do you isolate the groups?
How do you keep the guest or host secure?
How do you manage the overlaps?
How do you implement the hardware that supports that?
How do you change the PCI routing?
>
>>> On the other hand,
>>> Many parts of the cpu subsystem such as PML, page tables do not have N
>> level nesting support either.
>> page tables could be emulated, as showed to you before, just PA to VA, nested
>> PA to nested VA
>>> They all work on top of emulation and pay the price for emulation when
>> nesting is done.
>>> May be that is the first version for virtio too.
>> there are performance overhead, but can be done.
>>> I frankly feel that nesting support requires industry level eco system support
>> not just in virtio.
>>> Virtio attempting to focus on nested and having nearly same level
>> performance as bare metal seems farfetched.
>>> Maybe I am wrong, as we have not seen such high perf nested env even with
>> sw based device.
>>> What can be possibly done is,
>>> 1. What admin commands are useful from this series that can be useful for
>> nesting?
>>> 2. What admin commands from current series needs extension for nesting?
>>> 3. What admin commands do not work at all for nesting, and hence, need to
>> have new commands.
>>> If we can focus on those, maybe we can find common approach that cater to
>> both commands.
>> virtio support nested now, dont let your admin vq LM break this.
> New spec addition is not breaking existing virtio implementation in sw.
don't break nested, again.
> New spec additions of owner and member devices do not apply to non member and non owner devices.
if so, no member no owner, so no admin vq? then this proposal doesn't 
make any sense?
>
>>>>> Do you know how does it work for Intel x86_64?
>>>>> Can it do > 2 level of nested page tables? If no, what is the perf
>>>>> characteristics
>>>> to expect?
>>>> of course that can be done, Page table is not a problem, there are
>>>> soft mmu emulation and viommu, through performance overhead.
>>> Due to the performance overheads, I really doubt any cloud operator would
>> use passthrough virtio device for any sensible workload.
>>> But you may know already how nested performance looks like that may be
>> acceptable to users.
>> Many tenants run their nested cluster. Don't break this.
> How new spec addition such as crypto device addition broke net device?
> Or how net vq interrupt moderation breaks existing sw?
> It does not.
> They are driven through their own feature bits and admin command capabilities.
> It does not break any existing deployments.
we are talking about nested, don't break nested




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 10:00                 ` Zhu, Lingshan
@ 2023-10-12 10:06                   ` Michael S. Tsirkin
  2023-10-12 10:13                     ` Parav Pandit
  2023-10-12 10:52                     ` Zhu, Lingshan
  2023-10-12 10:09                   ` Parav Pandit
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-12 10:06 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 12, 2023 at 06:00:09PM +0800, Zhu, Lingshan wrote:
> we are talking about nested, don't break nested

Yes, and I agree a "simulated VF" is a kind of hack
at least for now. What I could see instead is
one or both of
1 another way to submit admin commands without DMA
2 a way to have a "dummy" PF that only does admin
  commands without anything device specific




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 10:00                 ` Zhu, Lingshan
  2023-10-12 10:06                   ` Michael S. Tsirkin
@ 2023-10-12 10:09                   ` Parav Pandit
  2023-10-12 10:45                     ` Michael S. Tsirkin
  2023-10-12 11:10                     ` Zhu, Lingshan
  1 sibling, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-12 10:09 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, October 12, 2023 3:30 PM
> 
> On 10/11/2023 6:54 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Wednesday, October 11, 2023 3:38 PM
> >>
> >>>> The system admin can choose only passthrough some of the devices
> >>>> for nested guests, so passthrough the PF to L1 guest is not a good
> >>>> idea, because there can be many devices still work for the host or L1.
> >>> Possible. One size does not fit all.
> >>> What I expressed is most common scenarios that user care about.
> >> don't block existing usecases, don't break the userspace, nested is common.
> > Nothing is broken as virtio spec do not have any single construct to support
> migration.
> > If nested is common, can you share the performance number with real virtio
> device with/without 2 level nesting?
> > I frankly don’t know how they look like.
> virtio devices support nested, I mean don't break this usecase And end user
> accept performance overhead in nested, this is not related to this topic.
> 
Can you show an example where virtio device nesting and live migration are already supported and the device has _done_ the live migration,
because of which you claim that the new feature of admin command-based owner and member devices breaks something?

Please don’t use the verb "break".
Your proposal is the first of its kind that supports migrating a nested device.
This is why new patches for config registers or admin commands do not break anything existing.

> >
> >>>>> In second use case, where one want to bind only one member device
> >>>>> to one VM, I think same plumbing can be extended to have another
> >>>>> VF, to take
> >>>> the role of migration device instead of owner device.
> >>>>> I don’t see a good way to passthrough and also do in-band
> >>>>> migration without
> >>>> lot of device specific trap and emulation.
> >>>>> I also don’t know the cpu performance numbers with 3 levels of
> >>>>> nested page
> >>>> table translation which to my understanding cannot be accelerated
> >>>> by the current cpu.
> >>>> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA and
> so
> >> on,
> >>>> there can be performance overhead, but can be done.
> >>>>
> >>>> So admin vq migration still don't work for nested, this is surely a blocker.
> >>> In specific case of member devices are located at different nest
> >>> level, it does
> >> not.
> >> so you got the point, so this series should not be merged.
> >>> Why prevents you have a peer VF do the role of migration driver?
> >>> Basically, what I am proposing is, connect two VFs to the L1 guest.
> >>> One VF is
> >> migration driver, one VF is passthrough to L2 guest.
> >>> And same scheme works.
> >> A peer VF? A management VF? still break the existing usecase. and how
> >> do you transfer ownership of L2 VF from PF to L1 VF?
> > A peer management VF which services admin command (like PF).
> > Ownership of admin command is delegated to the management VF.
> interesting, do you plan to cook a patch implementing this?
No. I am hoping that you can help to draft those patches for the nested case to work when one wants to hand off a single VM to a single nested guest VM.
I will not be able to test any of the nested things and show their performance value either, as I don’t see how the rest of the ecosystem can match up for nested.
Hence, your expertise in drafting an extension for nested is desired.

> Really make sense?
> 
> How do you transfer the ownership?
An additional ownership delegation by a new admin command.
> How do you maintain a different group?
One to one assignment.
> How do you isolate the groups?
Not sure what it means. The explicit group is created and VFs are placed in this group.
> How do you keep the guest or host secure?
Please be specific. It is a very broad question when it comes to defining the interface.
> How do you manage the overlaps?
Overlaps between?
> How do you implement the hardware that supports that?
Please consult your board designers. Hard to say how to implement something in general.
> How do you change the PCI routing?
Why would anything need to change in PCI routing?

> > It does not break any existing deployments.
> we are talking about nested, don't break nested
Virtio spec for nested is not defined yet. Hence nothing is broken. Please avoid using the verb, _break_.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 10:06                   ` Michael S. Tsirkin
@ 2023-10-12 10:13                     ` Parav Pandit
  2023-10-12 10:52                     ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-12 10:13 UTC (permalink / raw)
  To: Michael S. Tsirkin, Zhu, Lingshan
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 12, 2023 3:36 PM
> 
> On Thu, Oct 12, 2023 at 06:00:09PM +0800, Zhu, Lingshan wrote:
> > we are talking about nested, don't break nested
> 
> Yes, and I agree a "simulated VF" is a kind of hack at least for now. What I could
> see instead is one or both of
> 1 another way to submit admin commands without DMA
> 2 a way to have a "dummy" PF that only does admin
>   commands without anything device specific

Yes. Both methods make sense to me.
But I think, it will still do the DMA at the end for device context, dirty page addresses.
I was imagining, AQ on the VF itself to do some of these commands, that way both methods can utilize common framework.

Another VF exposed as the dummy PF should work fine maintaining most semantics.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-11 11:43                 ` Parav Pandit
@ 2023-10-12 10:21                   ` Zhu, Lingshan
  2023-10-12 10:58                     ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-12 10:21 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/11/2023 7:43 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Wednesday, October 11, 2023 3:55 PM
>>>>>>> I don’t have any strong opinion to keep it or remove it as most
>>>>>>> stakeholders
>>>>>> has the clear view of requirements now.
>>>>>>> Let me know.
>>>>>> So some people use VFs with VFIO. Hence the module name.  This
>>>>>> sentence by itself seems to have zero value for the spec. Just drop it.
>>>>> Ok. Will drop.
>>>> So why not build your admin vq live migration on our config space
>>>> solution, get out of the troubles, to make your life easier?
>>>>
>>> Your this question is completely unrelated to this reply or you misunderstood
>> what dropping commit log means.
>> if you can rebase admin vq LM on our basic facilities, I think you dont need to
>> talk about vfio in the first place, so I ask you to re-consider Jason's proposal.
> I don’t really know why you are upset with the vfio term.
> It is the use case of the cloud operator and it is listed to indicate how proposal fits in a such use case.
> If for some reason, you don’t like vfio, fine. Ignore it and move on.
>
> I already answered that I will remove from the commit log, because the requirements are well understood now by the committee.
>
> Your comment is again unrelated (repeated) to your past two questions.
>
> I explained you the technical problem that admin command (not admin VQ) of basic facilities cannot be done using config registers without any mediation layer.
OK, I brought up Jason's proposal to make everything easier, and I see it is 
refused.
>
>>> Dropping link to vfio does not drop the requirement.
>>> I am ok to drop because requirements are clear of passthrough of member
>> device.
>>> Vfio is not a trouble at all.
>>> Admin command is not a trouble either.
>>>
>>> The pure technical reason is: all the functionalities proposed cannot be done
>> in any other existing way.
>>> Why? For below reasons.
>>> 1. device context, and write records (aka dirty page addresses) is
>>> huge which cannot be shared using config registers at scale of 4000
>>> member devices
>> dirty page tracking will be implemented in V2, actually I have the patch right
>> now.
> That is yet again the invitation to non_collaboration mode.
> Without reviewing, v0 and v1, you want to show dirty page tracking in some other way.
>
> But ok, that is your non_cooperative mode of working. Cannot help further.
I believe both Jason and I have proposed a solution, and I see it is rejected.
But don't take it personally and please stay professional.
>
>> inflight descriptor tracking will be implemented by Eugenio in V2.
> When we have near complete proposal from two device vendors, you want to push something to unknown future without reviewing the work;
> does not make sense.
Didn't I ever provide feedback to you? Really?
>
> You are still in the mode of _take_ what we did with near zero explanation.
> You asked question of why passthrough proposal cannot advantage of in_band config registers.
> I explained technical reason listed here.
I have answered the questions, and asked questions, many times.
What do you mean by "why passthrough proposal cannot advantage of 
in_band config registers."?
Config space works for passthrough for sure.
>
> So please don’t jump to conclusions before finishing the discussion on how both side can take advantage of each other.
>
> Lets please do that.
We have proposed a solution, right?

I still need to point out: admin vq LM does not work, one example is nested.
>
>> There are no scale problem as I repeated for many time, they are per-device
>> basic facilities, just migrate the VF by its own facility, so there are no 40000
>> member devices, this is not per PF.
>>
> I explained that device reset, flr etc flow cannot work when controlling and controlled functions are single entity for passthrough mode.
> The scale problem is, one needs to duplicate the registers on each VF.
> The industry is moving away from the register interface in many _real_ hw devices implementation.
> Some of the examples are IMS, SIOV, NVMe and more.
We have discussed this many times; please refer to previous threads, 
even with Jason.
>
>> The device context can be read from config space or trapped, like shadow
> There are 1 million flows of the net device flow filters in progress.
> Each flow is 64B in size.
> Total size is 64MB.
> I don’t see how one can read such amount of memory using config registers.
control vq?
Or do you want to migrate non-virtio context?
That is out of spec
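For scale, the arithmetic in the quoted example (1 million flows of 64 bytes each) can be checked with a short sketch. The helper names are hypothetical, and the 4-byte register width used in the usage note is an assumption made only to show the order of magnitude; the thread does not specify a register width.

```c
#include <assert.h>
#include <stdint.h>

/* Total size of a context made of fixed-size entries. */
static uint64_t ctx_bytes(uint64_t flows, uint64_t bytes_per_flow)
{
    return flows * bytes_per_flow;
}

/* Number of register reads needed to transfer total_bytes through a
 * register of the given width, rounding up. */
static uint64_t reg_reads(uint64_t total_bytes, uint64_t reg_width_bytes)
{
    assert(reg_width_bytes != 0);
    return (total_bytes + reg_width_bytes - 1) / reg_width_bytes;
}
```

With these helpers, `ctx_bytes(1000000, 64)` gives 64,000,000 bytes (64 MB in decimal units), and `reg_reads(64000000, 4)` gives 16,000,000 reads through an assumed 32-bit register for a single device.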
>
>> control vq which is already done, that is basic virtualization.
> There is nothing like "basic virtualization".
> What is proposed here is fulfilling the requirement of passthrough mode.
>
> Your comment is implying, "I don’t care for passthrough requirements, do non_passthrough".
That is your understanding, and you misunderstood it. Config space 
has served passthrough for many years.
>
> The discussion should be,
> How can we leverage common framework for passthrough and mediated mode?
> Can we? If so, which are the pieces?
config space is a common framework, right?
>
> For me it is frankly very weird to take native virtio member device, convert into a mediated device using a giant software, and after that convolution get virtio device.
> But for nested case you have the use case.
> So if we focus positively on how two use cases can use some common functionality, that will be great.
Why does config space need a giant sw to work?

So both Jason and I suggest you build admin vq solution based on our 
basic facilities.
>
>> If you want to migrate device context, you need to specify device context for
>> every type of device, net maybe easy, how do you see virtio-fs?
> Virtio-fs will have its own device context too.
> Every device has some sort of backend in varied degree.
> Net being widely used and moderate complex device.
> Fs being slightly stateful but less complex than net, as it has far less control operations.
So, are you saying you have implemented a live migration solution which can 
migrate device context,
but only works for net or block?

Then you should call it virtio net/blk migration and implement it in the 
net/block section.
> In fact virtio-fs device already discusses the migrating the device side state, as listed in device context.
> So virtio-fs device will have its own device-context defined.
if you want to migrate it, you need to define it
>
> The infrastructure and basic facilities are setup in this series, that one can easily extend for all the current and new device types.
really? how?
>
>> And we are migrating stateless devices, or no? How do you migrate virtio-fs?
>>> 2. sharing such large context and write addresses in parallel for
>>> multiple devices cannot be done using single register file
>> see above
>>> 3. These registers cannot be residing in the VF because VF can undergo
>>> FLR, and device reset which must clear these registers
>> do you mean you want to audit all PCI features? When FLR, the device is reset,
>> do you expect a device remember anything after FLR?
> Not at all. VF member device will not remember anything after FLR.
>> Do you want to trap FLR? Why?
> This proposal does _not_ want to trap the FLR in the hypervisor virtio driver.
>
> When one does the mediation-based design, it must trap/emulate/fake the FLR.
> It helps to address the case of nested as you mentioned.
once passthrough, the guest driver can access the config space to reset 
the device, right?
>> Why FLR block or conflict with live migration?
> It does not block or conflict.
OK, cool, so let's make this a conclusion
>
> The whole point is, when you put live migration functionality on the VF itself, you just cannot FLR this device.
> One must trap the FLR and do fake FLR and build the whole infrastructure to not FLR The device.
> Above is not passthrough device.
No, the guest can reset the device, even causing a failed live migration.
>
>>> 4. When VF does the DMA, all dma occurs in the guest address space, not in
>> hypervisor space; any flr and device reset must stop such dma.
>>> And device reset and flr are controlled by the guest (not mediated by
>> hypervisor).
>> if the guest reset the device, it is totally reasonable operation, and the guest
>> own the risk, right?
> Sure, but the guest still expects its dirty pages and device context to be migrated across device_reset.
> Device_reset will lose all this information within the device if done without mediation and special care.
No, if the guest resets a device, that means the device should be RESET, 
to forget its config.
It would be really weird to migrate a fresh device at the source side 
to be a running device at the destination side.
>
> So, to avoid that now one needs to have fake device reset too and build that infrastructure to not reset.
>
> The passthrough proposal fundamental concept is:
>
> all the native virtio functionalities are between guest driver and the actual device.
see above.
>
>> and still, do you want to audit every PCI features? at least you didn't do that in
>> your series.
> Can you please list which PCI features audit you are talking about?
You audit FLR; do you then want to check every one?
If not, how do you decide which ones should be audited, and why not the others?
>
> Keep in mind, that with all the mediation, one now must equally audit all this giant software stack too.
> So maybe it is fine for those who are ok with it.
so you agree FLR is not a problem, at least for config space solution?
>
>> For migration, you know the hypervisor takes the ownership of the device in the
>> stop_window.
> I do not know what stop_window means.
> Do you mean stop_copy of vfio or it is qemu term?
when guest freeze.
>
>>> 5. Any PASID to separate out admin vq on the VF does not work for two
>> reasons.
>>> R_1: device flr and device reset must stop all the dmas.
>>> R_2: PASID by most leading vendors is still not mature enough
>>> R_3: One also needs to do inversion to not expose PASID capability of
>>> the member PCI device to not expose
>> see above and what if guest shutdown? the same answer, right?
> Not sure, I follow.
> If the guest shutdown, the guest specific shutdown APIs are called.
>
> With passthrough device, R_1 just works as is.
> R_3 is not needed as they are directly given to the guest.
> R_2 platform dependency is not needed either.
I think we already have a concussion for FLR.
For PASID, what blocks the solution?
>
>>>> Actually you don't see any technical problems in our config space
>>>> proposal, right?
>>> In config registers method, for passthrough I clearly see the technical
>> problems (functional and scale) listed above.
>>> Due to which config registers cannot reside on the VF and cannot scale either.
>> so see above answers.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-11 20:14               ` Michael S. Tsirkin
@ 2023-10-12 10:21                 ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-12 10:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 12, 2023 1:44 AM
> 
> On Wed, Oct 11, 2023 at 10:47:23AM +0000, Parav Pandit wrote:

> > FLR updates the device context.
> > Device is serving the device context read write commands, serving FLR,
> > answering mode change command, So device knows the best how to avoid
> any race.
> 
> Heh well but if drivers depend on specific behaviour then we really need to
> document that in the spec.

I am working on v2 to enrich the spec language for the missing items, to address your and Jason’s comments.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-11 19:51             ` Michael S. Tsirkin
@ 2023-10-12 10:23               ` Zhu, Lingshan
  0 siblings, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-12 10:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/12/2023 3:51 AM, Michael S. Tsirkin wrote:
> On Tue, Oct 10, 2023 at 04:57:52PM +0800, Zhu, Lingshan wrote:
>>
>> On 10/10/2023 1:21 AM, Parav Pandit wrote:
>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>> Sent: Monday, October 9, 2023 9:50 PM
>>>>>>> One or more passthrough PCI VF devices are ubiquitous for virtual
>>>>>>> machines usage using generic kernel framework such as vfio [1].
>>>>>> Mentioning a specific subsystem in a specific OS may mislead the
>>>>>> user to think it can only work in that setup. Let's not do that,
>>>>>> virtio is not only used for Linux and VFIO.
>>>>> This is just one example on how these commands are useful.
>>>>> It can be useful in more ways too in more OSes too.
>>>>> I will drop from the patch commit log and keep as information purpose in
>>>> cover letter.
>>>>> Would that work for you?
>>>>>
>>>>> I don’t have any strong opinion to keep it or remove it as most stakeholders
>>>> has the clear view of requirements now.
>>>>> Let me know.
>>>> So some people use VFs with VFIO. Hence the module name.  This sentence by
>>>> itself seems to have zero value for the spec. Just drop it.
>>> Ok. Will drop.
>> So
> This is apropos what?
Maybe cut off, what is this referring to?
>
>> why not build your admin vq live migration on our config space solution,
>> get out of the troubles, to make your life easier?
>> Actually you don't see any technical problems in our config space proposal,
>> right?
> The status bit one? You enumerated some reasons yourself did you not?
> And I sent some right when you asked for help seeing more... or did it go right past?
Yes and I have replied in that thread.
>
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 10:09                   ` Parav Pandit
@ 2023-10-12 10:45                     ` Michael S. Tsirkin
  2023-10-12 11:23                       ` Parav Pandit
  2023-10-12 11:10                     ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-12 10:45 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 12, 2023 at 10:09:30AM +0000, Parav Pandit wrote:
> 
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Thursday, October 12, 2023 3:30 PM
> > 
> > On 10/11/2023 6:54 PM, Parav Pandit wrote:
> > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >> Sent: Wednesday, October 11, 2023 3:38 PM
> > >>
> > >>>> The system admin can choose only passthrough some of the devices
> > >>>> for nested guests, so passthrough the PF to L1 guest is not a good
> > >>>> idea, because there can be many devices still work for the host or L1.
> > >>> Possible. One size does not fit all.
> > >>> What I expressed is most common scenarios that user care about.
> > >> don't block existing usecases, don't break the userspace, nested is common.
> > > Nothing is broken as virtio spec do not have any single construct to support
> > migration.
> > > If nested is common, can you share the performance number with real virtio
> > device with/without 2 level nesting?
> > > I frankly don’t know how they look like.
> > virtio devices support nested, I mean don't break this usecase And end user
> > accept performance overhead in nested, this is not related to this topic.
> > 
> Can you show an example of virtio device nesting and live migration already supported where the device has _done_ the live migration.
> Due to which you claim that new feature of admin command-based owner and member device breaks something?
> 
> Please don’t use the verb "break".
> Your proposal is the first of its kind that supports migrating nested device.
> This is why new patches of config register or admin command does not break anything existing.

Wording aside, new features should support as wide a variety of configs
as possible, if some config is not supported there should be
a very good reason.

> > >
> > >>>>> In second use case, where one want to bind only one member device
> > >>>>> to one VM, I think same plumbing can be extended to have another
> > >>>>> VF, to take
> > >>>> the role of migration device instead of owner device.
> > >>>>> I don’t see a good way to passthrough and also do in-band
> > >>>>> migration without
> > >>>> lot of device specific trap and emulation.
> > >>>>> I also don’t know the cpu performance numbers with 3 levels of
> > >>>>> nested page
> > >>>> table translation which to my understanding cannot be accelerated
> > >>>> by the current cpu.
> > >>>> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA and
> > so
> > >> on,
> > >>>> there can be performance overhead, but can be done.
> > >>>>
> > >>>> So admin vq migration still don't work for nested, this is surely a blocker.
> > >>> In specific case of member devices are located at different nest
> > >>> level, it does
> > >> not.
> > >> so you got the point, so this series should not be merged.
> > >>> Why prevents you have a peer VF do the role of migration driver?
> > >>> Basically, what I am proposing is, connect two VFs to the L1 guest.
> > >>> One VF is
> > >> migration driver, one VF is passthrough to L2 guest.
> > >>> And same scheme works.
> > >> A peer VF? A management VF? still break the existing usecase. and how
> > >> do you transfer ownership of L2 VF from PF to L1 VF?
> > > A peer management VF which services admin command (like PF).
> > > Ownership of admin command is delegated to the management VF.
> > interesting, do you plan to cook a patch implementing this?
> No. I am hoping that you can help to draft those patches for the nested case to work when one wants to hand off a single VM to a single nested guest VM.
> I will not be able to test any of nested things and show its performance value either, as I don’t see how rest of the eco system can match up for the nested.
> Hence, your expertise in drafting extension for nested is desired.
> 
> > Really make sense?
> > 
> > How do you transfer the ownership?
> An additional ownership delegation by a new admin command.
> > How do you maintain a different group?
> One to one assignment.
> > How do you isolate the groups?
> Not sure, what it means. The explicit group is created and VFs are placed in this group.
> > How do you keep the guest or host secure?
> Please be specific. It’s a very broad question when it comes to defining the interface.
> > How do you manage the overlaps?
> Overlaps between?
> > How do you implement the hardware support that?
> Please consult your board designers. Hard to say how to implement something in generic.
> > How do you change the PCI routing?
> Why anything to be changed in PCI routing?
> 
> > > It does not break any existing deployments.
> > we are talking about nested, don't break nested
> Virtio spec for nested is not defined yet. Hence nothing is broken. Please avoid using the verb, _break_.

Well people are passing virtio devices through to nested guests.
Ideally such configs should, somehow, support nested hypervisors
migrating nested guests. Considering e.g. write tracking
needs decent performance for live migration to deserve the name,
I doubt pulling data across PCIe with synchronous MMIO
operations with no pipelining will work well enough.
At the same time, if the maintenance cost at spec level is
low and the feature is self-contained, then why not.
Which this one poking at existing registers with subtle semantics,
isn't.

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 10:06                   ` Michael S. Tsirkin
  2023-10-12 10:13                     ` Parav Pandit
@ 2023-10-12 10:52                     ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-12 10:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/12/2023 6:06 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 12, 2023 at 06:00:09PM +0800, Zhu, Lingshan wrote:
>> we are talking about nested, don't break nested
> Yes, and I agree a "simulated VF" is a kind of hack
> at least for now. What I could see instead is
> one or both of
> 1 another way to submit admin commands without DMA
> 2 a way to have a "dummy" PF that only does admin
>    commands without anything device specific
Is it overkill?
Can we re-consider Jason's proposal?

I should send V2 so that we can work on the patch
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-12 10:21                   ` Zhu, Lingshan
@ 2023-10-12 10:58                     ` Parav Pandit
  2023-10-12 11:17                       ` Michael S. Tsirkin
                                         ` (2 more replies)
  0 siblings, 3 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-12 10:58 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, October 12, 2023 3:51 PM
> 
> On 10/11/2023 7:43 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Wednesday, October 11, 2023 3:55 PM
> >>>>>>> I don’t have any strong opinion to keep it or remove it as most
> >>>>>>> stakeholders
> >>>>>> has the clear view of requirements now.
> >>>>>>> Let me know.
> >>>>>> So some people use VFs with VFIO. Hence the module name.  This
> >>>>>> sentence by itself seems to have zero value for the spec. Just drop it.
> >>>>> Ok. Will drop.
> >>>> So why not build your admin vq live migration on our config space
> >>>> solution, get out of the troubles, to make your life easier?
> >>>>
> >>> Your this question is completely unrelated to this reply or you
> >>> misunderstood
> >> what dropping commit log means.
> >> if you can rebase admin vq LM on our basic facilities, I think you
> >> dont need to talk about vfio in the first place, so I ask you to re-consider
> Jason's proposal.
> > I don’t really know why you are upset with the vfio term.
> > It is the use case of the cloud operator and it is listed to indicate how proposal
> fits in a such use case.
> > If for some reason, you don’t like vfio, fine. Ignore it and move on.
> >
> > I already answered that I will remove from the commit log, because the
> requirements are well understood now by the committee.
> >
> > Your comment is again unrelated (repeated) to your past two questions.
> >
> > I explained you the technical problem that admin command (not admin VQ)
> of basic facilities cannot be done using config registers without any mediation
> layer.
> OK, I pop-ed Jason's proposal to make everything easier, and I see it is refused.
Because it does not work for passthrough mode.

> >
> >>> Dropping link to vfio does not drop the requirement.
> >>> I am ok to drop because requirements are clear of passthrough of
> >>> member
> >> device.
> >>> Vfio is not a trouble at all.
> >>> Admin command is not a trouble either.
> >>>
> >>> The pure technical reason is: all the functionalities proposed
> >>> cannot be done
> >> in any other existing way.
> >>> Why? For below reasons.
> >>> 1. device context, and write records (aka dirty page addresses) is
> >>> huge which cannot be shared using config registers at scale of 4000
> >>> member devices
> >> dirty page tracking will be implemented in V2, actually I have the
> >> patch right now.
> > That is yet again the invitation to non_collaboration mode.
> > Without reviewing, v0 and v1, you want to show dirty page tracking in some
> other way.
> >
> > But ok, that is your non_cooperative mode of working. Cannot help further.
> I believe both me and Jason have proposed a solution, I see it is rejected.
> But don't take it personal and please keep professional.
Sure, as I explained, the config register method does not work for passthrough mode, and does not scale.

> >
> >> inflight descriptor tracking will be implemented by Eugenio in V2.
> > When we have near complete proposal from two device vendors, you want
> > to push something to unknown future without reviewing the work; does not
> make sense.
> Didn't I ever provide feedback to you? Really?
No. I didn’t see why you need to post a new patch for dirty page tracking, when it is already present in this series.
I would like to understand and review these aspects.
Same for the device context.

> >
> > You are still in the mode of _take_ what we did with near zero explanation.
> > You asked question of why passthrough proposal cannot advantage of in_band
> config registers.
> > I explained technical reason listed here.
> I have answered the questions, and asked questions for many times.
> What do you mean by "why passthrough proposal cannot advantage of in_band
> config registers."?
> Config space work for passthrough for sure.
Config space registers are passed through to the guest VM.
Hence the hypervisor messing with them, e.g. programming some address, would result in either a security issue
or broken functionality; to sustain the functionality, each nested layer needs one copy of these registers for each nest level.
So they must be trapped somehow.

Secondly I don’t see how one can read 1M flows using config registers.

> >
> > So please don’t jump to conclusions before finishing the discussion on how
> both side can take advantage of each other.
> >
> > Lets please do that.
> We have proposed a solution, right?
> 
Which one? To do something in future?
I don’t see a suggestion on how one can use device context and dirty page tracking for nested and passthrough uniformly.
I see a technical difficulty in making both work with uniform interface.

> I still need to point out: admin vq LM does not work, one example is nested.
As Michael said, please don’t confuse between admin commands and admin vq.

> >
> >> There are no scale problem as I repeated for many time, they are
> >> per-device basic facilities, just migrate the VF by its own facility,
> >> so there are no 40000 member devices, this is not per PF.
> >>
> > I explained that device reset, flr etc flow cannot work when controlling and
> controlled functions are single entity for passthrough mode.
> > The scale problem is, one needs to duplicate the registers on each VF.
> > The industry is moving away from the register interface in many _real_ hw
> devices implementation.
> > Some of the examples are IMS, SIOV, NVMe and more.
> we have discussed this for many times, please refer to previous threads, even
> with Jason.
I do not agree to adding any registers to the VF that are reset on device_reset and FLR,
as that does not work for passthrough mode.

> >
> >> The device context can be read from config space or trapped, like
> >> shadow
> > There are 1 million flows of the net device flow filters in progress.
> > Each flow is 64B in size.
> > Total size is 64MB.
> > I don’t see how one can read such amount of memory using config registers.
> control vq?
The control vq and flow filter vqs are owned by the guest driver, not the hypervisor.
So no, cvq cannot be used.
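
For scale, the arithmetic behind the 64MB figure above can be sketched as follows. The flow count and per-flow size are taken from the thread; the 4-byte register width is only an assumption for illustration, and the helper names are hypothetical.

```python
# Rough sizing of the flow-filter device context discussed above.
FLOWS = 1_000_000        # "1 million flows" (from the thread)
BYTES_PER_FLOW = 64      # "Each flow is 64B in size" (from the thread)
REG_WIDTH_BYTES = 4      # assumption: one 32-bit register read per access


def context_bytes(flows=FLOWS, bytes_per_flow=BYTES_PER_FLOW):
    """Total size of the flow state that has to be transferred."""
    return flows * bytes_per_flow


def register_reads(total_bytes, reg_width=REG_WIDTH_BYTES):
    """Number of register accesses needed to move total_bytes (rounded up)."""
    return -(-total_bytes // reg_width)


total = context_bytes()
print(f"context size: {total} bytes")
print(f"register reads at {REG_WIDTH_BYTES} bytes each: {register_reads(total)}")
```

Under these assumptions the flow state alone would need on the order of sixteen million individual register accesses per device, before counting any other device context.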

> Or do you want to migrate non-virtio context?
Everything is virtio device context.

> >
> >> control vq which is already done, that is basic virtualization.
> > There is nothing like "basic virtualization".
> > What is proposed here is fulfilling the requirement of passthrough mode.
> >
> > Your comment is implying, "I don’t care for passthrough requirements, do
> non_passthrough".
> that is your understanding, and you misunderstood it. Config space servers
> passthrough for many years.
"Config space servers" ?
I do not understand it, can you please explain what does that mean?

I do not see your suggestion on how one can implement passthrough member device when passthrough device does the dma and migration framework also need to do the dma.

> >
> > The discussion should be,
> > How can we leverage common framework for passthrough and mediated
> mode?
> > Can we? If so, which are the pieces?
> config space is a common framework, right?
> >
> > For me it is frankly very weird to take native virtio member device, convert
> into a mediated device using a giant software, and after that convolution get
> virtio device.
> > But for nested case you have the use case.
> > So if we focus positively on how two use cases can use some common
> functionality, that will be great.
> why config space need a giant sw to work?

You can count the number of lines of code for the existing and the remaining 30+ devices to see how much it takes.
Which is still missing some of the code for small downtime.
And compare it with passthrough driver code.

Regardless, I just don’t see how config registers work.

> 
> So both Jason and I suggest you build admin vq solution based on our basic
> facilities.
:)
That basic facility is missing dirty page tracking, P2P support, device context, FLR, device reset support.
Hence, it is unusable right now for a passthrough member device.
And the 6th problematic thing in it is that it does not scale with member devices.

> >
> >> If you want to migrate device context, you need to specify device
> >> context for every type of device, net maybe easy, how do you see virtio-fs?
> > Virtio-fs will have its own device context too.
> > Every device has some sort of backend in varied degree.
> > Net being widely used and moderate complex device.
> > Fs being slightly stateful but less complex than net, as it has far less control
> operations.
> so, do you say you have implement a live migration solution which can migrate
> device context, but only work for net or block?
I don’t think this question about implementation has any relevance.
Frankly feels like a court to me. :(
No. I didn’t say that.
We have implemented net, fs, block devices and the single framework proposed here can support all 3 and the remaining 28+.
The device context part in this series does not cover the special/optional things of all the device types.
This is something I promised to do gradually, once the framework looks good.
> 
> Then you should call it virtio net/blk migration and implement in net/block
> section.
No. you misunderstood. My point was showing orthogonal complexities of net vs fs.
I likely failed to explain that.

> > In fact virtio-fs device already discusses the migrating the device side state, as
> listed in device context.
> > So virtio-fs device will have its own device-context defined.
> if you want to migrate it, you need to define it
Sure.
Only device specific things to be defined in future.
Rest is already present.
We are not going to define all the device context in one patch series that no one can review reliably.
It will be done incrementally.

But the feedback, I am taking is, we need to add a command that indicates which TLVs are supported in the device migration.
So virtio-fs or other device migration capabilities can be discovered.
I will cover this in v2.
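
As a rough illustration of why such discovery helps, the sketch below walks a TLV stream and separates entries the receiving side supports from those it does not. The header layout (16-bit type, 32-bit length, little-endian) and all names are hypothetical and are not taken from the patch series.

```python
import struct

# Hypothetical TLV header: le16 type, le32 length. Not the layout from the series.
HDR = struct.Struct("<HI")


def pack_tlv(tlv_type, value):
    """Encode one TLV with the hypothetical header above."""
    return HDR.pack(tlv_type, len(value)) + value


def parse_tlvs(buf, supported):
    """Split a TLV stream into (supported, unsupported) lists of (type, value)."""
    known, unknown = [], []
    offset = 0
    while offset < len(buf):
        if offset + HDR.size > len(buf):
            raise ValueError("truncated TLV header")
        tlv_type, length = HDR.unpack_from(buf, offset)
        offset += HDR.size
        if offset + length > len(buf):
            raise ValueError("truncated TLV value")
        value = buf[offset:offset + length]
        offset += length
        (known if tlv_type in supported else unknown).append((tlv_type, value))
    return known, unknown


stream = pack_tlv(1, b"ab") + pack_tlv(9, b"")
known, unknown = parse_tlvs(stream, supported={1})
print("supported:", known)
print("unsupported:", unknown)
```

With a query command for the supported set, a migration driver could detect unsupported entries before starting the transfer rather than while parsing the context on the destination.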

Thanks a lot for these thoughts.

> >
> > The infrastructure and basic facilities are setup in this series, that one can
> easily extend for all the current and new device types.
> really? how?
> >
> >> And we are migrating stateless devices, or no? How do you migrate virtio-fs?
> >>> 2. sharing such large context and write addresses in parallel for
> >>> multiple devices cannot be done using single register file
> >> see above
> >>> 3. These registers cannot be residing in the VF because VF can
> >>> undergo FLR, and device reset which must clear these registers
> >> do you mean you want to audit all PCI features? When FLR, the device
> >> is reset, do you expect a device to remember anything after FLR?
> > Not at all. VF member device will not remember anything after FLR.
> >> Do you want to trap FLR? Why?
> > This proposal does _not_ want to trap the FLR in the hypervisor virtio driver.
> >
> > When one does the mediation-based design, it must trap/emulate/fake the
> FLR.
> > It helps to address the case of nested as you mentioned.
> once passthrough, the guest driver can access the config space to reset the
> device, right?
> >> Why FLR block or conflict with live migration?
> > It does not block or conflict.
> OK, cool, so let's make this a conclusion
> >
> > The whole point is, when you put live migration functionality on the VF itself,
> you just cannot FLR this device.
> > One must trap the FLR and do fake FLR and build the whole infrastructure to
> not FLR the device.
> > Above is not passthrough device.
> No, the guest can reset the device, even causing a failed live migration.
Not in the proposal here.
Can you please prove how in the current v1 proposal, device reset will fail the migration?
I would like to fix it.

> >
> >>> 4. When VF does the DMA, all dma occurs in the guest address space,
> >>> not in
> >> hypervisor space; any flr and device reset must stop such dma.
> >>> And device reset and flr are controlled by the guest (not mediated
> >>> by
> >> hypervisor).
> >> if the guest reset the device, it is totally reasonable operation,
> >> and the guest own the risk, right?
> > Sure, but the guest still expects its dirty pages and device context to be
> migrated across device_reset.
> > Device_reset will lose all this information within the device if done without
> mediation and special care.
> No, if the guest reset a device, that means the device should be RESET, to forget
> its config, that would be really weird to migrate a fresh device at the source
> side, to be a running device at the destination side.
Device reset not doing the role of reset is just a plain broken spec.

> >
> > So, to avoid that now one needs to have fake device reset too and build that
> infrastructure to not reset.
> >
> > The passthrough proposal fundamental concept is:
> >
> > all the native virtio functionalities are between guest driver and the actual
> device.
> see above.
> >
> >> and still, do you want to audit every PCI features? at least you
> >> didn't do that in your series.
> > Can you please list which PCI features audit you are talking about?
> you audit FLR, then do you want to check everyone?
> If no, how to decide which one should be audited, why others not?

I really find it hard to follow your question.

I explained in patch 5 and 8 about interactions with the FLR and its support.
Not sure what you want me to check.

You mentioned that "I didn’t audit every PCI features"? So can you please list which one and in relation to which admin commands?

> >
> > Keep in mind, that with all the mediation, one now must equally audit all this
> giant software stack too.
> > So maybe it is fine for those who are ok with it.
> so you agree FLR is not a problem, at least for config space solution?
I don’t know what you mean by "FLR is not a problem".

FLR on the VF must work as it works without live migration for passthrough device as today.
And admin commands have some interactions with it.
And this proposal covers it.
I am missing some text that Michael and Jason pointed out.
I am working on v2 to annotate or better word them.

> >
> >> For migration, you know the hypervisor takes the ownership of the
> >> device in the stop_window.
> > I do not know what stop_window means.
> > Do you mean stop_copy of vfio or it is qemu term?
> when guest freeze.
> >
> >>> 5. Any PASID to separate out admin vq on the VF does not work for
> >>> two
> >> reasons.
> >>> R_1: device flr and device reset must stop all the dmas.
> >>> R_2: PASID by most leading vendors is still not mature enough
> >>> R_3: One also needs to do inversion to not expose PASID capability
> >>> of the member PCI device to not expose
> >> see above and what if guest shutdown? the same answer, right?
> > Not sure, I follow.
> > If the guest shutdown, the guest specific shutdown APIs are called.
> >
> > With passthrough device, R_1 just works as is.
> > R_3 is not needed as they are directly given to the guest.
> > R_2 platform dependency is not needed either.
> I think we already have a concussion for FLR.
I don’t have any concussion.
I wrote what to be supported for the FLR above.

> For PASID, what blocks the solution?
When the device is passthrough, PASID capabilities cannot be emulated.
PASID space is owned fully by the guest.

No known CPU vendor supports splitting the PASID space between hypervisor and guest.
I can double check, but as I recall, the Linux kernel removed such weird support.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 10:09                   ` Parav Pandit
  2023-10-12 10:45                     ` Michael S. Tsirkin
@ 2023-10-12 11:10                     ` Zhu, Lingshan
  2023-10-12 11:37                       ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-12 11:10 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/12/2023 6:09 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, October 12, 2023 3:30 PM
>>
>> On 10/11/2023 6:54 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Wednesday, October 11, 2023 3:38 PM
>>>>
>>>>>> The system admin can choose only passthrough some of the devices
>>>>>> for nested guests, so passthrough the PF to L1 guest is not a good
>>>>>> idea, because there can be many devices still work for the host or L1.
>>>>> Possible. One size does not fit all.
>>>>> What I expressed is most common scenarios that user care about.
>>>> don't block existing usecases, don't break the userspace, nested is common.
>>> Nothing is broken as virtio spec do not have any single construct to support
>> migration.
>>> If nested is common, can you share the performance number with real virtio
>> device with/without 2 level nesting?
>>> I frankly don’t know how they look like.
>> virtio devices support nested, I mean don't break this usecase And end user
>> accept performance overhead in nested, this is not related to this topic.
>>
> Can you show an example of virtio device nesting and live migration already supported where the device has _done_ the live migration.
> Due to which you claim that new feature of admin command-based owner and member device breaks something?
Current virtio/KVM/QEMU support nested. You can try to set up a nested VM
to check the result.
>
> Please don’t use the verb "break".
> Your proposal is the first of its kind that supports migrating nested device.
> This is why new patches of config register or admin command does not break anything existing.
If your proposal doesn't support nested, you break nested use cases.
>
>>>>>>> In second use case, where one want to bind only one member device
>>>>>>> to one VM, I think same plumbing can be extended to have another
>>>>>>> VF, to take
>>>>>> the role of migration device instead of owner device.
>>>>>>> I don’t see a good way to passthrough and also do in-band
>>>>>>> migration without
>>>>>> lot of device specific trap and emulation.
>>>>>>> I also don’t know the cpu performance numbers with 3 levels of
>>>>>>> nested page
>>>>>> table translation which to my understanding cannot be accelerated
>>>>>> by the current cpu.
>>>>>> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA and
>> so
>>>> on,
>>>>>> there can be performance overhead, but can be done.
>>>>>>
>>>>>> So admin vq migration still don't work for nested, this is surely a blocker.
>>>>> In specific case of member devices are located at different nest
>>>>> level, it does
>>>> not.
>>>> so you got the point, so this series should not be merged.
>>>>> Why prevents you have a peer VF do the role of migration driver?
>>>>> Basically, what I am proposing is, connect two VFs to the L1 guest.
>>>>> One VF is
>>>> migration driver, one VF is passthrough to L2 guest.
>>>>> And same scheme works.
>>>> A peer VF? A management VF? still break the existing usecase. and how
>>>> do you transfer ownership of L2 VF from PF to L1 VF?
>>> A peer management VF which services admin command (like PF).
>>> Ownership of admin command is delegated to the management VF.
>> interesting, do you plan to cook a patch implementing this?
> No. I am hoping that you can help to draft those patches for nested case to work when one wants to hand of single VM to single nested guest VM.
> I will not be able to test any of nested things and show its performance value either, as I don’t see how rest of the eco system can match up for the nested.
> Hence, your expertise in drafting extension for nested is desired.
I see it does not support nested. As MST once pointed out, a management
VF sounds awkward.
>
>> Really make sense?
>>
>> How do you transfer the ownership?
> An additional ownership deletgation by a new admin command.
If you think this can work, do you want to cook a patch to implement
this before submitting this live migration series?
>> How to you maintain a different group?
> One to one assignment.
same as above
>> How do you isolate the groups?
> Not sure, what it means. The explicit group is created and VFs are placed in this group.
VF resources are on the PF, right?
>> How to you keep the guest or host secure?
> Please be specific. Its very broad question when it comes to defining the interface.
Without isolation, can it be attacked?
>> How do you manage the overlaps?
> Overlaps between?
host pf and L1 VF
>> How do you implement the hardware support that?
> Please consult your board designers. Hard to say how to implement something in generic.
so you don't have an idea
>> How do you change the PCI routing?
> Why anything to be changed in PCI routing?
Do you place the PF and the management VF in an ACL group?
Does the L1 management VF's member device belong to the PF physically?
>
>>> It does not break any existing deployments.
>> we are talking about nested, don't break nested
> Virtio spec for nested is not defined yet. Hence nothing is broken. Please avoid using the verb, _break_.
virtio nested has worked for many years


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-12 10:58                     ` Parav Pandit
@ 2023-10-12 11:17                       ` Michael S. Tsirkin
  2023-10-12 11:47                         ` Parav Pandit
  2023-10-13  1:16                       ` Jason Wang
  2023-10-13  9:06                       ` Zhu, Lingshan
  2 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-12 11:17 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 12, 2023 at 10:58:14AM +0000, Parav Pandit wrote:
> Sure, as I explained the config register method do not work for passthrough mode, and does not scale.

And on the flip side, to be frank not everyone has huge guests and
numbers of VMs and a slower memory mapped interface for small devices
might make sense.  What we need to do though is make this a small
non-intrusive part since it's not clear such embedded cases even need
live migration.



* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 10:45                     ` Michael S. Tsirkin
@ 2023-10-12 11:23                       ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-12 11:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 12, 2023 4:15 PM

[..]
> > Please don’t use the verb "break".
> > Your proposal is the first of its kind that supports migrating nested device.
> > This is why new patches of config register or admin command does not break
> anything existing.
> 
> Wording aside, new features should support as wide a variety of configs as
> possible, if some config is not supported there should be a very good reason.
> 
I agree. If possible, we should define one solution that works for both nested and non-nested cases at the same level of performance.
I just don’t think that is possible for PCI devices.
Additionally, to my knowledge the ecosystem for N-level nesting is not in place either for page tables, PML, etc.
I may be totally wrong.

[..]
> > > > It does not break any existing deployments.
> > > we are talking about nested, don't break nested
> > Virtio spec for nested is not defined yet. Hence nothing is broken. Please avoid
> using the verb, _break_.
> 
> Well people are passing virtio devices through to nested guests.
> Ideally such configs should, somehow, support nested hypervisors migrating
> nested guests. 
I think nesting needs some kind of mediation, and passthrough needs to avoid it.
The best I can think of is this: if the admin commands of this proposal can somehow work on the AQ of the member device, it may work.
If Lingshan can help extend these commands, that would be really nice.

Alternatively, the second idea you proposed, a dummy PF, can work seamlessly with admin commands too.

> Considering e.g. write tracking needs decent performance for
> live migration to deserve the name, I doubt pulling data across PCIe with
> synchronous MMIO operations with no pipelining will work well enough.
Right. Even Intel PML does not support nesting as per my last read.


* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 11:10                     ` Zhu, Lingshan
@ 2023-10-12 11:37                       ` Parav Pandit
  2023-10-12 13:03                         ` Michael S. Tsirkin
                                           ` (2 more replies)
  0 siblings, 3 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-12 11:37 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, October 12, 2023 4:40 PM

> On 10/12/2023 6:09 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Thursday, October 12, 2023 3:30 PM
> >>
> >> On 10/11/2023 6:54 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Wednesday, October 11, 2023 3:38 PM
> >>>>
> >>>>>> The system admin can choose only passthrough some of the devices
> >>>>>> for nested guests, so passthrough the PF to L1 guest is not a
> >>>>>> good idea, because there can be many devices still work for the host or
> L1.
> >>>>> Possible. One size does not fit all.
> >>>>> What I expressed is most common scenarios that user care about.
> >>>> don't block existing usecases, don't break the userspace, nested is
> common.
> >>> Nothing is broken as virtio spec do not have any single construct to
> >>> support
> >> migration.
> >>> If nested is common, can you share the performance number with real
> >>> virtio
> >> device with/without 2 level nesting?
> >>> I frankly don’t know how they look like.
> >> virtio devices support nested, I mean don't break this usecase And
> >> end user accept performance overhead in nested, this is not related to this
> topic.
> >>
> > Can you show an example of virtio device nesting and live migration already
> supported where the device has _done_ the live migration.
> > Due to which you claim that new feature of admin command-based owner
> and member device breaks something?
> current virito/kvm/qemu support nested. 
Sure, two of the 3 components are not part of the virtio spec.
Hence, they are not broken.

> >
> > Please don’t use the verb "break".
> > Your proposal is the first of its kind that supports migrating nested device.
> > This is why new patches of config register or admin command does not break
> anything existing.
> if your proposal don't support nested, you break nested use cases.
> >
> >>>>>>> In second use case, where one want to bind only one member
> >>>>>>> device to one VM, I think same plumbing can be extended to have
> >>>>>>> another VF, to take
> >>>>>> the role of migration device instead of owner device.
> >>>>>>> I don’t see a good way to passthrough and also do in-band
> >>>>>>> migration without
> >>>>>> lot of device specific trap and emulation.
> >>>>>>> I also don’t know the cpu performance numbers with 3 levels of
> >>>>>>> nested page
> >>>>>> table translation which to my understanding cannot be accelerated
> >>>>>> by the current cpu.
> >>>>>> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA
> and
> >> so
> >>>> on,
> >>>>>> there can be performance overhead, but can be done.
> >>>>>>
> >>>>>> So admin vq migration still don't work for nested, this is surely a
> blocker.
> >>>>> In specific case of member devices are located at different nest
> >>>>> level, it does
> >>>> not.
> >>>> so you got the point, so this series should not be merged.
> >>>>> Why prevents you have a peer VF do the role of migration driver?
> >>>>> Basically, what I am proposing is, connect two VFs to the L1 guest.
> >>>>> One VF is
> >>>> migration driver, one VF is passthrough to L2 guest.
> >>>>> And same scheme works.
> >>>> A peer VF? A management VF? still break the existing usecase. and
> >>>> how do you transfer ownership of L2 VF from PF to L1 VF?
> >>> A peer management VF which services admin command (like PF).
> >>> Ownership of admin command is delegated to the management VF.
> >> interesting, do you plan to cook a patch implementing this?
> > No. I am hoping that you can help to draft those patches for nested case to
> work when one wants to hand of single VM to single nested guest VM.
> > I will not be able to test any of nested things and show its performance value
> either, as I don’t see how rest of the eco system can match up for the nested.
> > Hence, your expertise in drafting extension for nested is desired.
The answer to your question below about drafting a patch is here. If you can help extend it, that would be good.

> >
> >> Really make sense?
> >>
> >> How do you transfer the ownership?
> > An additional ownership deletgation by a new admin command.
> if you think this can work, do you want to cook a patch to implement this before
> you submitting this live migration series?
I answered this already above.

> >> How to you maintain a different group?
> > One to one assignment.
> same as above
> >> How do you isolate the groups?
> > Not sure, what it means. The explicit group is created and VFs are placed in
> this group.
> VF resource are on PF, right?
Which resource?
Before jumping to resources, maybe you want to clarify what "group isolation" means?

> >> How to you keep the guest or host secure?
> > Please be specific. Its very broad question when it comes to defining the
> interface.
> without isolation, can be attacked?
What isolation are you talking about?
I am suggesting that one VF, acting as a dummy PF, is given the role of servicing admin commands.

> >> How do you manage the overlaps?
> > Overlaps between?
> host pf and L1 VF
L1 VF works at its own level.
Host PF works at its own level.
This is the true nesting.

> >> How do you implement the hardware support that?
> > Please consult your board designers. Hard to say how to implement something
> in generic.
> so you don't have an idea
:)
Right, I do not have an idea for Intel boards.
I was suggesting a management VF that can service the admin commands.

> >> How do you change the PCI routing?
> > Why anything to be changed in PCI routing?
> do you place PF and mangement VF in an ACL group?
ACL group at which layer?

> Do does L1 management VF's member device belong to the PF physically?
Yes.
> >
> >>> It does not break any existing deployments.
> >> we are talking about nested, don't break nested
> > Virtio spec for nested is not defined yet. Hence nothing is broken. Please avoid
> using the verb, _break_.
> virtio nested works for many years
I replied: your break comment is not applicable to virtio_spec, nor does it apply to any existing software you listed.

As Michael said, software-based nesting is used.
See if actual hardware-based devices can implement it or not. Many components of the CPU cannot do N-level nesting either, but maybe virtio can.
I don’t know how yet.


* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-12 11:17                       ` Michael S. Tsirkin
@ 2023-10-12 11:47                         ` Parav Pandit
  2023-10-12 13:05                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-12 11:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Thursday, October 12, 2023 4:48 PM
> On Thu, Oct 12, 2023 at 10:58:14AM +0000, Parav Pandit wrote:
> > Sure, as I explained the config register method do not work for passthrough
> mode, and does not scale.
> 
> And on the flip side, to be frank not everyone has huge guests and numbers of
> VMs and a slower memory mapped interface for small devices might make
> sense.  What we need to do though is make this a small non-intrusive part since
> it's not clear such embedded cases even need live migration.

I am almost sure I am misunderstanding your point.

If the VM is small and may not need performance, maybe the hypervisor can just offer
VIRTIO_NET_F_STANDBY.

I recollect we even accelerated the VIRTIO_NET_F_STANDBY flow as well.


* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 11:37                       ` Parav Pandit
@ 2023-10-12 13:03                         ` Michael S. Tsirkin
  2023-10-12 13:13                           ` Parav Pandit
  2023-10-13  1:18                         ` Jason Wang
  2023-10-13  9:44                         ` Zhu, Lingshan
  2 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-12 13:03 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 12, 2023 at 11:37:01AM +0000, Parav Pandit wrote:
> As Michael said, software based nesting is used.. 
> See if actual hw based devices can implement it or not. Many components of cpu cannot do N level nesting either, but may be virtio can.
> I don’t know how yet.

It is worth pondering though. Can you give it some thought?
I am thinking of a special bar region (or a couple) through which admin
command structure can be exposed, and a special group type that
refers to same pci function.

-- 
MST



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-12 11:47                         ` Parav Pandit
@ 2023-10-12 13:05                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-12 13:05 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 12, 2023 at 11:47:05AM +0000, Parav Pandit wrote:
> 
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Michael S. Tsirkin
> > Sent: Thursday, October 12, 2023 4:48 PM
> > On Thu, Oct 12, 2023 at 10:58:14AM +0000, Parav Pandit wrote:
> > > Sure, as I explained the config register method do not work for passthrough
> > mode, and does not scale.
> > 
> > And on the flip side, to be frank not everyone has huge guests and numbers of
> > VMs and a slower memory mapped interface for small devices might make
> > sense.  What we need to do though is make this a small non-intrusive part since
> > it's not clear such embedded cases even need live migration.
> 
> I almost sure, I am misunderstanding your point.
> 
> If the VM is small who may not need a performance, may be hypervisor can just offer
> VIRTIO_NET_F_STANDBY.
> 
> I recollect we even accelerated VIRTIO_NET_F_STANDBY flow as well.

The standby hack is frankly problematic in that the VM cannot just be
migrated at any time.



* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 13:03                         ` Michael S. Tsirkin
@ 2023-10-12 13:13                           ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-12 13:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Thursday, October 12, 2023 6:33 PM
> 
> On Thu, Oct 12, 2023 at 11:37:01AM +0000, Parav Pandit wrote:
> > As Michael said, software based nesting is used..
> > See if actual hw based devices can implement it or not. Many components of
> cpu cannot do N level nesting either, but may be virtio can.
> > I don’t know how yet.
> 
> It is worth pondering though. Can you give it some thought?
Yes.
Device context and dirty tracking can be used mostly as is.
It needs a different set of commands, as one cannot stop/freeze the device.

> I am thinking of a special bar region (or a couple) through which admin
> command structure can be exposed, and a special group type that refers to
> same pci function.
Yes, this can be an option as an additional way for non-passthrough mode.


* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-11 10:47             ` Parav Pandit
  2023-10-11 20:14               ` Michael S. Tsirkin
@ 2023-10-13  1:15               ` Jason Wang
  2023-10-13  6:36                 ` Parav Pandit
  2023-10-13 11:41                 ` Michael S. Tsirkin
  1 sibling, 2 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-13  1:15 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Wed, Oct 11, 2023 at 6:47 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, October 11, 2023 8:44 AM
> >
> > On Tue, Oct 10, 2023 at 3:19 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, October 10, 2023 11:21 AM
> > > >
> > > > On Mon, Oct 9, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, October 9, 2023 2:19 PM
> > > > > >
> > > > > > Adding LingShan.
> > > > > >
> > > > > Thanks for adding him.
> > > > >
> > > > > > Parav, if you want any specific people to comment, please do cc them.
> > > > > >
> > > > > Sure, will cc them in v2 as now I see there is interest in the review.
> > > > >
> > > > > > On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > >
> > > > > > > One or more passthrough PCI VF devices are ubiquitous for
> > > > > > > virtual machines usage using generic kernel framework such as vfio [1].
> > > > > >
> > > > > > Mentioning a specific subsystem in a specific OS may mislead the
> > > > > > user to think it can only work in that setup. Let's not do that,
> > > > > > virtio is not only used for Linux and VFIO.
> > > > > >
> > > > > Not really. it is an example in the cover letter.
> > > > > It is not the only use case.
> > > > > A use case gives a crisp clarity of what UAPI it needs to fulfil.
> > > > > So I will keep it. It is anyway written as one use case.
> > > > >
> > > > > > >
> > > > > > > A passthrough PCI VF device is fully owned by the virtual
> > > > > > > machine device driver.
> > > > > >
> > > > > > Is this true? Even VFIO needs to mediate PCI stuff. Or how do
> > > > > > you define "passthrough" here?
> > > > > >
> > > > > Other than PCI config registers and due to some legacy, msix.
> > > > > The "device interface" side is not mediated.
> > > > > The definition of passthrough here is: To not mediate a device
> > > > > type specific
> > > > and virtio specific interfaces for modern and future devices.
> > > >
> > > > Ok, but what's the difference between "device type specific" and
> > > > "virtio specific interfaces". Maybe an example for this?
> > > >
> > > Virtio device specific means: cvq of crypto device, cvq of net device, flow filter
> > vqs of net device etc.
> > > Virtio specific interface: virtio driver notifications, virtio virtqueue and
> > configuration mediation etc.
> > >
> > > > >
> > > > > > > This passthrough device controls its own device reset flow,
> > > > > > > basic functionality as PCI VF function level reset
> > > > > >
> > > > > > How about other PCI stuff? Or Why is FLR special?
> > > > > FLR is special for the readers to get the clarity that FLR is also
> > > > > done by the
> > > > guest driver hence, the device migration commands do not
> > > > interact/depend with FLR flow.
> > > >
> > > > It's still not clear to me how this is done.
> > > >
> > > > 1) guest starts FLR
> > > > 2) adminq freeze the VF
> > > > 3) FLR is done
> > > >
> > > > If the freezing doesn't wait for the FLR, does it mean we need to
> > > > migrate to a state like FLR is pending? If yes, do we need to
> > > > migrate the other sub states like this? If not, why?
> > > >
> > > In most practical cases #2 followed by #1 should not happen as on the source
> > side the expected is mode change to stop from active.
> >
> > How does the hypervisor know if a guest is doing what without trapping?
> >
> Hypervisor does not know. The device knows being the recipient of #1 and #2.

We are discussing the possibility on the software/driver side, aren't we?

1) is initiated from the guest
2) is initiated from the hypervisor

Both are software, and you're saying 2) should not happen after 1)
since the device knows what is being done by guests? How can devices
control software behaviour?

The only possible thing is to make sure 3) is done before 2). That is
what I'm asking, but you are saying freeze doesn't need to wait for
FLR...
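
To make the ordering I am asking about concrete, here is a minimal
sketch, assuming a toy device model. ToyDevice and its methods are
hypothetical names for illustration only; nothing here comes from the
virtio spec or from this series. The assumed rule is that a freeze
request blocks until any in-flight FLR has completed.

```python
import threading


class ToyDevice:
    """Hypothetical device model; none of these names come from the spec."""

    def __init__(self):
        self.events = []
        self.flr_done = threading.Event()
        self.flr_done.set()          # no FLR in flight initially

    def start_flr(self):
        self.flr_done.clear()
        self.events.append("FLR started")

    def finish_flr(self):
        self.events.append("FLR done")
        self.flr_done.set()

    def freeze(self):
        # Assumed rule: a freeze request blocks until any FLR has completed.
        self.flr_done.wait()
        self.events.append("frozen")


dev = ToyDevice()
dev.start_flr()                                  # guest side
freezer = threading.Thread(target=dev.freeze)    # hypervisor side
freezer.start()
dev.finish_flr()
freezer.join()
print(dev.events)
```

In this model "frozen" can only be recorded after "FLR done", whichever
side runs first. Whether a real device should block, fail the command,
or do something else is exactly the question left open here.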

>
> > > But ok, since we active to freeze mode change is allowed, lets discuss above.
> > >
> > > A device is the single synchronization point for any device reset, FLR or admin
> > command operation.
> >
> > So you agree we need synchronization? And I'm not sure I get the meaning of
> > synchronization point, do you mean the synchronization between freeze/stop
> > and virtio facilities?
> >
> Synchronization means, handling two events in parallel such as FLR and other.

Great. So we have a perfect race:

1) guest initiates FLR
2) device starts FLR
3) hypervisor stops and freezes the device
4) device is frozen
5) hypervisor reads device context A
6) migrate device context A
8) migration is done
9) FLR is done
10) hypervisor reads device context B

So we end up with an inconsistent device context, no? The destination
wants B or A+B, but you give A.
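
The sequence above can be replayed with a small sketch, assuming a toy
device whose context is "A" before the FLR takes effect and "B" after.
ToyDevice, run_race and the context labels are hypothetical and only
mirror the steps listed; the freeze deliberately does not wait for the
FLR, which is the behaviour being questioned.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDevice:
    """Hypothetical model of a member device; not a virtio-defined structure."""
    context: str = "A"          # device context before the FLR takes effect
    flr_pending: bool = False
    frozen: bool = False
    log: list = field(default_factory=list)

    def start_flr(self):
        self.flr_pending = True
        self.log.append("FLR started")

    def freeze(self):
        # Behaviour under discussion: freeze does not wait for the FLR.
        self.frozen = True
        self.log.append("frozen")

    def read_context(self):
        self.log.append(f"read context {self.context}")
        return self.context

    def finish_flr(self):
        # The FLR updates the device context when it completes.
        self.context = "B"
        self.flr_pending = False
        self.log.append("FLR done")


def run_race():
    dev = ToyDevice()
    dev.start_flr()                 # guest initiates FLR, device starts it
    dev.freeze()                    # hypervisor stops and freezes the device
    migrated = dev.read_context()   # context read and migrated to destination
    dev.finish_flr()                # FLR completes after migration is done
    final = dev.read_context()      # source now reports a different context
    return migrated, final


migrated, final = run_race()
print(migrated, final)
```

The destination ends up holding the first value while the source ends
up with the second, which is the inconsistency described above.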

>
> > > So, the migration driver do not need to wait for FLR to complete.
> >
> > I'm confused, you said below that device context could be changed by FLR.
> >
> Yes.
> > If FLR needs to clear device context, we can have a race where device context is
> > cleared when we are trying to read it?
> >
> I didn’t say clear the context.
> FLR updates the device context.

In what sense?

> Device is serving the device context read write commands, serving FLR, answering mode change command,
> So device knows the best how to avoid any race.

You want to leave those details for the vendor to figure out? If
devices know everything, why do we need device normative statements?

I see issues at least for FLR, and I'm pretty sure there are others. If
a design requires us to audit all the possible conflicts between virtio
facilities and the transport, that is a strong hint of a layer
violation. When that happens, it will very likely hit a lot of problems
that are very hard to find or debug, so we should drop such a design. I
suggest using the RFC tag from the next version on (if there is one),
as I see it is immature in many ways.

What's more, solving races is much easier if the device functionality
is self contained. For example, for a self contained device with the
transport as the single interface, we can leverage the transport
(PCI) for dealing with races, arbitration, ordering, QoS, etc., which
are probably required in the internal channel between the owner and the
member. But all of these are missing from your series, and even if you
could add them I'm not sure it's worthwhile to reinvent all of them.

For example, in an architecture like owner/member, if the virtio or
transport facility can be controlled via device internal channels
besides the transport, such a channel may complicate the
synchronization a lot. The device needs to be able to handle or
synchronize requests from both PCI and the owner in parallel. There are
just too many possible races, and most of my questions so far come from
this viewpoint. I won't go further into other stuff since I believe
I've spotted sufficient issues, and that's why I must stop at this
patch before looking at the rest.

Admin commands are fine if they do real administrative jobs such as
provisioning, since such work is beyond the core virtio functionality.

Again, the goal of the virtio spec is to describe a device with
sufficient guidelines that it is easy to implement, not to leave the
vendors to waste their engineering resources on figuring out or fuzzing
the corner cases.

>
> > > When admin cmd freeze the VF it can expect FLR_completed VF.
> >
> > We need to explain why and how about the resume? For example, is resuming
> > required to wait for the completion of FLR, if not, why?

This question is ignored.

> >
> > > Secondly since the FLR is local to the source, intermediate sub state does not
> > migrate.
> > >
> > > But I agree, it is worth to have the text capturing this.
> > >
> > > > >
> > > > > >
> > > > > > > and rest of the virtio device functionality such as control
> > > > > > > vq,
> > > > > >
> > > > > > What do you mean by "rest of"?
> > > > > >
> > > > > As given in the example cvq.
> > > > >
> > > > > > Which part is not controlled and why?
> > > > > Not controlled because as states, it is passthrough device.
> > > > >
> > > > > > > config space access, data path descriptors handling.
> > > > > > >
> > > > > > > Additionally, VM live migration using a precopy method is also
> > > > > > > widely
> > > > used.
> > > > > >
> > > > > > Why is this mentioned here?
> > > > > >
> > > > > Huh. You should be positive for bringing clarity to the readers on
> > > > understanding the use case.
> > > > > And you seem opposite, but ok.
> > > > >
> > > > > As stated, it for the reader to understand the use case and see
> > > > > how proposed
> > > > commands addresses the use case.
> > > >
> > > > The problem is that the hardware features should be designed for a
> > > > general purpose instead of a specific technology if it can. The only
> > > > missing part for post copy is the page fault.
> > > >
> > > Ok. The use case and requirement of member device passthrough is clear to
> > most reviewers now.
> >
> > In another thread you are saying that the PCI composition is done by hypervisor,
> > so passthrough is really confusing at least for me.
> >
> I explained there what vPCI composition is done there.
> PCI config space and msix side of composition is done.
> The whole virtio interface is not composed.

You need to describe this somewhere, no? That's what I'm saying.

And passthrough is misleading here.

>
> > > Ok. I assume "reset flow" is clear to you now that it points to section 2.4.
> > > This section is not normative section, so using an extra word like "flow" does
> > not confuse anyone.
> > > I will link to the section anyway.
> >
> > Probably, but you mention FLR flow as well.
> As I said, not repeating the PCIe spec here. The reader knows what FLR of the PCIe transport.

Ok, I'm not a native speaker, but I really don't know the difference
between "FLR" and "FLR flow".

>
> >
> > >
> > > > >
> > > > > > > and may also undergo PCI function level
> > > > > > > +reset(FLR) flow.
> > > > > >
> > > > > > Why is only FLR special here? I've asked FRS but you ignore the question.
> > > > > >
> > > > > FLR is special to bring clarity that guest owns the VF doing FLR,
> > > > > hence
> > > > hypervisor cannot mediate any registers of the VF.
> > > >
> > > > It's not about mediation at all, it's about how the device can
> > > > implement what you want here correctly.
> > > >
> > > > See my above question.
> > > >
> > > Ok. it is clear that live migration commands cannot stay on the member device
> > because the member device can undergo device reset and FLR flows owned by
> > the guest.
> >
> > I disagree, hypervisors can emulate FLR and never send FLR to real devices.
> >
> That would be some other trap alternative that needs to dissect the device and build infrastructure for such dissection is not desired in the listed use case.

Do you need to trap FLR or not? You're saying the hypervisor is in
charge of vPCI, so how does this differ from what you proposed? If not,
how can vPCI be composed?

I believe you need to document how vPCI is supposed to be done, since
I believe your proposal can only work with such specific types of PCI
composition. This is one of the important things that is missing in
this series.

> Here we are addressing the requirement of passthrough the device.

I don't think what you proposed here is passthrough, at least the PCI
part is not. And whether or not virtio can be passed through is still
questionable to me.

>
> So your disagreement is fine for non-passthrough devices.
>
> > > (and hypervisor is not involved in these two flows, hence the admin command
> > interface is designed such that it can fullfil above requirements).
> > >
> > > Theory of operation brings out this clarity. Please notice that it is in
> > introductory section with an example.
> > > Not normative line.
> > >
> > > > >
> > > > > > > Such flows must comply to the PCI standard and also
> > > > > > > +virtio specification;
> > > > > >
> > > > > > This seems unnecessary and obvious as it applies to all other
> > > > > > PCI and virtio functionality.
> > > > > >
> > > > > Great. But your comment is contradicts.
> > > > >
> > > > > > What's more, for the things that need to be synchronized, I
> > > > > > don't see any descriptions in this patch. And if it doesn't need, why?
> > > > > With which operation should it be synchronized and why?
> > > > > Can you please be specific?
> > > >
> > > > See my above question regarding FLR. And it may have others which I
> > > > haven't had time to audit.
> > > >
> > > Ok. when you get chance to audit, lets discuss that time.
> >
> > Well, I'm not the author of this series, it should be your job otherwise it would
> > be too late.
> >
> As author, what we think, I will cover. If you have specific points to add value, please share, I will look into it.

I've pointed out sufficient issues. I have a lot of others but I don't
want to have a giant thread once again.

>
> > For example, how is the power management interaction with the freeze/stop?
> >
> Power management is owned by the guest, like any other virtio interface.
> So freeze/stop do not interfere with it.

I don't think this is a good answer. I'm asking how PM interacts with
freeze/stop, and you answer that it works well.

I'm not obliged to design hardware for you, only to spot bad design
for virtio. I'm not convinced by a proposal that misses a lot of
obvious critical cases, and for sure it's not my job to solve them.

I've demonstrated the possible races with FLR. The same goes for PM.
For example, if the VF is in the D3cold state, can we still read its
device context? If yes, is it a violation of the PCIe spec? If not,
why? How about other states? Can the device be frozen in the middle of
PM state transitions? If yes, how can it work without migrating PCI
states?

>
> > >
> > > > >
> > > > > It is not written in this series, because we believe it must not
> > > > > be synchronized
> > > > as it is fully controlled by the guest.
> > > > >
> > > > > >
> > > > > > > at the same time such flows must not obstruct
> > > > > > > +the device migration flow. In such a scenario, a group owner
> > > > > > > +device can provide the administration command interface to
> > > > > > > +facilitate the device migration related operations.
> > > > > > > +
> > > > > > > +When a virtual machine migrates from one hypervisor to
> > > > > > > +another hypervisor, these hypervisors are named as source and
> > > > > > > +destination
> > > > > > hypervisor respectively.
> > > > > > > +In such a scenario, a source hypervisor administers the
> > > > > > > +member device to suspend the device and preserves the device
> > context.
> > > > > > > +Subsequently, a destination hypervisor administers the member
> > > > > > > +device to setup a device context and resumes the member device.
> > > > > > > +The source hypervisor reads the member device context and the
> > > > > > > +destination hypervisor writes the member device context. The
> > > > > > > +method to transfer the member device context from the source
> > > > > > > +to the destination hypervisor is
> > > > > > outside the scope of this specification.
> > > > > > > +
> > > > > > > +The member device can be in any of the three migration modes.
> > > > > > > +The owner driver sets the member device in one of the
> > > > > > > +following modes during
> > > > > > device migration flow.
> > > > > > > +
> > > > > > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name
> > > > > > > +& Description \\ \hline \hline
> > > > > > > +0x0   & Active &
> > > > > > > +  It is the default mode after instantiation of the member
> > > > > > > +device. \\
> > > > > >
> > > > > > I don't think we ever define "instantiation" anywhere.
> > > > > >
> > > > > Well a transport has implicit definition of the instantiation already.
> > > > > May be a text can be added, but don’t see a value in duplicating
> > > > > PCI spec
> > > > here.
> > > >
> > > > Ok, maybe something like "transport specific instantiation"
> > > >
> > > Ok. that’s a good text. I will change to it.
> > >
> > > > >
> > > > > > > +\hline
> > > > > > > +0x1   & Stop &
> > > > > > > + In this mode, the member device does not send any
> > > > > > > +notifications, and it does not access any driver memory.
> > > > > >
> > > > > > What's the meaning of "driver memory"?
> > > > > >
> > > > > May be guest memory? Or do you suggest a better naming for the
> > > > > memory
> > > > allocated by the guest driver?
> > > >
> > > > Virtqueue?
> > > >
> > > Virtqueue and any memory referred by the virtqueue.
> > >
> > > This is good text, I will change to it.
> > >
> > > > >
> > > > > > And stop seems to be a source of inflight buffers.
> > > > > >
> > > > > I didn’t follow it.
> > > > > If you mean without stop there are no inflight buffer, then I don’t agree.
> > > > > We don’t want to violate the spec by having descriptors with zero
> > > > > size
> > > > returned.
> > > > > Stop is not the source of inflight descriptors.
> > > >
> > > > I think not since you forbid access to the used ring here. So even
> > > > if the buffer were processed by the device it can't be added back to
> > > > the used ring thus became inflight ones.
> > > >
> > > > >
> > > > > There are inflight descriptors with the device that are not yet
> > > > > returned to the
> > > > driver, and device wont return them as zero size wrong completions.
> > > > >
> > > > > > > + The member device may receive driver notifications in this
> > > > > > > + mode,
> > > > > >
> > > > > > What's the meaning of "receive"? For example if the device can
> > > > > > still process buffers, "stop" is not accurate.
> > > > > >
> > > > > Receive means, driver can send the notification as PCIe TLP that
> > > > > device may
> > > > receive as incoming PCIe TLP.
> > > >
> > > > Ok, so this is the transport level. But the device can keep processing the
> > queue?
> > > >
> > > Device cannot process the queue because it does not initiate any read/write
> > towards the virtqueue.
> >
> > Read/Write only results in a driver noticeable behaviour, it doesn't mean the
> > device can't process the buffers.  For example, devices can keep processing
> > available buffers and make them as inflight ones.
> >
> The idea is to stop the device and prepare for the migration, so the command to do so.
> Otherwise just the keep the device in active mode and avoid the complications.

Well, I meant we need a more precise definition of each state,
otherwise it could be ambiguous (as I pointed out above).

>
> > >
> > > > >
> > > > > In "stop" mode, the device wont process descriptors.
> > > >
> > > > If the device won't process descriptors, why still allow it to receive
> > notifications?
> > > Because notification may still arrive and if the device may update any
> > > counters as part of
> >
> > Which counters did you mean here?
> >
> The counter that Xuan is adding and any other state that device may have to update as result of driver notification.
> For example caching the posted avail index in the notification.

A link to those proposals? If the device must depend on those cached
features to work it's really fragile. If not, we don't need to care
about them.

>
> > > it which needs to be migrated or store the received notification.
> > >
> > > > Or does it really matter if the device can receive or not here?
> > > >
> > > From device point of view, the device is given the chance to update its device
> > context as part of notifications or access to it.
> >
> > This is in conflict with what you said above " Device cannot process the queue
> > ..."
> >
> No, it does not.
> Device context is updated within the device without accessing the queue memory of the guest.

This is not documented or explained anywhere?

>
> > Maybe you can give a concrete example.
> >
> The above one.
>
> > >
> > > > >
> > > > > > > + the member device context
> > > > > >
> > > > > > I don't think we define "device context" anywhere.
> > > > > >
> > > > > It is defined further in the description.
> > > >
> > > > Like this?
> > > >
> > > > """
> > > >  +The member device has a device context which the owner driver can
> > > > +either read or write. The member device context consist of any
> > > > device  +specific data which is needed by the device to resume its
> > > > operation  +when the device mode """
> > > >
> > > Yes.
> > > Further patch-3 adds the device context and also add the link to it in the
> > theory of operation section so reader can read more detail about it.
> > >
> > > > "Any" is probably too hard for vendors to implement. And in patch 3
> > > > I only see virtio device context. Does this mean we don't need
> > > > transport
> > > > (PCI) context at all? If yes, how can it work?
> > > >
> > > Right. PCI member device is present at source and destination with its layout,
> > only the virtio device context is transferred.
> > > Which part cannot work?
> >
> > It is explained in another thread where you are saying the PCI requires
> > mediation. I think any author should not ignore such important assumptions in
> > both the change log and the patch.
> >
> > And again, the more I review the more I see how narrow this series can be used:
> >
> I explained this before and also covered in the cover letter.
>
> > 1) Only works for SR-IOV member device like VF
> It can be extended to SIOV member device in future.
> Today these are the only type of member device virtio has.

That is exactly what I want to say: it can only work for the
owner/member model. It can't work when the virtio device is not
structured like that. And you missed that most of the existing virtio
devices are not implemented in this model. It means they can't be
migrated with a pure virtio specific extension. For you, SR-IOV is
everything, but this is not true for virtio. PCI is not the only
transport and SR-IOV is not the only architecture in PCI.

And I'm pretty sure owner/member is not the only requirement; there
are a lot of other assumptions which are missing in this series.

>
> > 2) Mediate PCI but not virtio which is tricky
> > 3) Can only work for a specific BAR/capability register layout
> >
> > Only 1) is described in the change log.
> >
> > The other important assumptions like 2) and 3) are not documented anywhere.
> > And this patch never explains why 2) and 3) is needed or why it can be used for
> > subsystems other than VFIO/Linux.
> >
> Since I am not mentioning vfio now, I will refrain from mentioning others as well. :)

It's not about VFIO at all. It's about letting people know in which
cases this proposal can work. Otherwise, if a vendor develops a
BAR/cap which is not at a page boundary, how could you make it work
with your proposal here?

>
> > >
> > > > >
> > > > > > >and device configuration space may change. \\
> > > > > > > +\hline
> > > > > >
> > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > >
> > > > > All pci devices which belong to a single guest VM are not stopped
> > atomically.
> > > > > Hence, one device which is in freeze mode, may still receive
> > > > > driver notifications from other pci device,
> > > >
> > > > Device may choose to ignore those notifications, no?
> > > >
> > > > > or it may experience a read from the shared memory and get garbage
> > data.
> > > >
> > > > Could you give me an example for this?
> > > >
> > > Section 2.10 Shared Memory Regions.
> >
> > How can it experience a read in this case?
> >
> MMIO read/write can be initiated by the peer device while the device is in stopped state.

Ok, but what I want to ask is how it can get garbage data here.

>
> > Btw, shared regions are tricky for hardware.
> >
> > >
> > > > > And things can break.
> > > > > Hence the stop mode, ensures that all the devices get enough
> > > > > chance to stop
> > > > themselves, and later when freezed, to not change anything internally.
> > > > >
> > > > > > > +0x2   & Freeze &
> > > > > > > + In this mode, the member device does not accept any driver
> > > > > > > +notifications,
> > > > > >
> > > > > > This is too vague. Is the device allowed to be freezed in the
> > > > > > middle of any virtio or PCI operations?
> > > > > >
> > > > > > For example, in the middle of feature negotiation etc. It may
> > > > > > cause implementation specific sub-states which can't be migrated easily.
> > > > > >
> > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > It is passthrough device, hence hypervisor layer do not get to see sub-
> > state.
> > > > >
> > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > The device context already covers this sub-state.
> > > >
> > > > 1) driver writes driver_features
> > > > 2) driver sets FEAUTRES_OK
> > > >
> > > > 3) device receive driver_features
> > > > 4) device validating driver_features
> > > > 5) device clears FEATURES_OK
> > > >
> > > > 6) driver read stats and realize FEATURES_OK is being cleared
> > > >
> > > > Is it valid to be frozen of the above?
> > > No. device mode is frozen when hypervisor is sure that no more access by the
> > guest will be done.
> >
> > How, you don't trap so 1) and 2) are posted, how can hypervisor know if there's
> > inflight transactions to any registers?
> >
> Because hypervisor has stopped the vcpus which are issuing them.

MMIO writes are posted. The vCPU is stopped but the transactions are
inflight. How could the hypervisor/device know if there are any
inflight PCIe transactions here? So I can imagine what happens in fact
is that the TLP for freezing is ordered with the TLP for posted MMIO.
This is probably guaranteed for a typical PCIe setup, but how about
relaxed ordering?
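
A toy Python sketch of the concern, under the assumption that nothing
orders the two paths. The queue and function names are invented for
illustration; whether real hardware can reorder like this depends on
the PCIe topology and ordering attributes, which is exactly what needs
to be documented.

```python
from collections import deque

# Hypothetical model: posted writes from the guest are still on their way
# to the VF after the vCPUs were stopped, while the freeze command reaches
# the device through the owner (PF).
vf_inflight = deque(["write driver_features", "write FEATURES_OK"])

device_log = []   # order in which the device observes events


def deliver_freeze():
    device_log.append("freeze")


def drain_vf_writes():
    while vf_inflight:
        device_log.append(vf_inflight.popleft())


# Assumption of this sketch: no ordering between the two paths, so the
# freeze command may land before the posted writes do.
deliver_freeze()
drain_vf_writes()

print(device_log)
```

Stopping the vCPUs only stops new writes from being issued; in this
model it says nothing about the ones already posted.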

>
> > > What can happen between #2 and #3, is device mode may change to stop.
> >
> > Why can't be freezed in this case? It's really hard to deduce why it can't just
> > from your above descriptions.
> >
> On the source hypervisor, the mode changes are active->stop->freeze.
> Hence when freeze is done, the hypervisor knows that all inflight has been stopped by now.

Ok, but how about freezing between 3) and 4). If we allow it, do we
need to migrate to this state? If yes, how can it work with your
device context? If not, shouldn't we document this?

>
> > Even if it had, is it even possible to list all the places where freezing is
> > prohibited? We don't want to end up with a spec that is hard to implement or
> > leave the vendor to figure out those tricky parts.
> >
> The general idea is not prohibiting the freeze/stop mode.
> If the device needs more time, let device take time to do it.

Ok, it means:

1) there are conditions for going from stop to freeze, so what are they?
2) how much time at most? E.g. FLR takes at most 100ms.
3) If it needs more time, can this time satisfy the downtime requirement?

>
>
> > > And in stop mode, device context would capture #5 or #4, depending where is
> > device at that point.
> > >
> > > > >
> > > > > > And what's more, the above state machine seems to be virtio
> > > > > > specific, but you don't explain the interaction with the device
> > > > > > status state
> > > > machine.
> > > > > First, above is not a state machine.
> > > >
> > > > So how do readers know if a state can go to another state and when?
> > > >
> > > Not sure what you mean by reader. Can you please explain.
> >
> > The people who read virtio spec.
> >
> So question is "how reader knows if a state can go to another state and when"?
> It is described and listed in the table, when a mode can change.

It's not only "if" but also "when". Your table partially answers the
"if" but not the "when". I think you should know by now that the state
transition is conditional. So let's try our best to ease the life of
the vendor.
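
As a rough illustration of what I mean by "if" and "when", a Python
sketch of a transition table that carries a precondition per edge. The
mode values follow the table in the patch and the edges are the ones
mentioned in this thread; the preconditions are placeholders I made up,
since the series does not define any.

```python
from enum import IntEnum


class Mode(IntEnum):
    ACTIVE = 0x0
    STOP = 0x1
    FREEZE = 0x2


# "If": which transitions exist. "When": a precondition for each one.
# The preconditions are invented for illustration only.
TRANSITIONS = {
    (Mode.ACTIVE, Mode.STOP): lambda dev: True,
    (Mode.STOP, Mode.FREEZE): lambda dev: not dev["flr_in_progress"],
    (Mode.ACTIVE, Mode.FREEZE): lambda dev: not dev["flr_in_progress"],
    (Mode.STOP, Mode.ACTIVE): lambda dev: True,
    (Mode.FREEZE, Mode.ACTIVE): lambda dev: True,
}


def can_change(dev, new_mode):
    """Return True only if the edge exists and its precondition holds."""
    cond = TRANSITIONS.get((dev["mode"], new_mode))
    return cond is not None and cond(dev)


dev = {"mode": Mode.STOP, "flr_in_progress": True}
print(can_change(dev, Mode.FREEZE))   # edge exists, precondition not met
dev["flr_in_progress"] = False
print(can_change(dev, Mode.FREEZE))   # edge exists, precondition met
```

The spec text would need to state the right-hand column for every edge,
whatever the real conditions turn out to be.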

>
> > > > So only the driver notification is allowed by not config write?
> > > > What's the consideration for allowing driver notification?
> > > >
> > > Because for most practical purposes, peer device wants to queue blk, net
> > other requests and not do device configuration.
> >
> > You forbid the device to process the queue but only allow the notification. How
> > can the device queue those requests? The device can just do the available
> > buffer check after resume, then it's all fine.
> >
> Device can always decide to not queue the request and do the available buffer check later.
> The peer device may read also from MMIO space.
>
> So the intermediate step covers this aspect where device_type specific plumbing is not done.
> Its generic. A device may choose to omit such doorbells as well as long as it knows it can resume.

I'm not sure I get this, but the device doesn't need to be kicked
after resume. That's what I want to say.

>
> > >
> > > Do you know any device configuration space which is RW?
> > > For net and blk I recall it as RO?
> >
> > For example, WCE. What's more important, the spec allows config space to be
> > RW, so even if there's no examples before, it doesn't mean we won't have a RW
> > in the future.
> >
> Ok.
>
> > >
> > > > Let me ask differently, similar to FLR, what happens if the driver
> > > > wants a virtio reset but the hypervisor wants to stop or freeze?
> > > >
> > > The device would respond to stop/freeze request when it has internally
> > started the reset, as device is the single synchronization point which knows how
> > to handle both in parallel.
> >
> > Let's define the synchronization point first. And it demonstrates at least devices
> > need to synchronize between the free/stop and virtio device status machine
> > which is not as easy as what is done in this patch.
> >
> Synchronization point = device.

This is obvious as we can't rule stuff outside virtio, and we are
talking about devices not drivers here. But the spec needs sufficient
guidance/normative statements for the vendor to implement. It's more
than just saying "device is synchronization point".

>
> > >
> > > > > We would enrich the device context for this, but no need to
> > > > > connects the
> > > > admin mode controlled by the owner device with operational state
> > > > (device_status) owned by the member device.
> > > > >
> > > > > > > + it ignores any device configuration space writes,
> > > > > >
> > > > > > How about read and the device configuration changes?
> > > > > >
> > > > > As listed, device do not have any changes.
> > > > > So device configuration change cannot occur.
> > > >
> > > > It's not necessarily caused by config write, it could be things like
> > > > link status or geometry changes that are initiated from the device.
> > > >
> > > I understand it. Link status was one example, you listed other examples too.
> > > The point is, when in freeze mode, the member device is frozen, hence,
> > device won't initiate those changes.
> > >
> > > > >
> > > > > The device requirements cover this content more explicitly:
> > > > >
> > > > > For the SR-IOV group type, regardless of the member device mode,
> > > > > all the PCI transport level registers MUST be always accessible
> > > > > and the member device MUST function the same way for all the PCI
> > > > > transport level
> > > > registers regardless of the member device mode.
> > > > >
> > > > > > > + the device do not have any changes in the device context.
> > > > > > > + The member device is not accessed in the system through the
> > > > > > > + virtio
> > > > interface.
> > > > > > > + \\
> > > > > >
> > > > > > But accessible via PCI interface?
> > > > > >
> > > > > Yes, as usual.
> > > > >
> > > > > > For example, what happens if we want to freeze during FLR? Does
> > > > > > the hypervisor need to wait for the FLR to be completed?
> > > > > >
> > > > > Hypervisor do not need wait for the FLR to be completed.
> > > >
> > > > So does FLR change device context?
> > > Yes.
> >
> > So this implies the freeze needs to wait for FLR otherwise device context may
> > change.
> >
> Device context can change anytime and reflect what is latest.
> I will update the patches to reflect that device is the single synchronization point serving flr, mode changes.
>
> > >
> > > >
> > > > >
> > > > > > > +\hline
> > > > > > > +\hline
> > > > > > > +0x03-0xFF   & -    & reserved for future use \\
> > > > > > > +\hline
> > > > > > > +\end{tabularx}
> > > > > > > +
> > > > > > > +When the owner driver wants to stop the operation of the
> > > > > > > +device, the owner driver sets the device mode to
> > > > > > > +\field{Stop}. Once the device is in the \field{Stop} mode,
> > > > > > > +the device does not initiate any notifications or does not
> > > > > > > +access any driver memory. Since the member driver may be
> > > > > > > +still active which may send further driver notifications to the device,
> > the device context may be updated.
> > > > > > > +When the member driver has stopped accessing the device, the
> > > > > > > +owner driver sets the device to \field{Freeze} mode
> > > > > > > +indicating to the device that no more driver access occurs.
> > > > > > > +In the \field{Freeze} mode, no more changes occur in the device
> > context.
> > > > > > > +At this point, the device ensures that
> > > > > > there will not be any update to the device context.
> > > > > >
> > > > > > What is missed here are:
> > > > > >
> > > > > > 1) it is a virtio specific states or not
> > > > > It is not.
> > > > >
> > > > > > 2) if it is a virtio specific state, if or how to synchronize
> > > > > > with transport specific interfaces and why
> > > > > > 3) can active go directly to freeze and why
> > > > > >
> > > > > Yes. don’t see a reason to not allow it.
> > > > > Active to freeze mode can change is useful on the destination
> > > > > side, where
> > > > destination hypervisor knows for sure that there is no other entity
> > > > accessing the device.
> > > > > And it needs to setup the device context, it received from the source side.
> > > > > So setting freeze mode can be done directly.
> > > > >
> > > > > > > +
> > > > > > > +The member device has a device context which the owner driver
> > > > > > > +can either read or write. The member device context consist
> > > > > > > +of any device specific data which is needed by the device to
> > > > > > > +resume its operation when the device mode
> > > > > >
> > > > > > This is too vague. There're states that are not suitable for
> > > > > > cmd/queue for
> > > > sure.
> > > > > > I'd split it into
> > > > > >
> > > > > > 1) common states: virtqueue, dirty pages
> > > > > > 2) device specific states: defined be each device
> > > > > >
> > > > > This is theory of operation section. So it capturing such details.
> > > > > Actual device context definition is outside of theory, and precise
> > > > > states of
> > > > virtqueue, device specific, etc are in it.
> > > >
> > > > See my comment above regarding to the device context.
> > > >
> > > I replied above, device context link is added in the patch-3 in the theory of
> > operation.
> > > So reader gets the complete view.
> > >
> > > > >
> > > > > > > +is changed from \field{Stop} to \field{Active} or from
> > > > > > > +\field{Freeze} to \field{Active}.
> > > > > > > +
> > > > > > > +Once the device context is read, it is cleared from the device.
> > > > > >
> > > > > > This is horrible, it means we can't easily
> > > > > >
> > > > > > 1) re-try the migration
> > > > > > 2) recover from migration failure
> > > > > >
> > > > > Can you please explain the flow?
> > > >
> > > > When migration fails, management can choose to resume the device(VM)
> > > > on the source.
> > > >
> > > ok. This should be possible as the management which has the device
> > > context, it can restore it on the source and move the device mode to active.
> > >
> > > > If the state were cleared, it means there's not simple way to resume
> > > > the device but restoring the whole context.
> > > >
> > > Yes, as you say, by restoring the whole context will suffice this corner/rare
> > case scenario.
> > >
> > > > What's the consideration for such clearing?
> > > >
> > > There are two considerations.
> > > 1.  If one does not clear, till how long should it be kept on the device?
> >
> > Until virtio reset, this is how virtio works now. I've pointed out that it may cause
> > extra troubles when trying to resume, but you don't tell me what's wrong to
> > keep that?
> >
> If kept, hypervisor may not be able to decide when to change the mode from active->stop.

Why? It is simply done when mgmt requires a migration?

What's more important, PCI allows multiple common_cfgs. So the
hypervisor can choose to reserve one common_cfg for live migration. In
this case we don't need read-to-clear semantics.

Or, are you saying the value read from common_cfg is not device
context? Doesn't this conflict with your vague definition of device
context?

> We can opt for a mode where full device context is read in each mode without clearing it.
> But than it can be very specific to a version of qemu, which we are avoiding it here.
>
> > > 2. device context returns incremental value from the previous read. So, it
> > needs to clear it.
> >
> > I don't understand here. This is not the case for most of the devices.
> >
> Not sure which devices you mean here with "most of the devices".
> Device context functions like a write record pages (aka dirty pages).

It's definitely different. We want to migrate dirty pages live, which
can consume a lot of bandwidth. So reporting deltas makes a lot of
sense there, since it takes a lot of rounds of syncing and it doesn't
block resuming.

For device context, how many rounds of syncing did you expect, and if
we have N rounds, do we need to restore N rounds in order to resume? Do
you want to live migrate device states? If it's only 1 or 2 rounds,
why bother?

And for the delta, how do you know you can easily define deltas for
every type of device, especially the ones with complicated internal
states? Defining states has already been demonstrated to be a
complicated task for some devices like virtio-FS, and you want to
complicate it further?
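
To illustrate the N rounds question, a small Python sketch of
read-to-clear, delta-style device context. The DeltaDevice class, the
restore() helper and the field names are hypothetical; this only models
the semantics as I understand them from your description.

```python
# Hypothetical model of a device context that is returned as deltas and
# cleared on read.

class DeltaDevice:
    def __init__(self):
        self.state = {}     # full internal state
        self.pending = {}   # changes not yet reported to the owner driver

    def update(self, key, value):
        self.state[key] = value
        self.pending[key] = value

    def read_context(self):
        # Return only what changed since the last read, then clear it.
        delta, self.pending = self.pending, {}
        return delta


def restore(deltas):
    # Rebuilding the full context needs every delta, applied in order.
    full = {}
    for delta in deltas:
        full.update(delta)
    return full


dev = DeltaDevice()
dev.update("vq0_avail_idx", 10)
dev.update("features", 0x3)
round1 = dev.read_context()
dev.update("vq0_avail_idx", 42)
round2 = dev.read_context()

print(restore([round1, round2]) == dev.state)   # all rounds replayed
print(restore([round2]) == dev.state)           # one round missing
```

With deltas, whoever resumes the device has to keep and replay every
round, which is the extra burden on retry and failure recovery.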

What is proposed in this series is an ad-hoc optimization for a
specific device type within a specific subsystem (e.g. VFIO) in a
specific operating system, which is not general.

As demonstrated many times, starting from something simple and stupid
is the easiest way.

> Whatever is already returned is/should not be repeated in subsequent reads, though device can choose to do so.
>
> > >
> > > > > And which software stack may find this useful?
> > > > > Is there any existing software that can utilize it?
> > > >
> > > > Libvirt.
> > > >
> > > Does libvirt restore on migration failure?
> >
> > Yes.
> >
> Ok. the device will be able to resume when it is marked active.
> The device context returned  is the incremental delta as explained above.

I disagree, see my above reply.

>
> > >
> > > > > Why that device context present with the software vanished, in
> > > > > your
> > > > assumption, if it is?
> > > > >
> > > > > > > Typically, on
> > > > > > > +the source hypervisor, the owner driver reads the device
> > > > > > > +context once when the device is in \field{Active} or
> > > > > > > +\field{Stop} mode and later once the member device is in
> > \field{Freeze} mode.
> > > > > >
> > > > > > Why need the read while device context could be changed? Or is
> > > > > > the dirty page part of the device context?
> > > > > >
> > > > > It is not part of the dirty page.
> > > > > It needs to read in the active/stop mode, so that it can be shared
> > > > > with
> > > > destination hypervisor, which will pre-setup the complex context of
> > > > the device, while it is still running on the source side.
> > > >
> > > > Is such a method used by any hypervisor?
> > > Yes. qemu which uses vfio interface uses it.
> >
> > Ok, such software technology could be used for all types of devices, I don't see
> > any advantages to mention it here unless it's unique to virtio.
> >
> It is theory of operation that brings the clarity and rationale.

I think it's not. Since it's not something that is unique to virtio.

> So I will keep it.
>
> > >
> > > >
> > > > >
> > > > > > > +
> > > > > > > +Typically, the device context is read and written one time on
> > > > > > > +the source and the destination hypervisor respectively once
> > > > > > > +the device is in \field{Freeze} mode. On the destination
> > > > > > > +hypervisor, after writing the device context, when the device
> > > > > > > +mode set to \field{Active}, the device uses the most recently
> > > > > > > +set device context and resumes the device
> > > > > > operation.
> > > > > >
> > > > > > There's no context sequence, so this is obvious. It's the
> > > > > > semantic of all other existing interfaces.
> > > > > >
> > > > > Can you please what which existing interfaces do you mean here?
> > > >
> > > > For any common cfg member. E.g queue_addr.
> > > >
> > > > The driver wrote 100 different values to queue_addr and the device
> > > > used the value written last time.
> > > >
> > > o.k. I don’t see any problem in stating what is done, which is less
> > > vague. 😊
> > >
> > > > >
> > > > > > > +
> > > > > > > +In an alternative flow, on the source hypervisor the owner
> > > > > > > +driver may choose to read the device context first time while
> > > > > > > +the device is in \field{Active} mode and second time once the
> > > > > > > +device is in \field{Freeze}
> > > > > > mode.
> > > > > >
> > > > > > Who is going to synchronize the device context with possible
> > > > > > configuration from the driver?
> > > > > >
> > > > > Not sure I understand the question.
> > > > > If I understand you right, do you mean that, When configuration
> > > > > change is done by the guest driver, how does device context change?
> > > > >
> > > >
> > > > Yes.
> > > >
> > > > > If so, device context reading will reflect the new configuration.
> > > >
> > > > How do you do that? For example:
> > > >
> > > > static inline void vp_iowrite64_twopart(u64 val,
> > > >                                         __le32 __iomem *lo,
> > > >                                         __le32 __iomem *hi)
> > > > {
> > > >         vp_iowrite32((u32)val, lo);
> > > >         vp_iowrite32(val >> 32, hi);
> > > > }
> > > >
> > > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > > >
> > > Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> > section captures the partial value.
> >
> > There's no way for the device to know whether or not it's a partial value or not.
> > No?
> >
> Device does not need to know, because when the guest vm and the device is resumed on the destination, it the guest vm will continue with writing the 2nd part.
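
The claim in the exchange above can be sketched as follows. This is a simplified model, not code from the series: struct cfg_reg64, write_lo, write_hi and migrate_reg are hypothetical names, and the "device context" is reduced to a plain copy of the two 32-bit halves. Under those assumptions, a freeze between the two 32-bit writes migrates the half-written register, and the guest completes the value on the destination by issuing the second write.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit register exposed as two 32-bit halves. */
struct cfg_reg64 {
    uint32_t lo;
    uint32_t hi;
};

/* First half of a two-part write: low 32 bits. */
static void write_lo(struct cfg_reg64 *r, uint64_t val)
{
    r->lo = (uint32_t)val;
}

/* Second half of a two-part write: high 32 bits. */
static void write_hi(struct cfg_reg64 *r, uint64_t val)
{
    r->hi = (uint32_t)(val >> 32);
}

/* Value the device would use once both halves have been written. */
static uint64_t reg_value(const struct cfg_reg64 *r)
{
    return ((uint64_t)r->hi << 32) | r->lo;
}

/*
 * Stand-in for reading the device context on the source and writing it
 * on the destination: both halves are copied exactly as they are, with
 * no knowledge of whether a two-part write is in progress.
 */
static struct cfg_reg64 migrate_reg(const struct cfg_reg64 *src)
{
    return *src;
}
```

In this model the device never needs to know that the value is partial; correctness depends on the guest resuming at the instruction after the first write, which is the assumption being questioned in the thread.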
>
> > >
> > > > >
> > > > > > > Similarly, on the
> > > > > > > +destination hypervisor writes the device context first time
> > > > > > > +while the device is still running in \field{Active} mode on
> > > > > > > +the source hypervisor and writes the device context second
> > > > > > > +time while the device is in
> > > > > > \field{Freeze} mode.
> > > > > > > +This flow may result in very short setup time as the device
> > > > > > > +context likely have minimal changes from the previously
> > > > > > > +written device
> > > > context.
> > > > > >
> > > > > > Is the hypervisor who is in charge of doing the comparison and
> > > > > > writing only the delta?
> > > > > >
> > > > > The spec commands allow to do so. So possibility exists from spec wise.
> > > >
> > > > There are various optimizations for migration for sure, I don't
> > > > think mentioning any specific one is good.
> > > >
> > > The text is informative text similar to,
> > >
> > > " However, some devices benefit from the ability to find out the
> > > amount of available data in the queue without accessing the virtqueue in
> > memory"
> > >
> > > " To help with these optimizations, when VIRTIO_F_NOTIFICATION_DATA has
> > been negotiated".
> > >
> > > Is this the only optimization in virtio? No, but we still mention the rationale of
> > why it exists.
> >
> > The above is a good example as it explain VIRTIO_F_NOTIFICATION_DATA is the
> > only way without accessing the virtqueue. But this is not the case of migration.
> > You said it's just a possibility but not a must which is not the case for
> > VIRTIO_F_NOTIFICATION_DATA.
> >
> It is one of the optimization apart. The comparison is of one_of_example or not.

I don't get this.

Thanks

>



This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-12 10:58                     ` Parav Pandit
  2023-10-12 11:17                       ` Michael S. Tsirkin
@ 2023-10-13  1:16                       ` Jason Wang
  2023-10-13  6:36                         ` Parav Pandit
  2023-10-13 11:26                         ` Michael S. Tsirkin
  2023-10-13  9:06                       ` Zhu, Lingshan
  2 siblings, 2 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-13  1:16 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Oct 12, 2023 at 6:58 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Thursday, October 12, 2023 3:51 PM
> >
> > On 10/11/2023 7:43 PM, Parav Pandit wrote:
> > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >> Sent: Wednesday, October 11, 2023 3:55 PM
> > >>>>>>> I don’t have any strong opinion to keep it or remove it as most
> > >>>>>>> stakeholders
> > >>>>>> has the clear view of requirements now.
> > >>>>>>> Let me know.
> > >>>>>> So some people use VFs with VFIO. Hence the module name.  This
> > >>>>>> sentence by itself seems to have zero value for the spec. Just drop it.
> > >>>>> Ok. Will drop.
> > >>>> So why not build your admin vq live migration on our config space
> > >>>> solution, get out of the troubles, to make your life easier?
> > >>>>
> > >>> Your this question is completely unrelated to this reply or you
> > >>> misunderstood
> > >> what dropping commit log means.
> > >> if you can rebase admin vq LM on our basic facilities, I think you
> > >> dont need to talk about vfio in the first place, so I ask you to re-consider
> > Jason's proposal.
> > > I don’t really know why you are upset with the vfio term.
> > > It is the use case of the cloud operator and it is listed to indicate how proposal
> > fits in a such use case.
> > > If for some reason, you don’t like vfio, fine. Ignore it and move on.
> > >
> > > I already answered that I will remove from the commit log, because the
> > requirements are well understood now by the committee.
> > >
> > > Your comment is again unrelated (repeated) to your past two questions.
> > >
> > > I explained you the technical problem that admin command (not admin VQ)
> > of basic facilities cannot be done using config registers without any mediation
> > layer.
> > OK, I pop-ed Jason's proposal to make everything easier, and I see it is refused.
> Because it does not work for passthrough mode.

How and why? What's wrong with just passing through the newly
introduced 2 or 3 registers to guests?

This is the question you never answer even though I keep asking.

And again, "passthrough" is really confusing: PCI functions can be
passed through, but virtio can only be passed through with a lot of
assumptions, all of which are missing from your series.

>
> > >
> > >>> Dropping link to vfio does not drop the requirement.
> > >>> I am ok to drop because requirements are clear of passthrough of
> > >>> member
> > >> device.
> > >>> Vfio is not a trouble at all.
> > >>> Admin command is not a trouble either.
> > >>>
> > >>> The pure technical reason is: all the functionalities proposed
> > >>> cannot be done
> > >> in any other existing way.
> > >>> Why? For below reasons.
> > >>> 1. device context, and write records (aka dirty page addresses) is
> > >>> huge which cannot be shared using config registers at scale of 4000
> > >>> member devices
> > >> dirty page tracking will be implmemented in V2, actually I have the
> > >> patch right now.
> > > That is yet again the invitation to non_colloboration mode.
> > > Without reviewing, v0 and v1, you want to show dirty page tracking in some
> > other way.
> > >
> > > But ok, that is your non_coperative mode of working. Cannot help further.
> > I believe both me and Jason have proposed a solution, I see it is rejected.
> > But don't take it personal and please keep professional.
> Sure, as I explained the config register method do not work for passthrough mode, and does not scale.

We need to make sure your migration proposal can work for 1 VF, which
is still questionable; then we can talk about other things like scaling. No?

And most of your concern regarding scalability seems more like a
limitation of a transport. Let's not mix the scalability for a
specific transport with the one for core virtio devices.

>
> > >
> > >> inflight descriptor tracking will be implemented by Eugenio in V2.
> > > When we have near complete proposal from two device vendors, you want
> > > to push something to unknown future without reviewing the work; does not
> > make sense.
> > Didn't I ever provide feedback to you? Really?
> No. I didn’t see why you need to post a new patch for dirty page tracking, when it is already present in this series.

You know there are various ways to do dirty page tracking? For example, the
well-known bitmap and its variants. I think we've discussed this several
times in many places in the past. I don't see where you explain why
you chose one of them and not the others, yet you want to forbid other
types of dirty page logging? Why?
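
For readers unfamiliar with the bitmap approach mentioned above, a minimal sketch in C follows. It is not taken from any proposal in this thread: the page size, the bitmap size, and the names (struct dirty_bitmap, mark_dirty, sync_and_clear) are assumptions chosen for the example. One bit represents one guest page; the device side sets bits when it writes, and the hypervisor periodically copies the bitmap out and clears it.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12          /* assumed 4 KiB pages */
#define NPAGES     1024        /* assumed size of the tracked range */
#define NWORDS     (NPAGES / 64)

/* One bit per page: bit set means the page was written since the last sync. */
struct dirty_bitmap {
    uint64_t words[NWORDS];
};

/* Mark every page touched by a write of len bytes at addr as dirty. */
static void mark_dirty(struct dirty_bitmap *b, uint64_t addr, uint64_t len)
{
    if (len == 0)
        return;

    uint64_t first = addr >> PAGE_SHIFT;
    uint64_t last = (addr + len - 1) >> PAGE_SHIFT;

    /* Pages beyond the tracked range are ignored in this sketch. */
    for (uint64_t p = first; p <= last && p < NPAGES; p++)
        b->words[p / 64] |= 1ull << (p % 64);
}

/*
 * Copy the bitmap to out (NWORDS entries), clear it, and return the
 * number of dirty pages found. Each call is one round of syncing.
 */
static unsigned int sync_and_clear(struct dirty_bitmap *b, uint64_t *out)
{
    unsigned int dirty = 0;

    for (size_t w = 0; w < NWORDS; w++) {
        uint64_t bits = b->words[w];

        out[w] = bits;
        b->words[w] = 0;
        for (; bits; bits &= bits - 1) /* count set bits */
            dirty++;
    }
    return dirty;
}
```

A write that straddles a page boundary marks both pages, and a page written many times between two syncs is still reported once, which is the main trade-off against logging individual write records.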

> I would like to understand and review this aspects.
> Same for the device context.
>
> > >
> > > You are still in the mode of _take_ what we did with near zero explanation.
> > > You asked question of why passthrough proposal cannot advantage of in_band
> > config registers.
> > > I explained technical reason listed here.
> > I have answered the questions, and asked questions for many times.
> > What do you mean by "why passthrough proposal cannot advantage of in_band
> > config registers."?
> > Config space work for passthrough for sure.
> Config space registers are passthrough the guest VM.
> Hence hypervisor messing it with, programming some address would result in either security issue.
> Or functionally broken, to sustain the functionality, each nested layer needs one copy of these registers for each nest level.
> So they must be trapped somehow.
>
> Secondly I don’t see how one can read 1M flows using config registers.

Why can't we trap them? vIOMMU even migrates internal translation tables.

Thanks


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 11:37                       ` Parav Pandit
  2023-10-12 13:03                         ` Michael S. Tsirkin
@ 2023-10-13  1:18                         ` Jason Wang
  2023-10-13  6:40                           ` Parav Pandit
  2023-10-13  9:44                         ` Zhu, Lingshan
  2 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-13  1:18 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Oct 12, 2023 at 7:37 PM Parav Pandit <parav@nvidia.com> wrote:
>
> As Michael said, software based nesting is used.

I've pointed out in another thread that when the hardware provides fewer
abstraction levels than the nesting requires, trap/emulation is a must.

> See if actual hw based devices can implement it or not. Many components of cpu cannot do N level nesting either, but may be virtio can.
> I don’t know how yet.

I would not repeat the lessons given by Gerald J. Popek and Robert P.
Goldberg[1] in 1974, but I think you are missing a lot of fundamental things
in the methodology of virtualization. For example, nesting is a very
important criterion for examining whether an architecture is well designed
for virtualization.

That is to say, for any CPU/hypervisor vendor, the architecture should
be designed to run any level of nesting instead of just an awkward 2
levels (but what you proposed cannot work even for 2). For x86 and
KVM, arbitrary levels of nesting have been supported for about 10 years.

Virtio can do any level. So can vhost/vDPA. For example, I
usually develop and test virtio/vDPA/vhost in a nested environment.

Thanks

[1] https://dl.acm.org/doi/pdf/10.1145/361011.361073


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13  1:15               ` Jason Wang
@ 2023-10-13  6:36                 ` Parav Pandit
  2023-10-17  1:41                   ` Jason Wang
  2023-10-13 11:41                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-13  6:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan


> From: Jason Wang <jasowang@redhat.com>
> Sent: Friday, October 13, 2023 6:46 AM

[..]
> > > > > It's still not clear to me how this is done.
> > > > >
> > > > > 1) guest starts FLR
> > > > > 2) adminq freeze the VF
> > > > > 3) FLR is done
> > > > >
> > > > > If the freezing doesn't wait for the FLR, does it mean we need
> > > > > to migrate to a state like FLR is pending? If yes, do we need to
> > > > > migrate the other sub states like this? If not, why?
> > > > >
> > > > In most practical cases #2 followed by #1 should not happen as on
> > > > the source
> > > side the expected is mode change to stop from active.
> > >
> > > How does the hypervisor know if a guest is doing what without trapping?
> > >
> > Hypervisor does not know. The device knows being the recipient of #1 and #2.
> 
> We are discussing the possibility in software/driver side isn't it?
> 
> 1) is initiated from the guest
> 2) is initiated from the hypervisor
> 
> Both are softwares, and you're saying 2) should not happen after 1) since the
> device knows what is being done by guests? How can devices control software
> behaviour?
> 
The device does not control software behavior.
That is, either the hypervisor can initiate a device mode change to stop (not freeze), or the guest can initiate FLR.
The device knows which was initiated first, being the single recipient of both.
Therefore, the device responds accordingly.
For example, in the sequence you described,
the device will delay the mode change command response until the FLR is completed.
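
A rough sketch of that ordering in C, for illustration only. The names (struct member_dev, guest_start_flr, admin_set_mode, flr_done) are hypothetical and are not commands or structures from the spec or this series; the sketch only models the rule that a mode change arriving while an FLR is in progress is held back and completes when the FLR does.

```c
#include <stdbool.h>

enum dev_mode { MODE_ACTIVE, MODE_STOP, MODE_FREEZE };

/* Hypothetical device-side state for one member device. */
struct member_dev {
    enum dev_mode mode;
    bool flr_in_progress;       /* guest-initiated FLR not yet finished */
    bool mode_cmd_pending;      /* a mode change is waiting for the FLR */
    enum dev_mode pending_mode;
};

/* Guest initiates FLR on the member device. */
static void guest_start_flr(struct member_dev *d)
{
    d->flr_in_progress = true;
}

/*
 * Owner driver issues a mode change command.
 * Returns true if the command completes now, false if its completion
 * is deferred until the ongoing FLR finishes.
 */
static bool admin_set_mode(struct member_dev *d, enum dev_mode m)
{
    if (d->flr_in_progress) {
        d->mode_cmd_pending = true;
        d->pending_mode = m;
        return false;
    }
    d->mode = m;
    return true;
}

/*
 * Device finishes the FLR.
 * Returns true if a deferred mode change command completed as a result.
 */
static bool flr_done(struct member_dev *d)
{
    d->flr_in_progress = false;
    if (!d->mode_cmd_pending)
        return false;
    d->mode = d->pending_mode;
    d->mode_cmd_pending = false;
    return true;
}
```

In this model the owner driver never observes the new mode before the FLR has finished, so a device context read issued after the command completes would see post-FLR state.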


> This only possible thing is to make sure 3) is done before 2) That is what I'm
> asking but you are saying freeze doesn't need to wait for FLR...
> 
I think I responded further down in the previous email about the synchronization point being the fw.
I meant to say that software does not need to wait before initiating the freeze mode command.
The command will simply complete at the right time.

This is a rare corner case anyway.
On the source hypervisor, as written in the theory of operation, the sequence is active->stop->freeze.
When the mode is changed to stop, the vCPUs are already suspended.

I agree that FLR may have been initiated and the driver may now be waiting for 100 msec.

So yes, the device, as the single entity, synchronizes it.

> >
> > > > But ok, since we active to freeze mode change is allowed, lets discuss
> above.
> > > >
> > > > A device is the single synchronization point for any device reset,
> > > > FLR or admin
> > > command operation.
> > >
> > > So you agree we need synchronization? And I'm not sure I get the
> > > meaning of synchronization point, do you mean the synchronization
> > > between freeze/stop and virtio facilities?
> > >
> > Synchronization means, handling two events in parallel such as FLR and other.
> 
> Great. So we have a perfect race:
> 
> 1) guest initiates FLR
> 2) device start FLR
> 3) hypervisor stop and freeze the device
> 4) device is freeze
> 5) hypervisor read device context A
> 6) migrate device contextA
> 8) migration is done
> 9) FLR is done
> 10) hypervisor read device context B
> 
> So we end up with inconsistent device context, no? Dest want B or A+B, but you
> give A.
> 
Since #1 and #2 are done before #3, the device knows to finish the FLR, hence #9 is completed before #4.

Alternatively, in the above sequence, when the destination sees #10, it can immediately finish the FLR, treating it as a no-op, since the destination device is not under FLR.

Both ways of handling it are fine (and rare in practice, but yes, it's possible).

I will write both options in the device requirements.

> >
> > > > So, the migration driver do not need to wait for FLR to complete.
> > >
> > > I'm confused, you said below that device context could be changed by FLR.
> > >
> > Yes.
> > > If FLR needs to clear device context, we can have a race where
> > > device context is cleared when we are trying to read it?
> > >
> > I didn’t say clear the context.
> > FLR updates the device context.
> 
> In what sense?
> 
Indicating a new device context and discarding the old one.
I am glad you asked this. I wanted to get the basic part captured before adding this optimization.
Probably it is good to add it now in the v2 as we crossed this stage now.

> > Device is serving the device context read write commands, serving FLR,
> > answering mode change command, So device knows the best how to avoid
> any race.
> 
> You want to leave those details for the vendor to figure out? If devices know
> everything, why do we need device normative?
> 
The device knows its implementation.
Implementation guidelines belong in the normative section.
I will add them there.

> I see issues at least for FLR, I'm pretty sure they are others. If a design requires
> us to audit all the possible conflicts between virtio facilities and transport. It's a
> strong hint of layer violation and when it happens it for sure may hit a lot of
> problems that are very hard to find or debug thus we should drop such a design.
> I suggest using the RFC tag since the next version (if there is one) as I see it is
> immature in many ways.
> 
The technical committee audits the required touch points, like the other industry committees I have participated in.
I disagree with your point above.
If you do not want to review, that is fine.
We are reviewing with other members, who are also contributing.

> What's more, solving races is much easier if the device functionality is self
> contained. For example, for a self contained device with the transport as the
> single interface, we can leverage from transport
> (PCI) for dealing with races, arbitration, ordering, QOS etc which is probably
> required in the internal channel between the owner and the member. But all of
> these were missed in your series and even if you can I'm not sure it's
> worthwhile to reinvent all of them.
> 
In the end, there is one physical device serving the owner and member devices.
So a claim that things are on the VF and hence you magically get a 200% QoS guarantee is a myth.

Quoting "all of these" is also incorrect.

Things are added gradually: first the functionality with reasonable performance, followed by the notion of, and extensions for, QoS.
By definition of the PCI transport for SR-IOV, there is an internal channel.

It is a reasonably good proposal in its current form.
The few race conditions that you highlight are extremely rare in nature.
Suggestions to improve it are welcome.
There were a couple of them from Michael too; I am addressing them in v2.

> For example, for the architecture like owner/member, if the virtio or transport
> facility could be controlled via device internal channels besides the transport,
> such a channel may complicate the synchronization a lot. 
Two vendors who actually make the hw SR-IOV devices are authoring these, and others are also reviewing.
So I am more confident that it is solid enough.
Also, a similar design for another device has been available as GPL code, integrated with QEMU and the upstream kernel, for more than a year.

> The device needs to
> be able to handle or synchronize requests from both PCI and owner in parallel.
> They are just too many possible races and most of my questions so far come
> from this viewpoint. I wouldn't go further for other stuff since I believe I've
> spotted sufficient issues and that's why I must stop at this patch before looking
> at the rest.
It is your call to stop or progress.
I find your reviews useful to improve this proposal, so I will fix them.

> 
> Admin commands are fine if it does real administrative jobs such as provisioning
> since such work is beyond the core virtio functionality.
> 
> Again, the goal of virtio spec is to have a device with sufficient guidelines that is
> easy to implement but not leave the vendors to waste their engineering
> resources in figuring or fuzzing the corner cases.
I have not seen an industry standard spec or a piece of software that does not have corner cases.
The spec proposal is from more than one device vendor.

I will focus on the more practical aspects to progress and improve this spec.
> 
> >
> > > > When admin cmd freeze the VF it can expect FLR_completed VF.
> > >
> > > We need to explain why and how about the resume? For example, is
> > > resuming required to wait for the completion of FLR, if not, why?
> 
> This question is ignored.
> 
I probably missed it. Sorry about that.
No, the driver does not need to wait for FLR to finish before issuing the resume command, as this is typically done on the destination member device, which should not be under FLR.
I will write up the requirements further.

> > > In another thread you are saying that the PCI composition is done by
> > > hypervisor, so passthrough is really confusing at least for me.
> > >
> > I explained there what vPCI composition is done there.
> > PCI config space and msix side of composition is done.
> > The whole virtio interface is not composed.
> 
> You need to describe this somewhere, no? That's what I'm saying.
> 
Mostly not. What is not done is not written.

> And passthrough is misleading here.
> 
Passthrough is mentioned in theory of operation.
It is not present in requirements section.
So, it is fine.

> >
> > > > Ok. I assume "reset flow" is clear to you now that it points to section 2.4.
> > > > This section is not normative section, so using an extra word like
> > > > "flow" does
> > > not confuse anyone.
> > > > I will link to the section anyway.
> > >
> > > Probably, but you mention FLR flow as well.
> > As I said, not repeating the PCIe spec here. The reader knows what FLR of the
> PCIe transport.
> 
> Ok, I'm not a native speaker, but I really don't know the difference between
> "FLR" and "FLR flow".
> 
Let's keep it simple. I will write it as FLR, as the PCI transport has it as FLR.

> >
> > >
> > > >
> > > > > >
> > > > > > > > and may also undergo PCI function level
> > > > > > > > +reset(FLR) flow.
> > > > > > >
> > > > > > > Why is only FLR special here? I've asked FRS but you ignore the
> question.
> > > > > > >
> > > > > > FLR is special to bring clarity that guest owns the VF doing
> > > > > > FLR, hence
> > > > > hypervisor cannot mediate any registers of the VF.
> > > > >
> > > > > It's not about mediation at all, it's about how the device can
> > > > > implement what you want here correctly.
> > > > >
> > > > > See my above question.
> > > > >
> > > > Ok. it is clear that live migration commands cannot stay on the
> > > > member device
> > > because the member device can undergo device reset and FLR flows
> > > owned by the guest.
> > >
> > > I disagree, hypervisors can emulate FLR and never send FLR to real devices.
> > >
> > That would be some other trap alternative that needs to dissect the device
> and build infrastructure for such dissection is not desired in the listed use case.
> 
> Do you need to trap FLR or not? You're saying the hypervisor is in charge of
> vPCI, how is this differ to what you proposed? If not, how can vPCI be
> composed?
> 
The live migration driver does not need to trap FLR.

> I believe you need to document how vpci is supposed to be done, since I believe
> your proposal can only work with such specific types of PCI composition. This is
> one of the important things that is missed in this series.
> 
I don’t see a need to describe vPCI composition, as there may be more than one way to do it.
What I think is worth describing is that the whole PCI device is not stored in the device context.
I will try to add a short description around it.

> >
> > So your disagreement is fine for non-passthrough devices.
> >
> > > > (and hypervisor is not involved in these two flows, hence the
> > > > admin command
> > > interface is designed such that it can fullfil above requirements).
> > > >
> > > > Theory of operation brings out this clarity. Please notice that it
> > > > is in
> > > introductory section with an example.
> > > > Not normative line.
> > > >
> > > > > >
> > > > > > > > Such flows must comply to the PCI standard and also
> > > > > > > > +virtio specification;
> > > > > > >
> > > > > > > This seems unnecessary and obvious as it applies to all
> > > > > > > other PCI and virtio functionality.
> > > > > > >
> > > > > > Great. But your comment is contradicts.
> > > > > >
> > > > > > > What's more, for the things that need to be synchronized, I
> > > > > > > don't see any descriptions in this patch. And if it doesn't need, why?
> > > > > > With which operation should it be synchronized and why?
> > > > > > Can you please be specific?
> > > > >
> > > > > See my above question regarding FLR. And it may have others
> > > > > which I haven't had time to audit.
> > > > >
> > > > Ok. when you get chance to audit, lets discuss that time.
> > >
> > > Well, I'm not the author of this series, it should be your job
> > > otherwise it would be too late.
> > >
> > As author, what we think, I will cover. If you have specific points to add value,
> please share, I will look into it.
> 
> I've pointed out sufficient issues. I have a lot of others but I don't want to have a
> giant thread once again.
> 
I see following things to improve in the requirements which I will do in v2.

1. Document race around FLR and admin commands for really rare corner case.
2. Some text around not migrating the pci device registers
3. Interaction with PM commands

> >
> > > For example, how is the power management interaction with the
> freeze/stop?
> > >
> > Power management is owned by the guest, like any other virtio interface.
> > So freeze/stop do not interfere with it.
> 
> I don't think this is a good answer. I'm asking how the PM interacts with
> freeze/stop, you answer it works well.

> 
> I'm not obliged to design hardware for you but figuring out the bad design for
> virtio. I'm not convinced with a proposal that misses a lot of obvious critical
> cases and for sure it's not my job to solve them.
> 
I am not asking you to solve them.

> I've demonstrated the possible races with FLR. So did the PM. For example, if VF
> is in D3cold state, can we still read its device context?
I think yes, but I will double check.
> If yes, is it a violation of the PCIE spec? If not, why?
No, because the device context is owned by the owner device and not the VF. The SR-PCIM interface is defined to be outside the scope of the PCIe spec.

> How about other states? Can the device be freezed
> in the middle of PM state transitions? If yes, how can it work without migrating
> PCI states?
I will double check, but it is unlikely; it should be handled like the FLR case, so that the device does not have to treat it differently.

> Well, I meant we need a more precise definition of each state otherwise it
> could be ambiguous (as I pointed above).
OK, I will add a few things about reads and other messages.

> 
> >
> > > >
> > > > > >
> > > > > > In "stop" mode, the device wont process descriptors.
> > > > >
> > > > > If the device won't process descriptors, why still allow it to
> > > > > receive
> > > notifications?
> > > > Because notification may still arrive and if the device may update
> > > > any counters as part of
> > >
> > > Which counters did you mean here?
> > >
> > The counter that Xuan is adding and any other state that device may have to
> update as result of driver notification.
> > For example caching the posted avail index in the notification.
> 
> A link to those proposals? 
[1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00048.html

> If the device must depend on those cached features to
> work it's really fragile. If not, we don't need to care about them.
It is not dependent.
It is the infrastructure to enable it.
Same for other shared memory region accesses.

> 
> >
> > > > it which needs to be migrated or store the received notification.
> > > >
> > > > > Or does it really matter if the device can receive or not here?
> > > > >
> > > > From device point of view, the device is given the chance to
> > > > update its device
> > > context as part of notifications or access to it.
> > >
> > > This is in conflict with what you said above " Device cannot process
> > > the queue ..."
> > >
> > No, it does not.
> > Device context is updated within the device without accessing the queue
> memory of the guest.
> 
> This is not documented or explained anywhere?
> 
Why should it be explained?
The device does not access the guest memory; this is already mentioned in the stop mode description.
Hence, there is no need to write the above.

> >
> > > Maybe you can give a concrete example.
> > >
> > The above one.
> >
> > > >
> > > > > >
> > > > > > > > + the member device context
> > > > > > >
> > > > > > > I don't think we define "device context" anywhere.
> > > > > > >
> > > > > > It is defined further in the description.
> > > > >
> > > > > Like this?
> > > > >
> > > > > """
> > > > >  +The member device has a device context which the owner driver
> > > > > can
> > > > > +either read or write. The member device context consist of any
> > > > > device  +specific data which is needed by the device to resume
> > > > > its operation  +when the device mode """
> > > > >
> > > > Yes.
> > > > Further patch-3 adds the device context and also add the link to
> > > > it in the
> > > theory of operation section so reader can read more detail about it.
> > > >
> > > > > "Any" is probably too hard for vendors to implement. And in
> > > > > patch 3 I only see virtio device context. Does this mean we
> > > > > don't need transport
> > > > > (PCI) context at all? If yes, how can it work?
> > > > >
> > > > Right. PCI member device is present at source and destination with
> > > > its layout,
> > > only the virtio device context is transferred.
> > > > Which part cannot work?
> > >
> > > It is explained in another thread where you are saying the PCI
> > > requires mediation. I think any author should not ignore such
> > > important assumptions in both the change log and the patch.
> > >
> > > And again, the more I review the more I see how narrow this series can be
> used:
> > >
> > I explained this before and also covered in the cover letter.
> >
> > > 1) Only works for SR-IOV member device like VF
> > It can be extended to SIOV member device in future.
> > Today these are the only type of member device virtio has.
> 
> That is exactly what I want to say, it can only work for the owner/member
> model. It can't work when the virtio device is not structured like that. And you
> missed that most of the existing virtio devices are not implemented in this
> model. It means they can't be migrated with a pure virtio specific extension. For
> you, SR-IOV is all but this is not true for virtio. PCI is not the only transport and
> SR-IOV is not the only architecture in PCI.
> 
Each transport will have its own way to handle it.
When an owner-member relationship arises for MMIO, one will be able to do so as well.
In fact, other transports will likely miss out, as they have not kept up such a pace.

> And I'm pretty sure the owner/member is not the only requirement, there are a
> lot of other assumptions which are missed in this series.
> 
One proposal does not do everything.
It is just impractical.

> >
> > > 2) Mediate PCI but not virtio which is tricky
> > > 3) Can only work for a specific BAR/capability register layout
> > >
> > > Only 1) is described in the change log.
> > >
> > > The other important assumptions like 2) and 3) are not documented
> anywhere.
> > > And this patch never explains why 2) and 3) is needed or why it can
> > > be used for subsystems other than VFIO/Linux.
> > >
> > Since I am not mentioning vfio now, I will refrain from mentioning
> > others as well. :)
> 
> It's not about VFIO at all. It's about to let people know under which case this
> proposal could work. Otherwise if a vendor develops a BAR/cap which is not at
> page boundary. How could you make it work with your proposal here?
> 
The vendor is a cloud operator which is building the device, so it will always work as long as it has matching capabilities on source and destination.

> >
> > > >
> > > > > >
> > > > > > > >and device configuration space may change. \\
> > > > > > > > +\hline
> > > > > > >
> > > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > > >
> > > > > > All pci devices which belong to a single guest VM are not
> > > > > > stopped
> > > atomically.
> > > > > > Hence, one device which is in freeze mode, may still receive
> > > > > > driver notifications from other pci device,
> > > > >
> > > > > Device may choose to ignore those notifications, no?
> > > > >
> > > > > > or it may experience a read from the shared memory and get
> > > > > > garbage
> > > data.
> > > > >
> > > > > Could you give me an example for this?
> > > > >
> > > > Section 2.10 Shared Memory Regions.
> > >
> > > How can it experience a read in this case?
> > >
> > MMIO read/write can be initiated by the peer device while the device is in
> stopped state.
> 
> Ok, but what I want to say is how it can get the garbage data here?
> 
If the device mode is changed to freeze while it is being read by the peer device, the peer can get garbage data or the last data,
which may not be what is expected.
So first all the initiator devices are stopped, to ensure that they do not make any requests.

And if there are requests, they get a proper answer.
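
The ordering described above (stop every initiator first, freeze only afterwards) can be sketched roughly as below. This is only an illustration: struct member_dev, stop_all() and freeze_one() are hypothetical names, not part of the patch, and the numeric values for Active and Stop are assumed (only Freeze = 0x2 appears in the quoted table).

```c
#include <stdbool.h>
#include <stddef.h>

/* Freeze = 0x2 is taken from the quoted table; the other values are assumed. */
enum dev_mode {
    DEV_MODE_ACTIVE = 0x0, /* assumed value */
    DEV_MODE_STOP   = 0x1, /* assumed value */
    DEV_MODE_FREEZE = 0x2,
};

/* Hypothetical stand-in for one passthrough member device of a VM. */
struct member_dev {
    enum dev_mode mode;
};

/* True only when no device of the VM can still initiate requests to a peer. */
static bool all_stopped(const struct member_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].mode == DEV_MODE_ACTIVE)
            return false;
    return true;
}

/* Phase 1: move every active device of the VM to Stop. */
static void stop_all(struct member_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].mode == DEV_MODE_ACTIVE)
            devs[i].mode = DEV_MODE_STOP;
}

/* Phase 2: freeze one device, refused while any peer is still active. */
static bool freeze_one(struct member_dev *devs, size_t n, size_t idx)
{
    if (idx >= n || !all_stopped(devs, n))
        return false;
    devs[idx].mode = DEV_MODE_FREEZE;
    return true;
}
```

With two devices, freeze_one() is refused until stop_all() has run on the whole set, which is the reason for having the intermediate stop mode.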

> >
> > > Btw, shared regions are tricky for hardware.
> > >
> > > >
> > > > > > And things can break.
> > > > > > Hence the stop mode, ensures that all the devices get enough
> > > > > > chance to stop
> > > > > themselves, and later when freezed, to not change anything internally.
> > > > > >
> > > > > > > > +0x2   & Freeze &
> > > > > > > > + In this mode, the member device does not accept any
> > > > > > > > +driver notifications,
> > > > > > >
> > > > > > > This is too vague. Is the device allowed to be freezed in
> > > > > > > the middle of any virtio or PCI operations?
> > > > > > >
> > > > > > > For example, in the middle of feature negotiation etc. It
> > > > > > > may cause implementation specific sub-states which can't be
> migrated easily.
> > > > > > >
> > > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > > It is passthrough device, hence hypervisor layer do not get to
> > > > > > see sub-
> > > state.
> > > > > >
> > > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > > The device context already covers this sub-state.
> > > > >
> > > > > 1) driver writes driver_features
> > > > > > 2) driver sets FEATURES_OK
> > > > >
> > > > > 3) device receive driver_features
> > > > > 4) device validating driver_features
> > > > > 5) device clears FEATURES_OK
> > > > >
> > > > > > 6) driver reads status and realizes FEATURES_OK is being cleared
> > > > > >
> > > > > > Is it valid to be frozen in the middle of the above?
> > > > No. device mode is frozen when hypervisor is sure that no more
> > > > access by the
> > > guest will be done.
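
For reference, the six steps listed above can be modelled roughly as follows. This is a minimal sketch, not device or driver code: struct dev_model and the three helpers are hypothetical, and only the FEATURES_OK status bit value (8) comes from the virtio spec.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_CONFIG_S_FEATURES_OK 8 /* device status bit from the virtio spec */

/* Hypothetical model of the state touched during feature negotiation. */
struct dev_model {
    uint64_t device_features; /* what the device offers */
    uint64_t driver_features; /* what the driver wrote */
    uint8_t  status;          /* device status byte */
};

/* Steps 1-2: driver writes driver_features and sets FEATURES_OK. */
static void driver_negotiate(struct dev_model *d, uint64_t features)
{
    d->driver_features = features;
    d->status |= VIRTIO_CONFIG_S_FEATURES_OK;
}

/* Steps 3-5: device validates; clears FEATURES_OK on an unsupported feature. */
static void device_validate(struct dev_model *d)
{
    if (d->driver_features & ~d->device_features)
        d->status &= (uint8_t)~VIRTIO_CONFIG_S_FEATURES_OK;
}

/* Step 6: driver re-reads the status to learn the outcome. */
static bool driver_features_accepted(const struct dev_model *d)
{
    return (d->status & VIRTIO_CONFIG_S_FEATURES_OK) != 0;
}
```

A snapshot of struct dev_model taken between driver_negotiate() and device_validate() is the intermediate sub-state being discussed: FEATURES_OK is set but validation has not yet run.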
> > >
> > > How, you don't trap so 1) and 2) are posted, how can hypervisor know
> > > if there's inflight transactions to any registers?
> > >
> > Because hypervisor has stopped the vcpus which are issuing them.
> 
> MMIO are posted. vCPU is stopped but the transactions are inflight.
> How could the hypervisor/device know if there's any inflight PCIE transactions
> here? So I can imagine what happens in fact is the TLP for freezing is ordered
> with the TLP for posted MMIO. This is probably guaranteed for typical PCIE
> setup but how about the relaxed ordering?

Vcpus do not generate relaxed ordering MMIOs.
In the PCI spec: "If this bit is Set, the Function is permitted to set the Relaxed Ordering bit in
the Attributes field of transactions it initiates".

The function initiates RO requests, not the vcpu.
Hence, it is fine.

> >
> > > > What can happen between #2 and #3, is device mode may change to stop.
> > >
> > > Why can't be freezed in this case? It's really hard to deduce why it
> > > can't just from your above descriptions.
> > >
> > On the source hypervisor, the mode changes are active->stop->freeze.
> > Hence when freeze is done, the hypervisor knows that all inflight has been
> stopped by now.
> 
> Ok, but how about freezing between 3) and 4). If we allow it, do we need to
> migrate to this state? If yes, how can it work with your device context? If not,
> shouldn't we document this?
> 
Maybe; some of these are implementation details. I am not sure they belong in the spec.
For example, an RSS update while packets are being received: such implementation details are not part of the spec.

> >
> > > Even if it had, is it even possible to list all the places where
> > > freezing is prohibited? We don't want to end up with a spec that is
> > > hard to implement or leave the vendor to figure out those tricky parts.
> > >
> > The general idea is not prohibiting the freeze/stop mode.
> > If the device needs more time, let device take time to do it.
> 
> Ok, it means:
> 
> 1) there're conditions from stop to freeze, then what are they?
No, there is no condition.
Maybe I didn’t follow the question.
> 2) how much time at most? E.g FLR takes at most 100ms.
From the driver side it is 100 msec; on the device side it can be less.
As soon as the FLR is done, or enough of it is done to record it, stop can continue.

> 3) If it needs more time, can this time satisfy the downtime requirement?
> 
For all practical purposes the guest VM is not busy doing FLR; it is a corner case, yet we have to cover it.
And yes, it satisfies the downtime requirements, because the VM is already not interested in the packets; it is busy doing the FLR.

> >
> >
> > > > And in stop mode, device context would capture #5 or #4, depending
> > > > where is
> > > device at that point.
> > > >
> > > > > >
> > > > > > > And what's more, the above state machine seems to be virtio
> > > > > > > specific, but you don't explain the interaction with the
> > > > > > > device status state
> > > > > machine.
> > > > > > First, above is not a state machine.
> > > > >
> > > > > So how do readers know if a state can go to another state and when?
> > > > >
> > > > Not sure what you mean by reader. Can you please explain.
> > >
> > > The people who read virtio spec.
> > >
> > So question is "how reader knows if a state can go to another state and
> when"?
> > It is described and listed in the table, when a mode can change.
> 
> It's not only "if" but also "when". Your table partially answers the "if" but not
> "when". I think you should know now the state transition is conditional. So let's
> try our best to ease the life of the vendor.
What do you mean by "when"?
I do not understand "mode change is conditional"; it is not based on a condition.
[..]

> > > Let's define the synchronization point first. And it demonstrates at
> > > least devices need to synchronize between the freeze/stop and virtio
> > > device status machine which is not as easy as what is done in this patch.
> > >
> > Synchronization point = device.
> 
> This is obvious as we can't rule stuff outside virtio, and we are talking about
> devices not drivers here. But the spec needs sufficient guidance/normative for
> the vendor to implement. It's more than just saying "device is synchronization
> point".
> 
The requirements already cover what the device needs to do.
Some interaction points are missing, as I acked above; I will add them.

[..]
> > > Until virtio reset, this is how virtio works now. I've pointed out
> > > that it may cause extra troubles when trying to resume, but you
> > > don't tell me what's wrong to keep that?
> > >
> > If kept, hypervisor may not be able to decide when to change the mode from
> active->stop.
> 
> Why? It is simply done when mgmt requires a migration?
> 
Mgmt is a bit higher level entity. The software layers underneath may wait until the time is right to migrate.
The fundamental point is that the device context is expected to return the incremental value, that is, the content changed since the last read.
So once all changed content is read, it is empty.
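
A rough sketch of this read-returns-delta behaviour, under the assumption that the context is a small set of fixed-size sections with one "changed" flag each. The layout and the names dev_ctx, ctx_update() and ctx_read_delta() are hypothetical; the real device context format is defined in the device context patch.

```c
#include <stddef.h>
#include <stdint.h>

#define CTX_SECTIONS 4 /* hypothetical number of context sections */

/* Hypothetical device-side context: a value plus a "changed" flag per section. */
struct dev_ctx {
    uint32_t value[CTX_SECTIONS];
    uint8_t  changed[CTX_SECTIONS];
};

/* One record returned to the owner driver by a context read. */
struct ctx_record {
    uint32_t idx;
    uint32_t value;
};

/* Device-internal update, e.g. as a result of a driver notification. */
static void ctx_update(struct dev_ctx *c, size_t idx, uint32_t v)
{
    if (idx < CTX_SECTIONS) {
        c->value[idx] = v;
        c->changed[idx] = 1;
    }
}

/*
 * Return only the sections changed since the previous read and clear
 * their flags, so an immediate second read returns nothing.
 */
static size_t ctx_read_delta(struct dev_ctx *c, struct ctx_record *out, size_t cap)
{
    size_t n = 0;

    for (size_t i = 0; i < CTX_SECTIONS && n < cap; i++) {
        if (!c->changed[i])
            continue;
        out[n].idx = (uint32_t)i;
        out[n].value = c->value[i];
        c->changed[i] = 0;
        n++;
    }
    return n;
}
```

A first read while the device is active returns everything that was touched so far; a second read after freeze returns only what changed in between, which is what keeps the final round short.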

> What's more important, PCI allows multiple common_cfgs. So the hypervisor
> can choose to reserve one common_cfg for live migration. In this case we don't
> have to read to clear semantics.
Common_cfg does not serve a large device context, nor does it serve DMA.

> 
> Or, are you saying the value read from common_cfg is not device context? 
The value of common config is part of the device context that represents current common config.

> Isn't this conflict with your vague definition of device context?
>
You mentioned you stopped at this patch, so likely you didn’t read the device context patch, hence you call it vague.
So I don’t know what you mean by vague.
Please let me know what additional things you want to see in the device context after you reach that patch.

 
> > We can opt for a mode where full device context is read in each mode
> without clearing it.
> > But then it can be very specific to a version of qemu, which we are avoiding
> here.
> >
> > > > 2. device context returns incremental value from the previous
> > > > read. So, it
> > > needs to clear it.
> > >
> > > I don't understand here. This is not the case for most of the devices.
> > >
> > Not sure which devices you mean here with "most of the devices".
> > Device context functions like a write record pages (aka dirty pages).
> 
> It's definitely different. We want to migrate dirty pages lively which can
> consume a lot of bandwidth. So reporting delta makes a lot of sense here since
> it would have a lot of rounds of syncing and it doesn't result in blockers
> resuming.
> 
Write records are reported as delta from the previous read.

> For device context, how many rounds of syncing did you expect, and if we have
> N rounds, we need to restore N rounds in order to resume? Do you want to live
> migrating device states? If it's only 1 or 2 rounds, why bother?
> 
The device context is live migrated. Typically, in the current software using it, it is 2 rounds.
The interface is generic, so more rounds are possible if needed.

Even a device will, for most practical purposes, implement 2 rounds.

> And for the delta, how do you know you can easily define deltas for every type
> of device, especially the ones with complicated internal states? Defining states
> has already been demonstrated as a complicated task for some devices like
> virtio-FS and you want to complicate it further?
> 
What is your question? If you say virtio-fs has complicated state, maybe it should not have existed in the virtio spec in the first place.
But I beg to differ.
Virtio-fs guest side state won't be changed as part of it.
Virtio-fs is the first device which has considered and listed migrating the device state.
So it should be possible.

> What is proposed in this series is an ad-hoc optimization for a specific device
> type within a specific subsystem (e.g VFIO) in a specific operating system which
> is not the general.
> 
Oh now you mention vfio. Not me. :)

I am not going to comment on this. It is not ad-hoc.
It uses a technique similar to the dirty page tracking present in cpu hw and other devices.

> As demonstrated many times, starting from something simple and stupid is the
> most easy way.
> 

> > Whatever is already returned is/should not be repeated in subsequent reads,
> though device can choose to do so.
> >
> > > >
> > > > > > And which software stack may find this useful?
> > > > > > Is there any existing software that can utilize it?
> > > > >
> > > > > Libvirt.
> > > > >
> > > > Does libvirt restore on migration failure?
> > >
> > > Yes.
> > >
> > Ok. the device will be able to resume when it is marked active.
> > The device context returned  is the incremental delta as explained above.
> 
> I disagree, see my above reply.
I replied above.

> 
> >
> > > >
> > > > > > Why that device context present with the software vanished, in
> > > > > > your
> > > > > assumption, if it is?
> > > > > >
> > > > > > > > Typically, on
> > > > > > > > +the source hypervisor, the owner driver reads the device
> > > > > > > > +context once when the device is in \field{Active} or
> > > > > > > > +\field{Stop} mode and later once the member device is in
> > > \field{Freeze} mode.
> > > > > > >
> > > > > > > Why need the read while device context could be changed? Or
> > > > > > > is the dirty page part of the device context?
> > > > > > >
> > > > > > It is not part of the dirty page.
> > > > > > It needs to read in the active/stop mode, so that it can be
> > > > > > shared with
> > > > > destination hypervisor, which will pre-setup the complex context
> > > > > of the device, while it is still running on the source side.
> > > > >
> > > > > Is such a method used by any hypervisor?
> > > > Yes. qemu which uses vfio interface uses it.
> > >
> > > Ok, such software technology could be used for all types of devices,
> > > I don't see any advantages to mention it here unless it's unique to virtio.
> > >
> > It is theory of operation that brings the clarity and rationale.
> 
> I think it's not. Since it's not something that is unique to virtio.
> 
> > So I will keep it.
> >
> > > >
> > > > >
> > > > > >
> > > > > > > > +
> > > > > > > > +Typically, the device context is read and written one
> > > > > > > > +time on the source and the destination hypervisor
> > > > > > > > +respectively once the device is in \field{Freeze} mode.
> > > > > > > > +On the destination hypervisor, after writing the device
> > > > > > > > +context, when the device mode set to \field{Active}, the
> > > > > > > > +device uses the most recently set device context and
> > > > > > > > +resumes the device
> > > > > > > operation.
> > > > > > >
> > > > > > > There's no context sequence, so this is obvious. It's the
> > > > > > > semantic of all other existing interfaces.
> > > > > > >
> > > > > > > Can you please explain which existing interfaces you mean here?
> > > > >
> > > > > For any common cfg member. E.g queue_addr.
> > > > >
> > > > > The driver wrote 100 different values to queue_addr and the
> > > > > device used the value written last time.
> > > > >
> > > > o.k. I don’t see any problem in stating what is done, which is
> > > > less vague. 😊
> > > >
> > > > > >
> > > > > > > > +
> > > > > > > > +In an alternative flow, on the source hypervisor the
> > > > > > > > +owner driver may choose to read the device context first
> > > > > > > > +time while the device is in \field{Active} mode and
> > > > > > > > +second time once the device is in \field{Freeze}
> > > > > > > mode.
> > > > > > >
> > > > > > > Who is going to synchronize the device context with possible
> > > > > > > configuration from the driver?
> > > > > > >
> > > > > > Not sure I understand the question.
> > > > > > If I understand you right, do you mean that, When
> > > > > > configuration change is done by the guest driver, how does device
> context change?
> > > > > >
> > > > >
> > > > > Yes.
> > > > >
> > > > > > If so, device context reading will reflect the new configuration.
> > > > >
> > > > > How do you do that? For example:
> > > > >
> > > > > > static inline void vp_iowrite64_twopart(u64 val,
> > > > > >                                         __le32 __iomem *lo,
> > > > > >                                         __le32 __iomem *hi)
> > > > > > {
> > > > > >         vp_iowrite32((u32)val, lo);
> > > > > >         vp_iowrite32(val >> 32, hi);
> > > > > > }
> > > > >
> > > > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > > > >
> > > > Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> > > section captures the partial value.
> > >
> > > There's no way for the device to know whether or not it's a partial value or
> not.
> > > No?
> > >
> > The device does not need to know, because when the guest vm and the device are
> > resumed on the destination, the guest vm will continue with writing the 2nd
> > part.
> >
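
To illustrate the point about the partial value: a minimal sketch, assuming the device keeps the two 32-bit halves as independent registers and the device context simply copies both. struct queue_addr_regs and the helpers are hypothetical and only mirror the quoted vp_iowrite64_twopart() split.

```c
#include <stdint.h>

/* Hypothetical device-side view of a 64-bit field written as two halves. */
struct queue_addr_regs {
    uint32_t lo;
    uint32_t hi;
};

/* First MMIO write of the pair: low 32 bits. */
static void write_lo(struct queue_addr_regs *r, uint64_t val)
{
    r->lo = (uint32_t)val;
}

/* Second MMIO write of the pair: high 32 bits. */
static void write_hi(struct queue_addr_regs *r, uint64_t val)
{
    r->hi = (uint32_t)(val >> 32);
}

/* Value the device uses once both halves have been written. */
static uint64_t queue_addr(const struct queue_addr_regs *r)
{
    return ((uint64_t)r->hi << 32) | r->lo;
}
```

If the source is frozen after write_lo(), copying the struct to the destination carries the half-written value as is; once the guest resumes there and issues write_hi(), queue_addr() yields the full 64-bit value without the device ever knowing the write was split.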
> > > >
> > > > > >
> > > > > > > > Similarly, on the
> > > > > > > > +destination hypervisor writes the device context first
> > > > > > > > +time while the device is still running in \field{Active}
> > > > > > > > +mode on the source hypervisor and writes the device
> > > > > > > > +context second time while the device is in
> > > > > > > \field{Freeze} mode.
> > > > > > > > +This flow may result in very short setup time as the
> > > > > > > > +device context likely have minimal changes from the
> > > > > > > > +previously written device
> > > > > context.
> > > > > > >
> > > > > > > Is the hypervisor who is in charge of doing the comparison
> > > > > > > and writing only the delta?
> > > > > > >
> > > > > > The spec commands allow to do so. So possibility exists from spec
> wise.
> > > > >
> > > > > There are various optimizations for migration for sure, I don't
> > > > > think mentioning any specific one is good.
> > > > >
> > > > The text is informative text similar to,
> > > >
> > > > " However, some devices benefit from the ability to find out the
> > > > amount of available data in the queue without accessing the
> > > > virtqueue in
> > > memory"
> > > >
> > > > " To help with these optimizations, when
> > > > VIRTIO_F_NOTIFICATION_DATA has
> > > been negotiated".
> > > >
> > > > Is this the only optimization in virtio? No, but we still mention
> > > > the rationale of
> > > why it exists.
> > >
> > > The above is a good example as it explain VIRTIO_F_NOTIFICATION_DATA
> > > is the only way without accessing the virtqueue. But this is not the case of
> migration.
> > > You said it's just a possibility but not a must which is not the
> > > case for VIRTIO_F_NOTIFICATION_DATA.
> > >
> > It is one of the optimization apart. The comparison is of one_of_example or
> not.
> 
> I don't get this.
The theory of operation describes a flow of how things are done and how the constructs help to achieve it.
It is not an exhaustive list,
but that does not mean one should not write those.


* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13  1:16                       ` Jason Wang
@ 2023-10-13  6:36                         ` Parav Pandit
  2023-10-17  1:53                           ` Jason Wang
  2023-10-13 11:26                         ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-13  6:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Friday, October 13, 2023 6:47 AM
> 
> On Thu, Oct 12, 2023 at 6:58 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Thursday, October 12, 2023 3:51 PM
> > >
> > > On 10/11/2023 7:43 PM, Parav Pandit wrote:
> > > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >> Sent: Wednesday, October 11, 2023 3:55 PM
> > > >>>>>>> I don’t have any strong opinion to keep it or remove it as
> > > >>>>>>> most stakeholders
> > > >>>>>> has the clear view of requirements now.
> > > >>>>>>> Let me know.
> > > >>>>>> So some people use VFs with VFIO. Hence the module name.
> > > >>>>>> This sentence by itself seems to have zero value for the spec. Just
> drop it.
> > > >>>>> Ok. Will drop.
> > > >>>> So why not build your admin vq live migration on our config
> > > >>>> space solution, get out of the troubles, to make your life easier?
> > > >>>>
> > > >>> Your this question is completely unrelated to this reply or you
> > > >>> misunderstood
> > > >> what dropping commit log means.
> > > >> if you can rebase admin vq LM on our basic facilities, I think
> > > >> you dont need to talk about vfio in the first place, so I ask you
> > > >> to re-consider
> > > Jason's proposal.
> > > > I don’t really know why you are upset with the vfio term.
> > > > It is the use case of the cloud operator and it is listed to
> > > > indicate how proposal
> > > fits in a such use case.
> > > > If for some reason, you don’t like vfio, fine. Ignore it and move on.
> > > >
> > > > I already answered that I will remove from the commit log, because
> > > > the
> > > requirements are well understood now by the committee.
> > > >
> > > > Your comment is again unrelated (repeated) to your past two questions.
> > > >
> > > > I explained you the technical problem that admin command (not
> > > > admin VQ)
> > > of basic facilities cannot be done using config registers without
> > > any mediation layer.
> > > OK, I pop-ed Jason's proposal to make everything easier, and I see it is
> refused.
> > Because it does not work for passthrough mode.
> 
> How and why? What's wrong with just passing through the newly introduced 2
> or 3 registers to guests?
> 
If they are passed through to the guest, which is not involved in the live migration flow, the hypervisor cannot use them to operate the device.
VF = member device = controlled function
PF = owner device = controlling function
Device migration commands from the hypervisor are not forwarded inside the guest.

> This is the question you never answer even if I keep asking.
> 

> > Sure, as I explained the config register method do not work for passthrough
> mode, and does not scale.
> 
> We need to make sure your migration proposal can work for 1 VF which is still
> questionable then we can talk about others like scaling. No?
Sure, but while making sure of that, the interface is built so that it can work for N VFs too.

> 
> And most of your concern regarding scalability seems more like a limitation of a
> transport. Let's not mix the scalability for a specific transport with the one for
> core virtio devices.
Virtio device is for the defined transport.
So it needs to work for the defined transport.
Therefore, scalability cannot be ignored.

It is not a question anymore: for any bulk transfer, the virtqueue is the specification's choice, for obvious technical gains.

> 
> >
> > > >
> > > >> inflight descriptor tracking will be implemented by Eugenio in V2.
> > > > When we have near complete proposal from two device vendors, you
> > > > want to push something to unknown future without reviewing the
> > > > work; does not
> > > make sense.
> > > Didn't I ever provide feedback to you? Really?
> > No. I didn’t see why you need to post a new patch for dirty page tracking,
> when it is already present in this series.
> 
> You know there are various ways to do dirty paging? For example, the well
> known bitmap and its variants. I think we've discussed several times in many
> places in the past. I don't see where you explain why you choose one of them
> but not the others but you want to forbid other types of dirty page logging?
> Why?
The well known bitmap simply does not work effectively for the pci transport in an atomic way.
Frankly, you tend to derive many conclusions to oppose the work. :)
Other types of dirty page logging are not forbidden.

I don’t see the need to explain, word by word, why a given scheme is chosen in the spec language.
There is a line drawn between writing a book and writing a spec.

> > Secondly I don’t see how one can read 1M flows using config registers.
> 
> Why can't we trap them? vIOMMU even migrates internal translation tables.

Because they are added and removed over the virtqueues almost as data path operations.
Virtio queues are not mediated/trapped when the native device is a virtio member device itself.


* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-13  1:18                         ` Jason Wang
@ 2023-10-13  6:40                           ` Parav Pandit
  2023-10-17  2:10                             ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-13  6:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

> From: Jason Wang <jasowang@redhat.com>
> Sent: Friday, October 13, 2023 6:48 AM
> 
> On Thu, Oct 12, 2023 at 7:37 PM Parav Pandit <parav@nvidia.com> wrote:
> > As Michael said, software based nesting is used..
> 
> I've pointed out in another thread when hardware has less abstraction level
> than nesting, trap/emulation is a must.
> 
> > See if actual hw based devices can implement it or not. Many components of
> cpu cannot do N level nesting either, but may be virtio can.
> > I don’t know how yet.
> 
> I would not repeat the lessons given by Gerald J. Popek and Robert P.
> Goldberg[1] in 1976, but I think you miss a lot of fundamental things in the
> methodology of virtualization. 
Weekend is coming. I will read it.

> For example, nesting is a very important criteria
> to examine whether an architecture is well designed for virtualization.
>

In my reading of a leading OS vendor's documentation, I learnt that the OS vendor does not recommend nested virtualization for production [1].
Snippet:
"In addition, Red Hat does not recommend using nested virtualization in production user environments, due to various limitations in functionality. Instead, nested virtualization is primarily intended for development and testing scenarios."

[1] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_virtualization/creating-nested-virtual-machines_configuring-and-managing-virtualization

A second leading hypervisor lists nested virtualization as not to be used for "performance sensitive applications".

I want to repeat and emphasize that I am not ignoring the nested case.

An extension for nesting would be that the VF presented to the guest, itself with SR-IOV capability, can work as-is as proposed here.
Michael presented the idea of the dummy PF, which is to represent the VF as a dummy PF which can do SR-IOV with one VF.
You need support from the platform too; I guess the TC can extend it.
Maybe a different interface is more suitable for the nested case, which does not have performance needs.

How about a nested user having an AQ located on the VF, so that the mediation sw can operate admin commands on itself?
Device mode commands will not be applicable there; instead, some other things need to be done.
So non passthrough mode software can possibly make use of it?

> That is to say for any CPU/hypervisor vendors, the architecture should be
> designed to run any levels of nesting instead of just an awkward 2 levels (but
> what you proposed can not work for even 2).
Huh, taking some missing text for a corner case and claiming _not_working_ is not a healthy discussion.

> For x86 and KVM, any level of
> nesting has been done for about 10 years ago.
>
I didn’t find hw for PML support in x86 for N or 3 level nesting. Did I miss it?
I didn’t find hw for nested page table walking up to N levels on PCIe reads/writes in any cpu. Did I miss it?
Have you seen nesting in hw work at N levels?

> For virtio, it can do any level. So did for vhost/vDPA. For example, I usually
> develop and test virtio/vDPA/vhost in a nesting environment.
> 
Great.
Can you share relative performance test results for 2 and 3 level nesting, covering cpu utilization and latency?

> Thanks
> 
> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073


* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-12 10:58                     ` Parav Pandit
  2023-10-12 11:17                       ` Michael S. Tsirkin
  2023-10-13  1:16                       ` Jason Wang
@ 2023-10-13  9:06                       ` Zhu, Lingshan
  2023-10-13 11:28                         ` Michael S. Tsirkin
  2023-10-13 11:28                         ` Parav Pandit
  2 siblings, 2 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-13  9:06 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/12/2023 6:58 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, October 12, 2023 3:51 PM
>>
>> On 10/11/2023 7:43 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Wednesday, October 11, 2023 3:55 PM
>>>>>>>>> I don’t have any strong opinion to keep it or remove it as most
>>>>>>>>> stakeholders
>>>>>>>> has the clear view of requirements now.
>>>>>>>>> Let me know.
>>>>>>>> So some people use VFs with VFIO. Hence the module name.  This
>>>>>>>> sentence by itself seems to have zero value for the spec. Just drop it.
>>>>>>> Ok. Will drop.
>>>>>> So why not build your admin vq live migration on our config space
>>>>>> solution, get out of the troubles, to make your life easier?
>>>>>>
>>>>> Your this question is completely unrelated to this reply or you
>>>>> misunderstood
>>>> what dropping commit log means.
>>>> if you can rebase admin vq LM on our basic facilities, I think you
>>>> dont need to talk about vfio in the first place, so I ask you to re-consider
>> Jason's proposal.
>>> I don’t really know why you are upset with the vfio term.
>>> It is the use case of the cloud operator and it is listed to indicate how proposal
>> fits in a such use case.
>>> If for some reason, you don’t like vfio, fine. Ignore it and move on.
>>>
>>> I already answered that I will remove from the commit log, because the
>> requirements are well understood now by the committee.
>>> Your comment is again unrelated (repeated) to your past two questions.
>>>
>>> I explained you the technical problem that admin command (not admin VQ)
>> of basic facilities cannot be done using config registers without any mediation
>> layer.
>> OK, I pop-ed Jason's proposal to make everything easier, and I see it is refused.
> Because it does not work for passthrough mode.
what are you talking about?
Config space does not work for passthrough?
Have you ever tried passing through a virtio device to a guest?
>
>>>>> Dropping link to vfio does not drop the requirement.
>>>>> I am ok to drop because requirements are clear of passthrough of
>>>>> member
>>>> device.
>>>>> Vfio is not a trouble at all.
>>>>> Admin command is not a trouble either.
>>>>>
>>>>> The pure technical reason is: all the functionalities proposed
>>>>> cannot be done
>>>> in any other existing way.
>>>>> Why? For below reasons.
>>>>> 1. device context, and write records (aka dirty page addresses) is
>>>>> huge which cannot be shared using config registers at scale of 4000
>>>>> member devices
>>>> dirty page tracking will be implemented in V2, actually I have the
>>>> patch right now.
>>> That is yet again the invitation to non_collaboration mode.
>>> Without reviewing, v0 and v1, you want to show dirty page tracking in some
>> other way.
>>> But ok, that is your non_cooperative mode of working. Cannot help further.
>> I believe both me and Jason have proposed a solution, I see it is rejected.
>> But don't take it personal and please keep professional.
> Sure, as I explained, the config register method does not work for passthrough mode, and does not scale.
Let me repeat: these live migration facilities are
per-device (per-VF) facilities, so each device only migrates itself.

And for passthrough, you can try passing through a virtio device to a
guest and see how the guest initializes the device
through the config space.

That is really basic virtualization, not hard to test.
>
>>>> inflight descriptor tracking will be implemented by Eugenio in V2.
>>> When we have near complete proposal from two device vendors, you want
>>> to push something to unknown future without reviewing the work; does not
>> make sense.
>> Didn't I ever provide feedback to you? Really?
> No. I didn’t see why you need to post a new patch for dirty page tracking, when it is already present in this series.
> I would like to understand and review this aspects.
> Same for the device context.
You will see dirty page tracking in my V2, as I have repeated many times.
For device context, we have discussed this in other threads; did you
ignore that again?
Hint: how do you define device context for every device type, e.g.,
virtio-fs?
Don't say you only migrate virtio-net or blk.
>
>>> You are still in the mode of _take_ what we did with near zero explanation.
>>> You asked question of why passthrough proposal cannot advantage of in_band
>> config registers.
>>> I explained technical reason listed here.
>> I have answered the questions, and asked questions for many times.
>> What do you mean by "why passthrough proposal cannot advantage of in_band
>> config registers."?
>> Config space work for passthrough for sure.
> Config space registers are passthrough the guest VM.
> Hence the hypervisor messing with it, programming some address, would result in either a security issue
> or broken functionality; to sustain the functionality, each nested layer needs one copy of these registers for each nest level.
> So they must be trapped somehow.
Trap and emulate is basic virtualization.
>
> Secondly I don’t see how one can read 1M flows using config registers.
Not sure what you are talking about, beyond the spec?
>
>>> So please don’t jump to conclusions before finishing the discussion on how
>> both side can take advantage of each other.
>>> Lets please do that.
>> We have proposed a solution, right?
>>
> Which one? To do something in future?
> I don’t see a suggestion on how one can use device context and dirty page tracking for nested and passthrough uniformly.
> I see a technical difficulty in making both work with uniform interface.
Please don't ignore previous answers, don't force us repeat again and again.

It is Jason's proposal. Please refer to previous threads, also for 
device context and dirty pages.
>
>> I still need to point out: admin vq LM does not work, one example is nested.
> As Michael said, please don’t confuse between admin commands and admin vq.
Anyway, admin vq live migration doesn't work for nested.
>
>>>> There is no scale problem, as I have repeated many times; they are
>>>> per-device basic facilities, just migrate the VF by its own facility,
>>>> so there are no 4000 member devices, this is not per PF.
>>>>
>>> I explained that device reset, flr etc flow cannot work when controlling and
>> controlled functions are single entity for passthrough mode.
>>> The scale problem is, one needs to duplicate the registers on each VF.
>>> The industry is moving away from the register interface in many _real_ hw
>> devices implementation.
>>> Some of the examples are IMS, SIOV, NVMe and more.
>> we have discussed this for many times, please refer to previous threads, even
>> with Jason.
> I do not agree for any registers to add to the VF which are reset on device_reset and FLR.
> As it does not work for passthrough mode.
Jason has answered these FLR questions of yours many times, and I don't want
to repeat his words;
I myself have also answered many times. If you keep ignoring the answers
and asking again and again,
what is the point?

So please refer to the previous threads.
>
>>>> The device context can be read from config space or trapped, like
>>>> shadow
>>> There are 1 million flows of the net device flow filters in progress.
>>> Each flow is 64B in size.
>>> Total size is 64MB.
>>> I don’t see how one can read such amount of memory using config registers.
>> control vq?
> The control vq and flow filter vqs are owned by the guest driver, not the hypervisor.
> So no, cvq cannot be used.
First, don't cut off the threads and don't delete words; that is really
confusing for readers.

And I think you misunderstand a lot of virtualization fundamentals;
at least have a look at how shadow control vq works.

And the parameters set to config vq are also device context, as we
have discussed many times.
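
For scale, the 64MB figure quoted above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the flow count and per-flow size come from the thread, while the 4-byte register width is an assumption made here for illustration and is not something either proposal specifies.

```python
# Back-of-the-envelope check of the flow filter context size quoted above.
FLOW_COUNT = 1_000_000       # "1 million flows" (from the thread)
FLOW_SIZE_BYTES = 64         # "Each flow is 64B in size" (from the thread)
REGISTER_WIDTH_BYTES = 4     # ASSUMPTION: one 32-bit register read per access


def context_size_bytes(flows: int, flow_size: int) -> int:
    """Total bytes needed to hold every flow entry."""
    return flows * flow_size


def register_reads_needed(total_bytes: int, width: int) -> int:
    """Number of fixed-width register reads needed (ceiling division)."""
    return -(-total_bytes // width)


total = context_size_bytes(FLOW_COUNT, FLOW_SIZE_BYTES)
reads = register_reads_needed(total, REGISTER_WIDTH_BYTES)

print(f"context size: {total} bytes ({total // 10**6} MB)")
print(f"register reads at {REGISTER_WIDTH_BYTES} bytes each: {reads}")
```

Under that assumed register width, the transfer takes 16 million individual reads per device, which is the kind of cost the scale argument is about; a different access width changes the count but not the order of magnitude.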
>
>> Or do you want to migrate non-virtio context?
> Every thing is virtio device context.
see above
>
>>>> control vq which is already done, that is basic virtualization.
>>> There is nothing like "basic virtualization".
>>> What is proposed here is fulfilling the requirement of passthrough mode.
>>>
>>> Your comment is implying, "I don’t care for passthrough requirements, do
>> non_passthrough".
>> that is your understanding, and you misunderstood it. Config space servers
>> passthrough for many years.
> "Config space servers" ?
> I do not understand it, can you please explain what does that mean?
>
> I do not see your suggestion on how one can implement passthrough member device when passthrough device does the dma and migration framework also need to do the dma.
Try passing through a virtio device to a guest and learn how the guest takes
advantage of the config space before you comment.
>
>>> The discussion should be,
>>> How can we leverage common framework for passthrough and mediated
>> mode?
>>> Can we? If so, which are the pieces?
>> config space is a common framework, right?
>>> For me it is frankly very weird to take native virtio member device, convert
>> into a mediated device using a giant software, and after that convolution get
>> virtio device.
>>> But for nested case you have the use case.
>>> So if we focus positively on how two use cases can use some common
>> functionality, that will be great.
>> why config space need a giant sw to work?
> You can count the number of lines of code for existing and rest 30+ devices to see how much does it take.
> Which is still missing some of the code for small downtime.
> And compare it with passthrough driver code.
>
> Regardless, I just don’t see how config registers work.
Again, please try passing through a device to a guest. Try to understand
how config space works.
>
>> So both Jason and I suggest you build admin vq solution based on our basic
>> facilities.
> :)
> That basic facility is missing dirty page tracking, P2P support, device context, FLR, device reset support.
> Hence, it is unusable right now for passthrough member device.
> And 6th problematic thing in it is, it does not scale with member devices.
Please refer to previous discussions, it is meaningless if you keep 
ignoring our answers and keep asking the same
questions.
>
>>>> If you want to migrate device context, you need to specify device
>>>> context for every type of device, net maybe easy, how do you see virtio-fs?
>>> Virtio-fs will have its own device context too.
>>> Every device has some sort of backend in varied degree.
>>> Net being widely used and moderate complex device.
>>> Fs being slightly stateful but less complex than net, as it has far less control
>> operations.
>> so, do you say you have implement a live migration solution which can migrate
>> device context, but only work for net or block?
> I don’t think this question about implementation has any relevance.
> Frankly feels like a court to me. :(
> No. I didn’t say that.
> We have implemented net, fs, block devices and single framework proposed here can support all 3 and rest 28+.
> The device context part in this series do not cover special/optional things of all the device type.
> This is something I promised to do gradually, once the framework looks good.
If you don't define them, and only talk about "migrate the device
context" without telling us what to migrate,
does this make sense to anybody?
>> Then you should call it virtio net/blk migration and implement in net/block
>> section.
> No. you misunderstood. My point was showing orthogonal complexities of net vs fs.
> I likely failed to explain that.
See above; anyway, you need to define them. How about starting from
virtio-fs?
>
>>> In fact virtio-fs device already discusses the migrating the device side state, as
>> listed in device context.
>>> So virtio-fs device will have its own device-context defined.
>> if you want to migrate it, you need to define it
> Sure.
> Only device specific things to be defined in future.
Now, not future if you want to migrate device context.
> Rest is already present.
> We are not going to define all the device context in one patch series that no one can review reliably.
> It will be done incrementally.
so you agree at least for now we should migrate stateless devices, right?
>
> But the feedback, I am taking is, we need to add a command that indicates which TLVs are supported in the device migration.
> So virtio-fs or other device migration capabilities can be discovered.
> I will cover this in v2.
so you propose a solution as "virtio migration", but only migrate 
selective types of devices?
You should rename it to be "virtio-net live migration".
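
The discovery idea quoted above (a command that reports which TLVs a device supports for migration) could be sketched roughly as follows. Everything concrete here is an assumption for illustration: the little-endian 16-bit type / 32-bit length header, the type numbers used in the example, and the function names are hypothetical and are not taken from the patch series.

```python
import struct

# ASSUMPTION: hypothetical TLV header layout (little-endian u16 type, u32 length).
HEADER = struct.Struct("<HI")


def encode_tlv(type_id: int, value: bytes) -> bytes:
    """Pack one type-length-value record."""
    return HEADER.pack(type_id, len(value)) + value


def decode_tlvs(buf: bytes) -> list:
    """Split a byte stream into (type_id, value) records."""
    records = []
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < HEADER.size:
            raise ValueError("truncated TLV header")
        type_id, length = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        value = buf[offset:offset + length]
        if len(value) != length:
            raise ValueError("truncated TLV value")
        records.append((type_id, value))
        offset += length
    return records


def missing_types(required: set, supported: set) -> set:
    """TLV types the migration needs but the device does not report."""
    return required - supported


# Hypothetical example: the device reports types 1 and 2, the context uses 1, 2, 3.
stream = encode_tlv(1, b"ab") + encode_tlv(2, b"") + encode_tlv(3, b"xyz")
required = {type_id for type_id, _ in decode_tlvs(stream)}
print(missing_types(required, supported={1, 2}))
```

With a check of this kind, migration software could refuse to start when the destination does not report every TLV type present in the source's device context, instead of failing midway.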
>
> Thanks a lot for these thoughts.
>
>>> The infrastructure and basic facilities are setup in this series, that one can
>> easily extend for all the current and new device types.
>> really? how?
>>>> And we are migrating stateless devices, or no? How do you migrate virtio-fs?
>>>>> 2. sharing such large context and write addresses in parallel for
>>>>> multiple devices cannot be done using single register file
>>>> see above
>>>>> 3. These registers cannot be residing in the VF because VF can
>>>>> undergo FLR, and device reset which must clear these registers
>>>> do you mean you want to audit all PCI features? When FLR, the device
>>>> is reset, do you expect a device to remember anything after FLR?
>>> Not at all. VF member device will not remember anything after FLR.
>>>> Do you want to trap FLR? Why?
>>> This proposal does _not_ want to trap the FLR in the hypervisor virtio driver.
>>>
>>> When one does the mediation-based design, it must trap/emulate/fake the
>> FLR.
>>> It helps to address the case of nested as you mentioned.
>> once passthrough, the guest driver can access the config space to reset the
>> device, right?
>>>> Why FLR block or conflict with live migration?
>>> It does not block or conflict.
>> OK, cool, so let's make this a conclusion
>>> The whole point is, when you put live migration functionality on the VF itself,
>> you just cannot FLR this device.
>>> One must trap the FLR and do fake FLR and build the whole infrastructure to
>> not FLR The device.
>>> Above is not passthrough device.
>> No, the guest can reset the device, even causing a failed live migration.
> Not in the proposal here.
> Can you please prove how in the current v1 proposal, device reset will fail the migration?
> I would like to fix it.
if the device is reset, it forgets everything right?
>
>>>>> 4. When VF does the DMA, all dma occurs in the guest address space,
>>>>> not in
>>>> hypervisor space; any flr and device reset must stop such dma.
>>>>> And device reset and flr are controlled by the guest (not mediated
>>>>> by
>>>> hypervisor).
>>>> if the guest reset the device, it is totally reasonable operation,
>>>> and the guest own the risk, right?
>>> Sure, but the guest still expects its dirty pages and device context to be
>> migrated across device_reset.
>>> Device_reset will lose all this information within the device if done without
>> mediation and special care.
>> No, if the guest reset a device, that means the device should be RESET, to forget
>> its config, that would be really weird to migrate a fresh device at the source
>> side, to be a running device at the destination side.
> Device reset not doing the role of reset is just a plain broken spec.
why? The reset behavior is well defined in the spec, and works fine for 
years.
>
>>> So, to avoid that now one needs to have fake device reset too and build that
>> infrastructure to not reset.
>>> The passthrough proposal fundamental concept is:
>>>
>>> all the native virtio functionalities are between guest driver and the actual
>> device.
>> see above.
>>>> and still, do you want to audit every PCI features? at least you
>>>> didn't do that in your series.
>>> Can you please list which PCI features audit you are talking about?
>> you audit FLR, then do you want to check everyone?
>> If no, how to decide which one should be audited, why others not?
> I really find it hard to follow your question.
>
> I explained in patch 5 and 8 about interactions with the FLR and its support.
> Not sure what you want me to check.
>
> You mentioned that "I didn’t audit every PCI features"? So can you please list which one and in relation to which admin commands?
It is your job to audit every one of them if you talk about FLR. Because FLR comes from the PCI
spec, not virtio, you need to explain why other PCI features do not
need to be audited.

We have explained many times why FLR is not a concern, and I don't
want to repeat it; please refer to previous discussions.
>
>>> Keep in mind, that will all the mediation, one now must equally audit all this
>> giant software stack too.
>>> So maybe it is fine for those who are ok with it.
>> so you agree FLR is not a problem, at least for config space solution?
> I don’t know what you mean "FLR is not a problem".
>
> FLR on the VF must work as it works without live migration for passthrough device as today.
> And admin commands have some interactions with it.
> And this proposal covers it.
> I am missing some text that Michael and Jason pointed out.
> I am working on v2 to annotate or better word them.
When the guest resets the device, the device should be reset for sure. Then
it forgets everything;
how do you expect the reset device to still work for live migration? Is
it a race?
>
>>>> For migration, you know the hypervisor takes the ownership of the
>>>> device in the stop_window.
>>> I do not know what stop_window means.
>>> Do you mean stop_copy of vfio or it is qemu term?
>> when guest freeze.
>>>>> 5. Any PASID to separate out admin vq on the VF does not work for
>>>>> two
>>>> reasons.
>>>>> R_1: device flr and device reset must stop all the dmas.
>>>>> R_2: PASID by most leading vendors is still not mature enough
>>>>> R_3: One also needs to do inversion to not expose PASID capability
>>>>> of the member PCI device to not expose
>>>> see above and what if guest shutdown? the same answer, right?
>>> Not sure, I follow.
>>> If the guest shutdown, the guest specific shutdown APIs are called.
>>>
>>> With passthrough device, R_1 just works as is.
>>> R_3 is not needed as they are directly given to the guest.
>>> R_2 platform dependency is not needed either.
>> I think we already have a concussion for FLR.
> I don’t have any concussion.
> I wrote what to be supported for the FLR above.
OK, our discussions have been ignored again, and everything starts over again.

Would you please read our previous discussions?
>
>> For PASID, what blocks the solution?
> When the device is passthrough, PASID capabilities cannot be emulated.
> PASID space is owned fully by the guest.
>
> There is no single known cpu vendor support splitting pasid between hypervisor and guest.
> I can double check, but last I recall that Linux kernel removed such weird support.
do you know there is something called vIOMMU?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-12 11:37                       ` Parav Pandit
  2023-10-12 13:03                         ` Michael S. Tsirkin
  2023-10-13  1:18                         ` Jason Wang
@ 2023-10-13  9:44                         ` Zhu, Lingshan
  2023-10-13 11:54                           ` Parav Pandit
  2023-10-13 13:49                           ` Michael S. Tsirkin
  2 siblings, 2 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-13  9:44 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/12/2023 7:37 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, October 12, 2023 4:40 PM
>> On 10/12/2023 6:09 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Thursday, October 12, 2023 3:30 PM
>>>>
>>>> On 10/11/2023 6:54 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Wednesday, October 11, 2023 3:38 PM
>>>>>>
>>>>>>>> The system admin can choose only passthrough some of the devices
>>>>>>>> for nested guests, so passthrough the PF to L1 guest is not a
>>>>>>>> good idea, because there can be many devices still work for the host or
>> L1.
>>>>>>> Possible. One size does not fit all.
>>>>>>> What I expressed is most common scenarios that user care about.
>>>>>> don't block existing usecases, don't break the userspace, nested is
>> common.
>>>>> Nothing is broken as virtio spec do not have any single construct to
>>>>> support
>>>> migration.
>>>>> If nested is common, can you share the performance number with real
>>>>> virtio
>>>> device with/without 2 level nesting?
>>>>> I frankly don’t know how they look like.
>>>> virtio devices support nested, I mean don't break this usecase And
>>>> end user accept performance overhead in nested, this is not related to this
>> topic.
>>> Can you show an example of virtio device nesting and live migration already
>> supported where the device has _done_ the live migration.
>>> Due to which you claim that new feature of admin command-based owner
>> and member device breaks something?
>> current virtio/kvm/qemu support nested.
> Sure, two of the 3 components are not part of the virtio spec.
> Hence, they are not broken.
You want virtio to work for them, right? Don't break this.
>
>>> Please don’t use the verb "break".
>>> Your proposal is the first of its kind that supports migrating nested device.
>>> This is why new patches of config register or admin command does not break
>> anything existing.
>> if your proposal don't support nested, you break nested use cases.
>>>>>>>>> In second use case, where one want to bind only one member
>>>>>>>>> device to one VM, I think same plumbing can be extended to have
>>>>>>>>> another VF, to take
>>>>>>>> the role of migration device instead of owner device.
>>>>>>>>> I don’t see a good way to passthrough and also do in-band
>>>>>>>>> migration without
>>>>>>>> lot of device specific trap and emulation.
>>>>>>>>> I also don’t know the cpu performance numbers with 3 levels of
>>>>>>>>> nested page
>>>>>>>> table translation which to my understanding cannot be accelerated
>>>>>>>> by the current cpu.
>>>>>>>> host_PA->L1_QEMU_VA->L1_Guest_PA->L1_QEMU_VA->L2_Guest_PA
>> and
>>>> so
>>>>>> on,
>>>>>>>> there can be performance overhead, but can be done.
>>>>>>>>
>>>>>>>> So admin vq migration still don't work for nested, this is surely a
>> blocker.
>>>>>>> In specific case of member devices are located at different nest
>>>>>>> level, it does
>>>>>> not.
>>>>>> so you got the point, so this series should not be merged.
>>>>>>> Why prevents you have a peer VF do the role of migration driver?
>>>>>>> Basically, what I am proposing is, connect two VFs to the L1 guest.
>>>>>>> One VF is
>>>>>> migration driver, one VF is passthrough to L2 guest.
>>>>>>> And same scheme works.
>>>>>> A peer VF? A management VF? still break the existing usecase. and
>>>>>> how do you transfer ownership of L2 VF from PF to L1 VF?
>>>>> A peer management VF which services admin command (like PF).
>>>>> Ownership of admin command is delegated to the management VF.
>>>> interesting, do you plan to cook a patch implementing this?
>>> No. I am hoping that you can help to draft those patches for nested case to
>> work when one wants to hand of single VM to single nested guest VM.
>>> I will not be able to test any of nested things and show its performance value
>> either, as I don’t see how rest of the eco system can match up for the nested.
>>> Hence, your expertise in drafting extension for nested is desired.
> Answer to your below question of patch drafting is here. If you can help to extend it will be good.
Where is the draft patch?
>
>>>> Really make sense?
>>>>
>>>> How do you transfer the ownership?
>>> An additional ownership delegation by a new admin command.
>> if you think this can work, do you want to cook a patch to implement this before
>> you submitting this live migration series?
> I answered this already above.
talk is cheap, show me your patch
>
>>>> How to you maintain a different group?
>>> One to one assignment.
>> same as above
>>>> How do you isolate the groups?
>>> Not sure, what it means. The explicit group is created and VFs are placed in
>> this group.
>> VF resource are on PF, right?
> Which resource?
> Before jumping to resource, may be you want to answer "group isolation"?
>
>>>> How to you keep the guest or host secure?
>>> Please be specific. Its very broad question when it comes to defining the
>> interface.
>> without isolation, can be attacked?
> What isolation are you talking about?
> I am suggesting that one VF as dummy PF is given the role of admin commands.
>
>>>> How do you manage the overlaps?
>>> Overlaps between?
>> host pf and L1 VF
> L1 VF works at its own level.
> Host PF works at its own level.
> This is the true nesting.
>
>>>> How do you implement the hardware support that?
>>> Please consult your board designers. Hard to say how to implement something
>> in generic.
>> so you don't have an idea
> :)
> Right, I do not have idea for Intel boards.
> I was suggesting a management VF that can service the admin commands.
>
>>>> How do you change the PCI routing?
>>> Why anything to be changed in PCI routing?
>> do you place PF and management VF in an ACL group?
> ACL group at which layer?
>
>> Does L1 management VF's member device belong to the PF physically?
> Yes.
Answer all questions above, if you think a management VF can work,
please show me your patch.
>>>>> It does not break any existing deployments.
>>>> we are talking about nested, don't break nested
>>> Virtio spec for nested is not defined yet. Hence nothing is broken. Please avoid
>> using the verb, _break_.
>> virtio nested works for many years
> I replied: your break comment is not applicable to virtio_spec, nor does it apply to any existing software you listed.
>
> As Michael said, software based nesting is used..
> See if actual hw based devices can implement it or not. Many components of cpu cannot do N level nesting either, but may be virtio can.
> I don’t know how yet.
two facts:
1. virtio has worked for nested for years
2. your admin vq lm solution does not work for nested



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13  1:16                       ` Jason Wang
  2023-10-13  6:36                         ` Parav Pandit
@ 2023-10-13 11:26                         ` Michael S. Tsirkin
  2023-10-13 11:41                           ` Parav Pandit
  2023-10-17  1:42                           ` Jason Wang
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-13 11:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Oct 13, 2023 at 09:16:43AM +0800, Jason Wang wrote:
> On Thu, Oct 12, 2023 at 6:58 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Thursday, October 12, 2023 3:51 PM
> > >
> > > On 10/11/2023 7:43 PM, Parav Pandit wrote:
> > > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >> Sent: Wednesday, October 11, 2023 3:55 PM
> > > >>>>>>> I don’t have any strong opinion to keep it or remove it as most
> > > >>>>>>> stakeholders
> > > >>>>>> has the clear view of requirements now.
> > > >>>>>>> Let me know.
> > > >>>>>> So some people use VFs with VFIO. Hence the module name.  This
> > > >>>>>> sentence by itself seems to have zero value for the spec. Just drop it.
> > > >>>>> Ok. Will drop.
> > > >>>> So why not build your admin vq live migration on our config space
> > > >>>> solution, get out of the troubles, to make your life easier?
> > > >>>>
> > > >>> Your this question is completely unrelated to this reply or you
> > > >>> misunderstood
> > > >> what dropping commit log means.
> > > >> if you can rebase admin vq LM on our basic facilities, I think you
> > > >> dont need to talk about vfio in the first place, so I ask you to re-consider
> > > Jason's proposal.
> > > > I don’t really know why you are upset with the vfio term.
> > > > It is the use case of the cloud operator and it is listed to indicate how proposal
> > > fits in a such use case.
> > > > If for some reason, you don’t like vfio, fine. Ignore it and move on.
> > > >
> > > > I already answered that I will remove from the commit log, because the
> > > requirements are well understood now by the committee.
> > > >
> > > > Your comment is again unrelated (repeated) to your past two questions.
> > > >
> > > > I explained you the technical problem that admin command (not admin VQ)
> > > of basic facilities cannot be done using config registers without any mediation
> > > layer.
> > > OK, I pop-ed Jason's proposal to make everything easier, and I see it is refused.
> > Because it does not work for passthrough mode.
> 
> How and why? What's wrong with just passing through the newly
> introduced 2 or 3 registers to guests?
> 
> This is the question you never answer even if I keep asking.

It is, fundamentally, a question of supporting as many architectures
as we can as opposed to being opinionated.

On the one end of the spectrum, device is completely under guest control
and anything external has to trap to hypervisor.
None of existing implementations are there, at least pci config space
is typically under hypervisor control.
What Parav calls "passthrough" is built I think along these lines:
memory and interrupts go straight to guest, config space
is trapped and emulated.
On the other side of the spectrum is trapping everything in hypervisor.
Your "2 to 3 registers" is also not there, but is I think closer to that end
of the arc.

Any new feature should ideally be a building block supporting as many
approaches as possible. Fundamentally that requires a level of
indirection, as usual :) Having two completely distinct interfaces for
that straight off the bat?  Gimme a break.
-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13  9:06                       ` Zhu, Lingshan
@ 2023-10-13 11:28                         ` Michael S. Tsirkin
  2023-10-13 11:42                           ` Parav Pandit
  2023-10-16  8:41                           ` Zhu, Lingshan
  2023-10-13 11:28                         ` Parav Pandit
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-13 11:28 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Oct 13, 2023 at 05:06:02PM +0800, Zhu, Lingshan wrote:
> Hint: how do you define device context for every device type, e.g,
> virtio-fs.
> Don't say you only migrate virito-net or blk.

Indeed. I don't think anyone can avoid either defining that or
leaving that completely up to implementations and just hoping
they don't miss anything. Given the choice I'm definitely for
defining. Each device must grow a "migration" section documenting
the context.

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13  9:06                       ` Zhu, Lingshan
  2023-10-13 11:28                         ` Michael S. Tsirkin
@ 2023-10-13 11:28                         ` Parav Pandit
  2023-10-13 11:49                           ` Michael S. Tsirkin
  2023-10-16  9:44                           ` Zhu, Lingshan
  1 sibling, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-13 11:28 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, October 13, 2023 2:36 PM

[..]
> > Because it does not work for passthrough mode.
> what are you talking about?
> Config space does not work passthrough?

Once the register space of the VF that is supposed to be used by the live migration is passed to the guest, it is under guest control.
Hence, the live migration driver won't be able to use it.

> Have you ever tried pass through a virtio device to a guest?
:)
Please explain how the question is relevant to this discussion in a separate thread, so that one can keep technical focus.
(Please keep your discussion technical, instead of derogatory to other members).

> Let me repeat again, these live migration facilities are
> per-device(per-VF) facility, so it only migrates itself.
> 
Since they are per device (per VF), they reside in the guest VM. Hence, the VMM cannot use them to live migrate it.

> And for pass through, you can try passthrough a virito device to a guest, see
> how the guest initialize the device through the config space.
> 
> That is really basic virtualization, not hard to test.
Repeated points, I am omitting.

> >
> >>>> inflight descriptor tracking will be implemented by Eugenio in V2.
> >>> When we have near complete proposal from two device vendors, you
> >>> want to push something to unknown future without reviewing the work;
> >>> does not
> >> make sense.
> >> Didn't I ever provide feedback to you? Really?
> > No. I didn’t see why you need to post a new patch for dirty page tracking,
> when it is already present in this series.
This is plain ignorance and shows non_cooperative mode of working in technical committee.

> > I would like to understand and review this aspects.
> > Same for the device context.
> you will see dirty page tracking in my V2, as I repeated for many times.
Since you are not co-operative, I have less sympathy to see V2.
I don’t see a reason to see it when it is fully presented here.

> For device context, we have discussed this in other threads, did you ignored that
> again?
No. I didn’t. I replied that the generic infrastructure is built that enables every device type to migrate by defining its device context.

> Hint: how do you define device context for every device type, e.g, virtio-fs.
> Don't say you only migrate virito-net or blk.
I didn’t say it. I said to migrate all 30+ device types.
And infrastructure is presented here.

> >
> >>> You are still in the mode of _take_ what we did with near zero explanation.
> >>> You asked question of why passthrough proposal cannot advantage of
> >>> in_band
> >> config registers.
> >>> I explained technical reason listed here.
> >> I have answered the questions, and asked questions for many times.
> >> What do you mean by "why passthrough proposal cannot advantage of
> >> in_band config registers."?
> >> Config space work for passthrough for sure.
> > Config space registers are passthrough the guest VM.
> > Hence hypervisor messing it with, programming some address would result in
> either security issue.
> > Or functionally broken, to sustain the functionality, each nested layer needs
> one copy of these registers for each nest level.
> > So they must be trapped somehow.
> trap and emulated are basic virtualization.
Not for passthrough devices, sorry.
See the paper that Jason pointed out.
The control program/VMM trap is involved only on the privileged operations of the VMM.
Virtio cvqs and virtio registers are not privileged operations of the VMM, because they belong to the native virtio device itself.
Period.

> >
> > Secondly I don’t see how one can read 1M flows using config registers.
> Not sure what you are talking about, beyond the spec?
The spec which has been in the works for a few months by multiple technical members.
Please subscribe to virtio-comment mailing list.
How come you changed your point from cvq to different argument of out of spec? :)

> >
> >>> So please don’t jump to conclusions before finishing the discussion
> >>> on how
> >> both side can take advantage of each other.
> >>> Lets please do that.
> >> We have proposed a solution, right?
> >>
> > Which one? To do something in future?
> > I don’t see a suggestion on how one can use device context and dirty page
> tracking for nested and passthrough uniformly.
> > I see a technical difficulty in making both work with uniform interface.
> Please don't ignore previous answers, don't force us repeat again and again.
> 
You didn’t answer, how.
Your answer was "you will post dirty page tracking without reviewing current" and Eugenio will post v2....

> It is Jason's proposal. Please refer to previous threads, also for device context
> and dirty pages.
> >
> >> I still need to point out: admin vq LM does not work, one example is nested.
> > As Michael said, please don’t confuse between admin commands and admin
> vq.
> anyway, admin vq live migration don't work for nested.
I am convinced by the paper that Jason pointed out.

A nested solution involves a member device supporting the nesting without trap and emulation so that it follows the two properties:
The efficiency property and equivalence property.

Hence a member device which wants to support nested case, should present itself with attributes to support nesting.


> >
> >>>> There are no scale problem as I repeated for many time, they are
> >>>> per-device basic facilities, just migrate the VF by its own
> >>>> facility, so there are no 40000 member devices, this is not per PF.
> >>>>
> >>> I explained that device reset, flr etc flow cannot work when
> >>> controlling and
> >> controlled functions are single entity for passthrough mode.
> >>> The scale problem is, one needs to duplicate the registers on each VF.
> >>> The industry is moving away from the register interface in many
> >>> _real_ hw
> >> devices implementation.
> >>> Some of the examples are IMS, SIOV, NVMe and more.
> >> we have discussed this for many times, please refer to previous
> >> threads, even with Jason.
> > I do not agree for any registers to add to the VF which are reset on
> device_reset and FLR.
> > As it does not work for passthrough mode.
> Jason has answered your these FLR questions for many times, I don't want to
> repeat his words, even myself have answered many times. If you keep ignoring
> the answers, and ask again and again, what is the point?
> 
> So please refer to the previous threads.

I don’t think I asked the question above. Please re-read.

> >
> >>>> The device context can be read from config space or trapped, like
> >>>> shadow
> >>> There are 1 million flows of the net device flow filters in progress.
> >>> Each flow is 64B in size.
> >>> Total size is 64MB.
> >>> I don’t see how one can read such amount of memory using config
> registers.
> >> control vq?
> > The control vq and flow filter vqs are owned by the guest driver, not the
> hypervisor.
> > So no, cvq cannot be used.
> first, don't cut off the threads, don't delete words, that really confusing readers.
> 
Your comments are so long that it is hard to follow such a long thread.
Hence only the related comments are kept.
But I understand, will try to avoid.
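
For scale, the flow filter figures quoted above (1 million flows, 64B each) can be worked out as below. This is only a rough sketch: the 4-byte register access width is an assumption for illustration and is not stated anywhere in this thread.

```python
# Figures quoted in the thread: 1 million flow filter entries of 64 bytes each.
FLOWS = 1_000_000
BYTES_PER_FLOW = 64

# Assumption for illustration only: one config register read returns 4 bytes.
REGISTER_WIDTH_BYTES = 4

total_bytes = FLOWS * BYTES_PER_FLOW
register_reads = total_bytes // REGISTER_WIDTH_BYTES

print(total_bytes)     # 64000000 bytes, i.e. 64 MB
print(register_reads)  # 16000000 individual register reads
```

Under that assumed width, reading the whole flow table through a register window would take on the order of 16 million accesses per device.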

> And I think you misunderstand a lot of virtualization fundamentals, at least have
> a look at how shadow control vq works.
> 
In case you don’t know, the shadow cvq acceleration for Nvidia ConnectX6-DX was done jointly by Dragos and me, with recent patches from Sie-Wei.

I don’t think I missed it.

Shadow vq is great when you don’t have underlying support from the device.

When you have passthrough member devices, they are not trapped or emulated.
The future hypervisor must not be able to see the contents of the cvq, data vqs or addresses programmed by the guest.
And hence the infrastructure is geared towards such approach.

> And the parameters set to config vq are also device context as we discussed for
> many times.
> >
> >> Or do you want to migrate non-virtio context?
> > Every thing is virtio device context.
> see above
> >
> >>>> control vq which is already done, that is basic virtualization.
> >>> There is nothing like "basic virtualization".
> >>> What is proposed here is fulfilling the requirement of passthrough mode.
> >>>
> >>> Your comment is implying, "I don’t care for passthrough
> >>> requirements, do
> >> non_passthrough".
> >> that is your understanding, and you misunderstood it. Config space
> >> servers passthrough for many years.
> > "Config space servers" ?
> > I do not understand it, can you please explain what does that mean?
> >
> > I do not see your suggestion on how one can implement passthrough member
> device when passthrough device does the dma and migration framework also
> need to do the dma.
> Try pass through a virtio device to a guest and learn how the guest take
> advantage the config space before you comment.
Right. It does not work. The guest is doing the device_reset and flr.
Hence, it is resetting everything. All the dirty page log is lost.
All the device context is lost.
Hypervisor didn’t see any of this happening, because it didn’t do the trap.

Look, if you are going to continue to argue that you must do trap + emulation and don’t talk about passthrough,
Please stop here, because discussion won't go anywhere.

I did my best to answer the limitations in the very first email where you asked.

> > That basic facility is missing dirty page tracking, P2P support, device context,
> FLR, device reset support.
> > Hence, it is unusable right now for passthough member device.
> > And 6th problemetic thing in it is, it does not scale with member devices.
> Please refer to previous discussions, it is meaningless if you keep ignoring our
> answers and keep asking the same questions.
Again, please re-read, I didn’t ask the question.
I replied with 6 problems that are not solved.

> >
> >>>> If you want to migrate device context, you need to specify device
> >>>> context for every type of device, net maybe easy, how do you see virtio-fs?
> >>> Virtio-fs will have its on device context too.
> >>> Every device has some sort of backend in varied degree.
> >>> Net being widely used and moderate complex device.
> >>> Fs being slightly stateful but less complex than net, as it has far
> >>> less control
> >> operations.
> >> so, do you say you have implement a live migration solution which can
> >> migrate device context, but only work for net or block?
> > I don’t think this question about implementation has any relevance.
> > Frankly feels like a court to me. :(
> > No. I dint say that.
> > We have implemented net, fs, block devices and single framework proposed
> here can support all 3 and rest 28+.
> > The device context part in this series do not cover special/optional things of
> all the device type.
> > This is something I promised to do gradually, once the framework looks good.
> If you don't define them, only talking about "migrate the device context" but
> don't tell us what do migrate, does this make sense to anybody?
> >> Then you should call it virtio net/blk migration and implement in
> >> net/block section.
> > No. you misunderstood. My point was showing orthogonal complexities of net
> vs fs.
> > I likely failed to explain that.
> see above, anyway you need to define them, how about starting form virito FS?
> >
> >>> In fact virtio-fs device already discusses the migrating the device
> >>> side state, as
> >> listed in device context.
> >>> So virtio-fs device will have its own device-context defined.
> >> if you want to migrate it, you need to define it
> > Sure.
> > Only device specific things to be defined in future.
> Now, not future if you want to migrate device context.
It is not mandatory, and it is impractical to do everything in one series.
It is planned for 1.4.

> > Rest is already present.
> > We are not going to define all the device context in one patch series that no
> one can review reliably.
> > It will be done incrementally.
> so you agree at least for now we should migrate stateless devices, right?
> >
> > But the feedback, I am taking is, we need to add a command that indicates
> which TLVs are supported in the device migration.
> > So virtio-fs or other device migration capabilities can be discovered.
> > I will cover this in v2.
> so you propose a solution as "virtio migration", but only migrate selective types
> of devices?

> You should rename it to be "virtio-net live migration".
Sorry, I won’t, because the infrastructure is for the majority of device types.

Which field did you observe that is net specific?
We want to cover all the device types.
We don’t need to cook all their contexts in one series.

> >
> > Thanks a lot for this thoughts.
> >
> >>> The infrastructure and basic facilities are setup in this series,
> >>> that one can
> >> easily extend for all the current and new device types.
> >> really? how?
> >>>> And we are migrating stateless devices, or no? How do you migrate virtio-
> fs?
> >>>>> 2. sharing such large context and write addresses in parallel for
> >>>>> multiple devices cannot be done using single register file
> >>>> see above
> >>>>> 3. These registers cannot be residing in the VF because VF can
> >>>>> undergo FLR, and device reset which must clear these registers
> >>>> do you mean you want to audit all PCI features? When FLR, the
> >>>> device is rested, do you expect a device remember anything after FLR?
> >>> Not at all. VF member device will not remember anything after FLR.
> >>>> Do you want to trap FLR? Why?
> >>> This proposal does _not_ want to trap the FLR in the hypervisor virtio driver.
> >>>
> >>> When one does the mediation-based design, it must trap/emulate/fake
> >>> the
> >> FLR.
> >>> It helps to address the case of nested as you mentioned.
> >> once passthrough, the guest driver can access the config space to
> >> reset the device, right?
> >>>> Why FLR block or conflict with live migration?
> >>> It does not block or conflict.
> >> OK, cool, so let's make this a conclusion
> >>> The whole point is, when you put live migration functionality on the
> >>> VF itself,
> >> you just cannot FLR this device.
> >>> One must trap the FLR and do fake FLR and build the whole
> >>> infrastructure to
> >> not FLR The device.
> >>> Above is not passthrough device.
> >> No, the guest can reset the device, even causing a failed live migration.
> > Not in the proposal here.
> > Can you please prove how in the current v1 proposal, device reset will fail the
> migration?
> > I would like to fix it.
> if the device is reset, it forgets everything right?
Right. This is why all dirty page tracking and device context is lost on device reset.
Hence, the controlling function and controlled function are two different entities.
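
To make the point above concrete, here is a toy model of it. It is only a sketch: the class and field names are hypothetical and are not the data model of this series. It shows tracking state kept on the member device being wiped by a guest-initiated reset, while state kept by a separate owner device survives.

```python
class MemberDevice:
    """Toy VF: any state kept on it is cleared by device reset / FLR."""

    def __init__(self):
        # Hypothetical: dirty page log kept on the VF itself.
        self.on_vf_dirty_pages = set()

    def reset(self):
        # The guest can trigger this at any time in passthrough mode.
        self.on_vf_dirty_pages.clear()


class OwnerDevice:
    """Toy PF: keeps per-member dirty page logs outside the member."""

    def __init__(self):
        self.dirty_pages = {}

    def track(self, vf_id):
        self.dirty_pages.setdefault(vf_id, set())


def guest_dma_write(owner, vf_id, vf, page):
    # Record the written page in both places to compare them after reset.
    vf.on_vf_dirty_pages.add(page)
    owner.dirty_pages[vf_id].add(page)


owner, vf = OwnerDevice(), MemberDevice()
owner.track(0)
guest_dma_write(owner, 0, vf, 0x1000)
guest_dma_write(owner, 0, vf, 0x2000)
vf.reset()  # guest-initiated, not seen by the hypervisor

print(sorted(vf.on_vf_dirty_pages))  # []
print(sorted(owner.dirty_pages[0]))  # [4096, 8192]
```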

> >
> >>>>> 4. When VF does the DMA, all dma occurs in the guest address
> >>>>> space, not in
> >>>> hypervisor space; any flr and device reset must stop such dma.
> >>>>> And device reset and flr are controlled by the guest (not mediated
> >>>>> by
> >>>> hypervisor).
> >>>> if the guest reset the device, it is totally reasonable operation,
> >>>> and the guest own the risk, right?
> >>> Sure, but the guest still expects its dirty pages and device context
> >>> to be
> >> migrated across device_reset.
> >>> Device_reset will lose all this information within the device if
> >>> done without
> >> mediation and special care.
> >> No, if the guest reset a device, that means the device should be
> >> RESET, to forget its config, that would be really wired to migrate a
> >> fresh device at the source side, to be a running device at the destination
> side.
> > Device reset not doing the role of reset is just a plain broken spec.
> why? The reset behavior is well defined in the spec, and works fine for years.
So any new construct that one adds will be reset as well, and dirty page tracking is lost.

> >
> >>> So, to avoid that now one needs to have fake device reset too and
> >>> build that
> >> infrastructure to not reset.
> >>> The passthrough proposal fundamental concept is:
> >>>
> >>> all the native virtio functionalities are between guest driver and
> >>> the actual
> >> device.
> >> see above.
> >>>> and still, do you want to audit every PCI features? at least you
> >>>> didn't do that in your series.
> >>> Can you please list which PCI features audit you are talking about?
> >> you audit FLR, then do you want to check everyone?
> >> If no, how to decide which one should be audited, why others not?
> > I really find it hard to follow your question.
> >
> > I explained in patch 5 and 8 about interactions with the FLR and its support.
> > Not sure what you want me to check.
> >
> > You mentioned that "I didn’t audit every PCI features"? So can you please list
> which one and in relation to which admin commands?
> Your job to audit everyone if you talk about FLR. Because FLR is PCI spec, not
> virtio, you need to explain why other PCI features not need to be audited.
> 
Sure, but when you point a finger that I didn’t audit, please mention what is not audited.

> We have explained why FLR is not a concern for many times, and I don't want
> to repeat, please refer to previous discussions.
You seem to ignore the first paragraph of theory of operation that FLR is not trapped.

> >
> >>> Keep in mind, that will all the mediation, one now must equally
> >>> audit all this
> >> giant software stack too.
> >>> So maybe it is fine for those who are ok with it.
> >> so you agree FLR is not a problem, at least for config space solution?
> > I don’t know what you mean "FLR is not a problem".
> >
> > FLR on the VF must work as it works without live migration for passthrough
> device as today.
> > And admin commands have some interactions with it.
> > And this proposal covers it.
> > I am missing some text that Michael and Jason pointed out.
> > I am working on v2 to annotate or better word them.
> When guest reset the device, the device should be reset for sure. then it forgets
> everything, how do you expect the reset-ed device still work for live migration?
> is it a race?
I don’t expect live migration to work at all with such an approach.
This is why in my proposal live migration occurs on the owner device, while controlled function (member device) is undergoing the device reset.

> >
> >>>> For migration, you know the hypervisor takes the ownership of the
> >>>> device in the stop_window.
> >>> I do not know what stop_window means.
> >>> Do you mean stop_copy of vfio or it is qemu term?
> >> when guest freeze.
> >>>>> 5. Any PASID to separate out admin vq on the VF does not work for
> >>>>> two
> >>>> reasons.
> >>>>> R_1: device flr and device reset must stop all the dmas.
> >>>>> R_2: PASID by most leading vendors is still not mature enough
> >>>>> R_3: One also needs to do inversion to not expose PASID capability
> >>>>> of the member PCI device to not expose
> >>>> see above and what if guest shutdown? the same answer, right?
> >>> Not sure, I follow.
> >>> If the guest shutdown, the guest specific shutdown APIs are called.
> >>>
> >>> With passthrough device, R_1 just works as is.
> >>> R_3 is not needed as they are directly given to the guest.
> >>> R_2 platform dependency is not needed either.
> >> I think we already have a concussion for FLR.
> > I don’t have any concussion.
> > I wrote what to be supported for the FLR above.
> OK, again, our discussions has been ignored again, and all start over again.
> 
> Would you please read our previous discussions?

You asked the question about why it won’t work, I answered.
I don’t see a point of debating same thing over again.

> >
> >> For PASID, what blocks the solution?
> > When the device is passthrough, PASID capabilities cannot be emulated.
> > PASID space is owned fully by the guest.
> >
> > There is no single known cpu vendor support splitting pasid between
> hypervisor and guest.
> > I can double check, but last I recall that Linux kernel removed such weird
> support.
> do you know there is something called vIOMMU?
Probably yes.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:26                         ` Michael S. Tsirkin
@ 2023-10-13 11:41                           ` Parav Pandit
  2023-10-13 11:52                             ` Michael S. Tsirkin
  2023-10-17  1:42                           ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-13 11:41 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Friday, October 13, 2023 4:56 PM

> > This is the question you never answer even if I keep asking.
> 
> It is, fundamentally, a question of supporting as many architectures as we can
> as opposed to being opinionated.
> 
> On the one end of the spectrum, device is completely under guest control and
> anything external has to trap to hypervisor.
> None of existing implementations are there, at least pci config space is typically
> under hypervisor control.
> What Parav calls "passthrough" is built I think along these lines:
> memory and interrupts go straight to guest, config space is trapped and
> emulated.
> On the other side of the spectrum is trapping everything in hypervisor.
> Your "2 to 3 registers" is also not there, but is I think closer to that end of the
> arc.
> 
> Any new feature should ideally be a building block supporting as many
> approaches as possible. Fundamentally that requires a level of indirection, as
> usual :) Having two completely distinct interfaces for that straight off the bat?
> Gimme a break.

There are two approaches.

1. Passthrough a virtio member device to guest
Only PCI config space and MSI-X table is trapped.
The MSI-X table is also trapped due to a cpu/platform limitation. I will not go into that detail for the moment.
All the rest of the virtio member device interface is passthrough to the guest.
This includes,
(a) virtio common config space
(b) virtio device specific config space
(c) cvq if present
(d) io vqs and more vqs
(e) any shared memory
(f) any new construct that arise in coming years

If one wants to do nesting, the member device should support nesting, and it will still be able to do so at the next level.
To my knowledge, most cpus support single level nesting, that is VMM and VM.
Any higher-level nesting involves a good amount of emulation in privileged operations.

If virtio is to be even more efficient than the rest of the platform, I propose that the member device can support nesting, so that the VMM->VM_L1 and VM_L1->VM_L2 constructs are the same.
This gives the best of both: nesting support and passthrough.
And since it is a layered approach, it naturally works for the nested case.

2. Data path accelerated in device, rest all emulated.
This method makes sense when the underlying device is not a native virtio device.

But if, for some reason, one wants to build that infrastructure, we can attempt to find common pieces between methods #1 and #2.
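
The split described in approach #1 can be written down as a small table. This is only an illustrative sketch; the region names are informal labels taken from the list above, not identifiers from the spec or from this series.

```python
from enum import Enum


class Access(Enum):
    TRAPPED = "trapped by hypervisor"
    PASSTHROUGH = "direct to guest"


# Approach #1 as described above: only PCI config space and the MSI-X
# table are trapped; every virtio interface goes straight to the guest.
APPROACH_1 = {
    "pci_config_space": Access.TRAPPED,
    "msix_table": Access.TRAPPED,
    "virtio_common_config": Access.PASSTHROUGH,
    "virtio_device_config": Access.PASSTHROUGH,
    "cvq": Access.PASSTHROUGH,
    "io_vqs": Access.PASSTHROUGH,
    "shared_memory": Access.PASSTHROUGH,
}


def hypervisor_visible(regions):
    """Return the names of regions the hypervisor gets to intercept."""
    return sorted(name for name, acc in regions.items() if acc is Access.TRAPPED)


print(hypervisor_visible(APPROACH_1))  # ['msix_table', 'pci_config_space']
```

Any migration facility placed in one of the passthrough rows is, by this table, owned by the guest and not visible to the hypervisor.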

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13  1:15               ` Jason Wang
  2023-10-13  6:36                 ` Parav Pandit
@ 2023-10-13 11:41                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-13 11:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Fri, Oct 13, 2023 at 09:15:31AM +0800, Jason Wang wrote:
> > > 1) Only works for SR-IOV member device like VF
> > It can be extended to SIOV member device in future.
> > Today these are the only type of member device virtio has.
> 
> That is exactly what I want to say, it can only work for the
> owner/member model. It can't work when the virtio device is not
> structured like that. And you missed that most of the existing virtio
> devices are not implemented in this model. It means they can't be
> migrated with a pure virtio specific extension. For you, SR-IOV is all
> but this is not true for virtio. PCI is not the only transport and
> SR-IOV is not the only architecture in PCI.

Original version of admin command work supported another group type
including just the device itself. In the end these commands are
just a level of indirection.
Using them for migration was explicitly listed as a use case when they
were introduced.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:28                         ` Michael S. Tsirkin
@ 2023-10-13 11:42                           ` Parav Pandit
  2023-10-16  8:41                           ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-13 11:42 UTC (permalink / raw)
  To: Michael S. Tsirkin, Zhu, Lingshan
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 13, 2023 4:59 PM
> 
> On Fri, Oct 13, 2023 at 05:06:02PM +0800, Zhu, Lingshan wrote:
> > Hint: how do you define device context for every device type, e.g,
> > virtio-fs.
> > Don't say you only migrate virito-net or blk.
> 
> Indeed. I don't think anyone can avoid either defining that or leaving that
> completely up to implementations and just hoping they don't miss anything.
> Given the choice I'm definitely for defining. Each device must grow a
> "migration" section documenting the context.

+1.
Just like how we have defined "device configuration layout" and "control operation" for every device.
I expect the above section you define to grow.
We don't need to define all in one patch series. Once the infrastructure is present, it is very easy to add a per-device section.
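
As a rough sketch of how per-device sections could plug into one generic container, consider a TLV-style encoding. Everything here is an assumption for illustration: the header layout (u16 type, u32 length, little-endian) and the type numbers are hypothetical and are not taken from this series.

```python
import struct

# Assumed header layout: u16 type, u32 length, little-endian, no padding.
HEADER = struct.Struct("<HI")

# Hypothetical type numbers, for illustration only.
TLV_COMMON_CFG = 0x0001
TLV_FS_STATE = 0x0100


def encode(fields):
    """Serialize (type, value) pairs into one device context blob."""
    return b"".join(HEADER.pack(t, len(v)) + v for t, v in fields)


def decode(buf):
    """Parse a device context blob back into (type, value) pairs."""
    out, off = [], 0
    while off < len(buf):
        t, n = HEADER.unpack_from(buf, off)
        off += HEADER.size
        out.append((t, buf[off:off + n]))
        off += n
    return out


ctx = encode([(TLV_COMMON_CFG, b"\x01\x02"), (TLV_FS_STATE, b"fs")])
print(len(ctx))                      # 16
print([t for t, _ in decode(ctx)])   # [1, 256]
```

With such a container, a new device type only needs to define its own type numbers and value layouts; the generic read and write path does not change.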



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:28                         ` Parav Pandit
@ 2023-10-13 11:49                           ` Michael S. Tsirkin
  2023-10-13 12:00                             ` Parav Pandit
  2023-10-16  8:46                             ` Zhu, Lingshan
  2023-10-16  9:44                           ` Zhu, Lingshan
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-13 11:49 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Oct 13, 2023 at 11:28:50AM +0000, Parav Pandit wrote:
> > >>>> inflight descriptor tracking will be implemented by Eugenio in V2.
> > >>> When we have near complete proposal from two device vendors, you
> > >>> want to push something to unknown future without reviewing the work;
> > >>> does not
> > >> make sense.
> > >> Didn't I ever provide feedback to you? Really?
> > > No. I didn’t see why you need to post a new patch for dirty page tracking,
> > when it is already present in this series.
> This is plain ignorance and shows non_cooperative mode of working in technical committee.

I personally think it's fine to have multiple proposals on the table.
For example, current Zhu Lingshan's patch is clearly incomplete without
memory change tracking.
Why shouldn't he post a patchset demonstrating
how that is supposed to work in his view?
I am personally interested in seeing how that is
supposed to work - his latest proposal relies on migrating by
trapping memory accesses, but DMA can't generally be trapped like this.

In the end we want something addressing all use cases though
and integrating reasonably well with existing ecosystem.

-- 
MST

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:41                           ` Parav Pandit
@ 2023-10-13 11:52                             ` Michael S. Tsirkin
  2023-10-13 11:57                               ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-13 11:52 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Oct 13, 2023 at 11:41:21AM +0000, Parav Pandit wrote:
> 
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Michael S. Tsirkin
> > Sent: Friday, October 13, 2023 4:56 PM
> 
> > > This is the question you never answer even if I keep asking.
> > 
> > It is, fundamentally, a question of supporting as many architectures as we can
> > as opposed to being opinionated.
> > 
> > On the one end of the spectrum, device is completely under guest control and
> > anything external has to trap to hypervisor.
> > None of existing implementations are there, at least pci config space is typically
> > under hypervisor control.
> > What Parav calls "passthrough" is built I think along these lines:
> > memory and interrupts go straight to guest, config space is trapped and
> > emulated.
> > On the other side of the spectrum is trapping everything in hypervisor.
> > Your "2 to 3 registers" is also not there, but is I think closer to that end of the
> > arc.
> > 
> > Any new feature should ideally be a building block supporting as many
> > approaches as possible. Fundamentally that requires a level of indirection, as
> > usual :) Having two completely distinct interfaces for that straight off the bat?
> > Gimme a break.
> 
> There are two approaches.

I know of many more than 2.
There are as many approaches as hypervisor implementations.

> 1. Passthrough a virtio member device to guest
> Only PCI config space and MSI-X table is trapped.
> MSI-X table is also trapped due to a cpu/platform limitation. I will not go in that detail for a moment.
> All the rest of the virtio member device interface is passthrough to the guest.
> This includes,
> (a) virtio common config space
> (b) virtio device specific config space
> (c) cvq if present
> (d) io vqs and more vqs
> (e) any shared memory
> (f) any new construct that arise in coming years
> 
> If one wants to do nesting, the member device should support nesting and it will be still able to do to next level.
> To my knowledge, most cpus support single level nesting, that is VMM and VM.
> Any higher-level nesting involves good amount of emulation in privileged operations.
> 
> If virtio is to be even more efficient than the rest of the platform, I propose that the member device can support nesting, so VMM->VM_L1 and VM_L1->VM_L2 constructs are the same.
> This gives the best of both. Nesting support and passthrough both.
> And since its layered approach, it naturally works for nested case.
> 
> 2. Data path accelerated in device, rest all emulated.
> This method makes sense when the underlying device is not a native virtio device.
> 
> But for some reason, ok, one wants to build the infrastructure, we can attempt to find common pieces between #1 and #2 methods.

We can't just build new interfaces each time someone wants a slightly
different point on the pass through/emulation curve.
I feel this is an important point for TC members to agree on.

-- 
MST
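
A rough sketch of the two approaches quoted above, to show where they already overlap. All names below are hypothetical labels for items (a)-(e) in the quoted list, and the second model is only one reading of "data path accelerated in device, rest all emulated", not something the thread specifies:

```python
from enum import Enum


class Access(Enum):
    TRAP = "trapped and emulated by the hypervisor"
    PASSTHROUGH = "accessed directly by the guest"


# Hypothetical labels for the interfaces listed in the thread.
INTERFACES = [
    "pci_config_space",
    "msix_table",
    "virtio_common_config",
    "virtio_device_config",
    "cvq",
    "io_vqs",
    "shared_memory",
]


def build_model(trapped):
    """Map every interface to TRAP if named in `trapped`, else PASSTHROUGH."""
    return {
        name: Access.TRAP if name in trapped else Access.PASSTHROUGH
        for name in INTERFACES
    }


# Approach 1: only PCI config space and the MSI-X table are trapped.
PASSTHROUGH_MODEL = build_model({"pci_config_space", "msix_table"})

# Approach 2 (assumed reading): only the I/O virtqueues reach the device.
DATAPATH_ACCEL_MODEL = build_model(set(INTERFACES) - {"io_vqs"})


def trapped_interfaces(model):
    """Names of the interfaces a model traps, in sorted order."""
    return sorted(name for name, access in model.items() if access is Access.TRAP)


def common_handling(a, b):
    """Interfaces that both models treat the same way."""
    return sorted(name for name in INTERFACES if a[name] is b[name])


if __name__ == "__main__":
    print(trapped_interfaces(PASSTHROUGH_MODEL))
    print(common_handling(PASSTHROUGH_MODEL, DATAPATH_ACCEL_MODEL))
```

Under these assumptions the two models agree on PCI config space, the MSI-X table and the I/O virtqueues, which is one way to start looking for the "common pieces" between #1 and #2.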


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-13  9:44                         ` Zhu, Lingshan
@ 2023-10-13 11:54                           ` Parav Pandit
  2023-10-16  9:47                             ` Zhu, Lingshan
  2023-10-13 13:49                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-13 11:54 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, October 13, 2023 3:14 PM

> >>>> How do you transfer the ownership?
> >>> An additional ownership delegation by a new admin command.
> >> if you think this can work, do you want to cook a patch to implement
> >> this before submitting this live migration series?
> > I answered this already above.
> talk is cheap, show me your patch

Huh. We presented infrastructure that migrates 30+ device types and covers the device context ideas from Oracle.
It covers P2P and supports device_reset, FLR, and dirty page tracking.

Please have some respect for other members who covered more ground than your series.

What more? Apply the same nesting concept to the member device, as Michael suggested; nested virtualization then maintains exactly the same semantics.
So a VF is mapped as a PF to the L1 guest.
The L1 guest can enable SR-IOV on it and map one VF to the L2 guest.

This nesting work can be extended in the future, once first-level nesting is covered.

> Answer all questions above, if you think a management VF can work, please
> show me your patch.
The idea evolves from technical debate rather than from pointing fingers, as your comment does.

I think a positive discussion with Michael and a pointer to the paper from Jason [1] gave a good direction for doing nesting _right_, following two principles.
a. efficiency property
b. equivalence property

(c. resource control is natural already)

Both apply at the VMM and at the VM level, enabling recursive virtualization by having a VF that can act as a PF inside the guest.

[1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
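
The nesting idea above (a VF that acts as a PF for the next level) can be pictured with a toy ownership model. This is only a sketch of the proposal as discussed here: the class, field and function names are made up for illustration, and it does not describe what existing SR-IOV devices do.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Function:
    """A PCI function as seen by the software layer that owns it (hypothetical model)."""
    name: str
    level: int                              # 0 = host VMM, 1 = L1 guest, 2 = L2 guest
    owner: Optional["Function"] = None      # owner function one level up, if any
    members: List["Function"] = field(default_factory=list)

    def enable_sriov(self, num_vfs: int) -> List["Function"]:
        """Create member VFs; each one can be handed to the next level as its PF."""
        self.members = [
            Function(f"{self.name}/vf{i}", self.level + 1, owner=self)
            for i in range(num_vfs)
        ]
        return self.members


def ownership_chain(fn: Optional[Function]) -> List[str]:
    """Walk from a function up through its owners to the host PF."""
    chain = []
    while fn is not None:
        chain.append(fn.name)
        fn = fn.owner
    return chain


if __name__ == "__main__":
    host_pf = Function("pf", level=0)
    l1_pf = host_pf.enable_sriov(2)[0]      # VF passed to the L1 guest, used there as a PF
    l2_dev = l1_pf.enable_sriov(1)[0]       # L1 enables SR-IOV and maps one VF to L2
    print(ownership_chain(l2_dev))
```

The point of the sketch is that the owner/member relation is the same call at every level, which is what "VMM->VM_L1 and VM_L1->VM_L2 constructs are same" asks for.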

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:52                             ` Michael S. Tsirkin
@ 2023-10-13 11:57                               ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-13 11:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 13, 2023 5:22 PM

> 
> I know much more than 2.
> There are as many approaches as hypervisor implementations.
> 
> > 1. Passthrough a virtio member device to guest Only PCI config space
> > and MSI-X table is trapped.
> > MSI-X table is also trapped due to a cpu/platform limitation. I will not go in
> that detail for a moment.
> > All the rest of the virtio member device interface is passthrough to the guest.
> > This includes,
> > (a) virtio common config space
> > (b) virtio device specific config space
> > (c) cvq if present
> > (d) io vqs and more vqs
> > (e) any shared memory
> > (f) any new construct that arise in coming years
> >
> > If one wants to do nesting, the member device should support nesting and it
> will be still able to do to next level.
> > To my knowledge, most cpus support single level nesting, that is VMM and
> VM.
> > Any higher-level nesting involves good amount of emulation in privileged
> operations.
> >
> > If virtio is to be even more efficient than the rest of the platform, I propose that
> the member device can support nesting, so VMM->VM_L1 and VM_L1->VM_L2
> constructs are the same.
> > This gives the best of both. Nesting support and passthrough both.
> > And since its layered approach, it naturally works for nested case.
> >
> > 2. Data path accelerated in device, rest all emulated.
> > This method makes sense when the underlying device is not a native virtio device.
> >
> > But for some reason, ok, one wants to build the infrastructure, we can
> attempt to find common pieces between #1 and #2 methods.
> 
> We can't just build new interfaces each time someone wants a slightly different
> point on the pass through/emulation curve.
> I feel this is an important point for TC members to agree on.

Right. The above two are the most common (or at least the ones known to me to be in use), with clear requirements and an existing stack to integrate with.
So it is better to converge than to keep opposing #1.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:49                           ` Michael S. Tsirkin
@ 2023-10-13 12:00                             ` Parav Pandit
  2023-10-16  8:46                             ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-13 12:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 13, 2023 5:19 PM
> 
> On Fri, Oct 13, 2023 at 11:28:50AM +0000, Parav Pandit wrote:
> > > >>>> inflight descriptor tracking will be implemented by Eugenio in V2.
> > > >>> When we have near complete proposal from two device vendors, you
> > > >>> want to push something to unknown future without reviewing the
> > > >>> work; does not
> > > >> make sense.
> > > >> Didn't I ever provide feedback to you? Really?
> > > > No. I didn’t see why you need to post a new patch for dirty page
> > > > tracking,
> > > when it is already present in this series.
> > This is plain ignorance and shows non_cooperative mode of working in
> technical committee.
> 
> 
> I personally think it's fine to have multiple proposals on the table.
> For example, current Zhu Lingshan's patch is clearly incomplete without
> memory change tracking.
> Why shouldn't he post a patchset demonstrating how that is supposed to work
> in his view?
> I am personally interested in seeing how is that
> supposed to work   - his latest proposal relies on migrating by
> trapping memory accesses but DMA can't be trapped like this generally.
> 
> In the end we want something addressing all use cases though and integrating
> reasonably well with existing ecosystem.

True.
Let me address your and Jason's comments, which raise some important points to cover in v2.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-13  9:44                         ` Zhu, Lingshan
  2023-10-13 11:54                           ` Parav Pandit
@ 2023-10-13 13:49                           ` Michael S. Tsirkin
  2023-10-16  9:50                             ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-13 13:49 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Oct 13, 2023 at 05:44:27PM +0800, Zhu, Lingshan wrote:
> > As Michael said, software based nesting is used..
> > See if actual hw based devices can implement it or not. Many components of cpu cannot do N level nesting either, but may be virtio can.
> > I don’t know how yet.
> two facts:
> 1. virtio has worked for nested for years
> 2. your admin vq lm solution does not work for nested

On the first: virtio works, but nested migration doesn't. The way virtio works doesn't
allow the L1 hypervisor to migrate L2 if it passes virtio through to L2,
except by using full emulation approaches such as shadow vq -
surely, these are still possible.

On the second: "doesn't work" is an overstatement.  It remains to be seen what else can
be done though.  You hinted at a dependency on PASID.  Have the owner be a
PASID and Parav's approach seems to work with little to no change.

-- 
MST

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:28                         ` Michael S. Tsirkin
  2023-10-13 11:42                           ` Parav Pandit
@ 2023-10-16  8:41                           ` Zhu, Lingshan
  2023-10-16  9:00                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-16  8:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/13/2023 7:28 PM, Michael S. Tsirkin wrote:
> On Fri, Oct 13, 2023 at 05:06:02PM +0800, Zhu, Lingshan wrote:
>> Hint: how do you define device context for every device type, e.g,
>> virtio-fs.
>> Don't say you only migrate virtio-net or blk.
> Indeed. I don't think anyone can avoid either defining that or
> leaving that completely up to implementations and just hoping
> they don't miss anything. Given the choice I'm definitely for
> defining. Each device must grow a "migration" section documenting
> the context.
>
Yes, I agree, so let's implement a stateless live migration first.

Thanks,
Zhu Lingshan

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:49                           ` Michael S. Tsirkin
  2023-10-13 12:00                             ` Parav Pandit
@ 2023-10-16  8:46                             ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-16  8:46 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/13/2023 7:49 PM, Michael S. Tsirkin wrote:
> On Fri, Oct 13, 2023 at 11:28:50AM +0000, Parav Pandit wrote:
>>>>>>> inflight descriptor tracking will be implemented by Eugenio in V2.
>>>>>> When we have near complete proposal from two device vendors, you
>>>>>> want to push something to unknown future without reviewing the work;
>>>>>> does not
>>>>> make sense.
>>>>> Didn't I ever provide feedback to you? Really?
>>>> No. I didn’t see why you need to post a new patch for dirty page tracking,
>>> when it is already present in this series.
>> This is plain ignorance and shows non_cooperative mode of working in technical committee.
>
> I personally think it's fine to have multiple proposals on the table.
> For example, current Zhu Lingshan's patch is clearly incomplete without
> memory change tracking.
> Why shouldn't he post a patchset demonstrating
> how that is supposed to work in his view?
> I am personally interested in seeing how is that
> supposed to work   - his latest proposal relies on migrating by
> trapping memory accesses but DMA can't be trapped like this generally.
Yes, dirty page tracking will be included in V2. The config space BAR/capability
looks like the draft I presented before, and the dirty page bitmap is placed in
host memory that is isolated by the PASID. The device writes to the bitmap by DMA,
and that can avoid the read-modify-write (RMW) process.

Let me sync with Eugenio to see whether we can include his "inflight
descriptors tracking" in this V2.

Thanks
>
> In the end we want something addressing all use cases though
> and integrating reasonably well with existing ecosystem.
>
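
For readers following the dirty page tracking discussion, here is a minimal sketch of a page-granular dirty bitmap. The 4 KiB page size, the class and the method names are assumptions for illustration; neither proposal's actual layout is shown. Setting a bit here is a plain software read-modify-write, so how a device would avoid RMW is not modeled.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


class DirtyBitmap:
    """One bit per guest page; a set bit means the page was written since the last read."""

    def __init__(self, guest_mem_bytes: int):
        page_size = 1 << PAGE_SHIFT
        self.num_pages = (guest_mem_bytes + page_size - 1) >> PAGE_SHIFT
        self.bits = bytearray((self.num_pages + 7) // 8)

    def mark_dma_write(self, addr: int, length: int) -> None:
        """Record that a DMA write touched [addr, addr + length)."""
        if length <= 0:
            return
        first = addr >> PAGE_SHIFT
        last = (addr + length - 1) >> PAGE_SHIFT
        if addr < 0 or last >= self.num_pages:
            raise ValueError("write falls outside tracked guest memory")
        for page in range(first, last + 1):
            self.bits[page // 8] |= 1 << (page % 8)

    def read_and_clear(self) -> list:
        """Return the dirty page numbers and reset the bitmap for the next round."""
        dirty = [
            page
            for page in range(self.num_pages)
            if self.bits[page // 8] & (1 << (page % 8))
        ]
        for i in range(len(self.bits)):
            self.bits[i] = 0
        return dirty


if __name__ == "__main__":
    bitmap = DirtyBitmap(16 * 4096)
    # A 4 KiB write starting 100 bytes into page 2 spills into page 3.
    bitmap.mark_dma_write(2 * 4096 + 100, 4096)
    print(bitmap.read_and_clear())
    print(bitmap.read_and_clear())
```

A migration loop would call read_and_clear() repeatedly and re-copy only the reported pages until the remaining set is small enough to stop the guest.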


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-16  8:41                           ` Zhu, Lingshan
@ 2023-10-16  9:00                             ` Michael S. Tsirkin
  2023-10-16  9:44                               ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-16  9:00 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Mon, Oct 16, 2023 at 04:41:46PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 10/13/2023 7:28 PM, Michael S. Tsirkin wrote:
> > On Fri, Oct 13, 2023 at 05:06:02PM +0800, Zhu, Lingshan wrote:
> > > Hint: how do you define device context for every device type, e.g,
> > > virtio-fs.
> > > Don't say you only migrate virtio-net or blk.
> > Indeed. I don't think anyone can avoid either defining that or
> > leaving that completely up to implementations and just hoping
> > they don't miss anything. Given the choice I'm definitely for
> > defining. Each device must grow a "migration" section documenting
> > the context.
> > 
> Yes, I agree, so let's implement a stateless live migration first.
> 
> Thanks,
> Zhu Lingshan

Not sure what stateless migration is - I don't see how
you can both agree and say let's not define how to migrate state.

-- 
MST


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:28                         ` Parav Pandit
  2023-10-13 11:49                           ` Michael S. Tsirkin
@ 2023-10-16  9:44                           ` Zhu, Lingshan
  2023-10-18  5:00                             ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-16  9:44 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/13/2023 7:28 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, October 13, 2023 2:36 PM
> [..]
>>> Because it does not work for passthrough mode.
>> what are you talking about?
>> Config space does not work passthrough?
> Once the register space of the VF that is supposed to be used by the live migration is passed to the guest, it is under guest control.
> Hence, live migration driver won't be able to use it.
Does the guest control the device status to reset the device itself? Is that harmful?
These facilities can be trapped and emulated, even the feature bits, right?
You know the guest doesn't actually access the device config space directly;
there is a vfio/vdpa driver, right?
>
>> Have you ever tried pass through a virtio device to a guest?
> :)
> Please explain how the question is relevant to this discussion in separate thread, so that one can keep technical focus.
> (Please keep your discussion technical, instead of derogatory to other members).
if you want me to answer your question, at least you SHOULD NOT cut off 
the context, or you are trying to confuse everyone.
Or did you try to avoid or hide anything? I am not sure this is a good 
practice.

The context in last discussion is:

me: OK, I brought up Jason's proposal to make everything easier, and I see
it is refused.
you: Because it does not work for passthrough mode.
me: what are you talking about?
     Config space does not work passthrough?
     Have you ever tried pass through a virtio device to a guest?

So I ask you to try passing through a virtio-pci device to a guest,
then check whether the config space works in pass-through mode.

Again, don't cut off threads before the discussion is closed.
>
>> Let me repeat again, these live migration facilities are
>> per-device(per-VF) facility, so it only migrates itself.
>>
> Since they are per device (per VF), they reside in the guest VM. Hence, VMM cannot live migrate it.
You know the config space can be trapped and emulated, and the
hypervisor takes ownership of the device once the guest freezes in the
stop window.
>
>> And for pass through, you can try passthrough a virtio device to a guest, see
>> how the guest initialize the device through the config space.
>>
>> That is really basic virtualization, not hard to test.
> Repeated points, I am omitting.
ok, if you get it, let's close it.
>
>>>>>> inflight descriptor tracking will be implemented by Eugenio in V2.
>>>>> When we have near complete proposal from two device vendors, you
>>>>> want to push something to unknown future without reviewing the work;
>>>>> does not
>>>> make sense.
>>>> Didn't I ever provide feedback to you? Really?
>>> No. I didn’t see why you need to post a new patch for dirty page tracking,
>> when it is already present in this series.
> This is plain ignorance and shows non_cooperative mode of working in technical committee.
You have cut off the thread again, so I can't read the context.
>
>>> I would like to understand and review this aspects.
>>> Same for the device context.
>> you will see dirty page tracking in my V2, as I repeated for many times.
> Since you are not co-operative, I have less sympathy to see V2.
> I don’t see a reason to see when, it is fully presented here.
Again, please don't take it personally and please be professional.

Speaking of collaboration, please at least respect others' time and answers.
Both Jason and I have responded to you multiple times on the same
questions (for example, FLR, nested, passthrough).
If our answers are ignored again and again, and then after a few days or
hours you come back asking the same question again, what's the point?

And please don't cut off any threads before we close the discussion.
>
>> For device context, we have discussed this in other threads, did you ignored that
>> again?
> No. I didn’t. I replied that the generic infrastructure is built that enables every device type to migrate by defining their device context.
Don't we have a conclusion there, or did you miss anything? Since you
refuse to define the device context for every device type, how do you
migrate stateful devices?

So we should implement a stateless live migration solution, right?
>
>> Hint: how do you define device context for every device type, e.g, virtio-fs.
>> Don't say you only migrate virtio-net or blk.
> I didn’t say it. I said to migrate all 30+ device types.
> And infrastructure is presented here.
So please define the device context for all the devices.
How about starting with virtio-fs?
>
>>>>> You are still in the mode of _take_ what we did with near zero explanation.
>>>>> You asked question of why passthrough proposal cannot advantage of
>>>>> in_band
>>>> config registers.
>>>>> I explained technical reason listed here.
>>>> I have answered the questions, and asked questions for many times.
>>>> What do you mean by "why passthrough proposal cannot advantage of
>>>> in_band config registers."?
>>>> Config space work for passthrough for sure.
>>> Config space registers are passthrough the guest VM.
>>> Hence hypervisor messing it with, programming some address would result in
>> either security issue.
>>> Or functionally broken, to sustain the functionality, each nested layer needs
>> one copy of these registers for each nest level.
>>> So they must be trapped somehow.
>> trap and emulated are basic virtualization.
> Not for passthrough devices, sorry.
> See the paper that Jason pointed out.
> The control program/VMM trap is involved only in the privileged operations of the VMM.
> Virtio cvqs, virtio registers are not the privileged operation of the VMM, because they are of the native virtio device itself.
> Period.
Since the context is cut off again, I failed to read the context.

But config space can be trapped and emulated, right?
When the guest accesses the device config space, it actually
accesses the hypervisor-presented config space.
>
>>> Secondly I don’t see how one can read 1M flows using config registers.
>> Not sure what you are talking about, beyond the spec?
> The spec which is under works for few months by multiple technical members.
> Please subscribe to virtio-comment mailing list.
> How come you changed your point from cvq to different argument of out of spec? :)
I mean, what are your 1M flows? Are they beyond the spec?
>
>>>>> So please don’t jump to conclusions before finishing the discussion
>>>>> on how
>>>> both side can take advantage of each other.
>>>>> Lets please do that.
>>>> We have proposed a solution, right?
>>>>
>>> Which one? To do something in future?
>>> I don’t see a suggestion on how one can use device context and dirty page
>> tracking for nested and passthrough uniformly.
>>> I see a technical difficulty in making both work with uniform interface.
>> Please don't ignore previous answers, don't force us repeat again and again.
>>
> You didn’t answer, how.
> Your answer was "you will post dirty page tracking without reviewing current" and Eugenio will post v2....
Yes, will do, and you can check the patch when it is posted.

Eugenio will cook a patch for in-flight descriptors, not dirty page tracking;
that is mine.
>
>> It is Jason's proposal. Please refer to previous threads, also for device context
>> and dirty pages.
>>>> I still need to point out: admin vq LM does not work, one example is nested.
>>> As Michael said, please don’t confuse between admin commands and admin
>> vq.
>> anyway, admin vq live migration don't work for nested.
> I am convinced by the paper that Jason pointed out.
>
> A nested solution involves a member device supporting the nesting without trap and emulation so that it follows the two properties:
> The efficiency property and equivalence property.
>
> Hence a member device which wants to support nested case, should present itself with attributes to support nesting.
failed to process the sentence, but I am glad you are convinced by the 
paper.
>
>
>>>>>> There are no scale problem as I repeated for many time, they are
>>>>>> per-device basic facilities, just migrate the VF by its own
>>>>>> facility, so there are no 40000 member devices, this is not per PF.
>>>>>>
>>>>> I explained that device reset, flr etc flow cannot work when
>>>>> controlling and
>>>> controlled functions are single entity for passthrough mode.
>>>>> The scale problem is, one needs to duplicate the registers on each VF.
>>>>> The industry is moving away from the register interface in many
>>>>> _real_ hw
>>>> devices implementation.
>>>>> Some of the examples are IMS, SIOV, NVMe and more.
>>>> we have discussed this for many times, please refer to previous
>>>> threads, even with Jason.
>>> I do not agree for any registers to add to the VF which are reset on
>> device_reset and FLR.
>>> As it does not work for passthrough mode.
>> Jason has answered your these FLR questions for many times, I don't want to
>> repeat his words, even myself have answered many times. If you keep ignoring
>> the answers, and ask again and again, what is the point?
>>
>> So please refer to the previous threads.
> I don’t think I asked the question above. Please re-read.
You cut it off again; what question? If it is about FLR, I believe
Jason has answered it many times.
>
>>>>>> The device context can be read from config space or trapped, like
>>>>>> shadow
>>>>> There are 1 million flows of the net device flow filters in progress.
>>>>> Each flow is 64B in size.
>>>>> Total size is 64MB.
>>>>> I don’t see how one can read such amount of memory using config
>> registers.
>>>> control vq?
>>> The control vq and flow filter vqs are owned by the guest driver, not the
>> hypervisor.
>>> So no, cvq cannot be used.
>> first, don't cut off the threads, don't delete words, that really confusing readers.
>>
> Your comments are so long that it is hard to follow such a long thread.
> Hence only the related comments are kept.
> But I understand, will try to avoid.
>
>> And I think you misunderstand a lot of virtualization fundamentals, at least have
>> a look at how shadow control vq works.
>>
> In case if you don’t know, the shadow cvq acceleration for Nvidia ConnectX6-DX is done jointly with Dragos and me, with recent patches from Sie-Wei.
>
> I don’t think I missed it.
>
> Shadow vq is great when you don’t have underlying support from the device.
>
> When you have passthrough member devices, they are not trapped or emulated.
> The future hypervisor must not be able to see the contents of the cvq, data vqs or addresses programmed by the guest.
> And hence the infrastructure is geared towards such approach.
I failed to read the full context as you cut off them. I can't even read 
your original questions, they are truncated.

Anyway, lets migrate device without device-context first.
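As a rough illustration of the scale quoted in this exchange (1 million flow filters of 64 bytes each), the sketch below works out the total context size and how many register accesses it would take to move it. The 4-byte register width is an assumption made only for this example; the thread does not specify an access width.

```python
# Figures quoted in the thread: 1 million flow filters, 64 bytes each.
FLOW_COUNT = 1_000_000
FLOW_SIZE_BYTES = 64

# Assumption for illustration only: the context is read through a
# 4-byte-wide register, one read per access.
REGISTER_WIDTH_BYTES = 4


def context_size_bytes(flows: int = FLOW_COUNT, flow_size: int = FLOW_SIZE_BYTES) -> int:
    """Total size of the flow-filter part of the device context."""
    return flows * flow_size


def register_reads_needed(total_bytes: int, width: int = REGISTER_WIDTH_BYTES) -> int:
    """Number of register accesses needed to move total_bytes (rounded up)."""
    return -(-total_bytes // width)


if __name__ == "__main__":
    size = context_size_bytes()
    print(size)                         # 64000000 bytes, i.e. 64 MB
    print(register_reads_needed(size))  # 16000000 accesses
```

Under that assumption, the flow filters alone would need 16 million register accesses per device, which is the scale concern raised against a register-based interface.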
>
>> And the parameters set to config vq are also device context as we discussed for
>> many times.
>>>> Or do you want to migrate non-virtio context?
>>> Every thing is virtio device context.
>> see above
>>>>>> control vq which is already done, that is basic virtualization.
>>>>> There is nothing like "basic virtualization".
>>>>> What is proposed here is fulfilling the requirement of passthrough mode.
>>>>>
>>>>> Your comment is implying, "I don’t care for passthrough
>>>>> requirements, do
>>>> non_passthrough".
>>>> that is your understanding, and you misunderstood it. Config space
>>>> servers passthrough for many years.
>>> "Config space servers" ?
>>> I do not understand it, can you please explain what does that mean?
>>>
>>> I do not see your suggestion on how one can implement passthrough member
>> device when passthrough device does the dma and migration framework also
>> need to do the dma.
>> Try pass through a virtio device to a guest and learn how the guest take
>> advantage the config space before you comment.
> Right. It does not work. The guest is doing the device_reset and flr.
> Hence, it is resetting everything. All the dirty page log is lost.
> All the device context is lost.
> Hypervisor didn’t see any of this happening, because it didn’t do the trap.
>
> Look, if you are going to continue to argue that you must do trap + emulation and don’t talk about passthrough,
> Please stop here, because discussion won't go anywhere.
>
> I made my best to answer the limitations in very first email where you asked.
OK, I see the gap, and I am sure we can help you here.
Try consider a question:
how do you define pass-through? Can a guest access the device without a 
host driver helper?
>
>>> That basic facility is missing dirty page tracking, P2P support, device context,
>> FLR, device reset support.
>>> Hence, it is unusable right now for passthough member device.
>>> And the 6th problematic thing in it is that it does not scale with member devices.
>> Please refer to previous discussions, it is meaningless if you keep ignoring our
>> answers and keep asking the same questions.
> Again, please re-read, I didn’t ask the question.
> I replied 6 problems that are not solved.
I believe we have answered for many times. The questions are cut off again,
but how about search for previous answers?
>
>>>>>> If you want to migrate device context, you need to specify device
>>>>>> context for every type of device, net maybe easy, how do you see virtio-fs?
>>>>> Virtio-fs will have its own device context too.
>>>>> Every device has some sort of backend in varied degree.
>>>>> Net being widely used and moderate complex device.
>>>>> Fs being slightly stateful but less complex than net, as it has far
>>>>> less control
>>>> operations.
>>>> so, do you say you have implement a live migration solution which can
>>>> migrate device context, but only work for net or block?
>>> I don’t think this question about implementation has any relevance.
>>> Frankly feels like a court to me. :(
>>> No. I didn’t say that.
>>> We have implemented net, fs, block devices and single framework proposed
>> here can support all 3 and rest 28+.
>>> The device context part in this series do not cover special/optional things of
>> all the device type.
>>> This is something I promised to do gradually, once the framework looks good.
>> If you don't define them, only talking about "migrate the device context" but
>> don't tell us what do migrate, does this make sense to anybody?
>>>> Then you should call it virtio net/blk migration and implement in
>>>> net/block section.
>>> No. you misunderstood. My point was showing orthogonal complexities of net
>> vs fs.
>>> I likely failed to explain that.
>> see above, anyway you need to define them, how about starting from virtio-fs?
>>>>> In fact virtio-fs device already discusses the migrating the device
>>>>> side state, as
>>>> listed in device context.
>>>>> So virtio-fs device will have its own device-context defined.
>>>> if you want to migrate it, you need to define it
>>> Sure.
>>> Only device specific things to be defined in future.
>> Now, not future if you want to migrate device context.
> It is not mandatory, and it is impractical do everything in one series.
> It is planned for 1.4.
Really, you want to define the device context for every device type?

Remember, don't migrate the device context before you define it; otherwise how can
the HW implementations know what to do?
>
>>> Rest is already present.
>>> We are not going to define all the device context in one patch series that no
>> one can review reliably.
>>> It will be done incrementally.
>> so you agree at least for now we should migrate stateless devices, right?
>>> But the feedback, I am taking is, we need to add a command that indicates
>> which TLVs are supported in the device migration.
>>> So virtio-fs or other device migration capabilities can be discovered.
>>> I will cover this in v2.
>> so you propose a solution as "virtio migration", but only migrate selective types
>> of devices?
>> You should rename it to be "virtio-net live migration".
> Sorry, I wont. Because infrastructure is for majority device types.
>
> Which field did you observe which is net specific?
> We want to cover all the device types.
> Don’t need to cook their context in one series.
So it does not work for all device types? It is limited to some specific types?
You still need to rename it, whatever.
>
>>> Thanks a lot for this thoughts.
>>>
>>>>> The infrastructure and basic facilities are setup in this series,
>>>>> that one can
>>>> easily extend for all the current and new device types.
>>>> really? how?
>>>>>> And we are migrating stateless devices, or no? How do you migrate virtio-
>> fs?
>>>>>>> 2. sharing such large context and write addresses in parallel for
>>>>>>> multiple devices cannot be done using single register file
>>>>>> see above
>>>>>>> 3. These registers cannot be residing in the VF because VF can
>>>>>>> undergo FLR, and device reset which must clear these registers
>>>>>> do you mean you want to audit all PCI features? When FLR, the
>>>>>> device is reset, do you expect a device to remember anything after FLR?
>>>>> Not at all. VF member device will not remember anything after FLR.
>>>>>> Do you want to trap FLR? Why?
>>>>> This proposal does _not_ want to trap the FLR in the hypervisor virtio driver.
>>>>>
>>>>> When one does the mediation-based design, it must trap/emulate/fake
>>>>> the
>>>> FLR.
>>>>> It helps to address the case of nested as you mentioned.
>>>> once passthrough, the guest driver can access the config space to
>>>> reset the device, right?
>>>>>> Why FLR block or conflict with live migration?
>>>>> It does not block or conflict.
>>>> OK, cool, so let's make this a conclusion
>>>>> The whole point is, when you put live migration functionality on the
>>>>> VF itself,
>>>> you just cannot FLR this device.
>>>>> One must trap the FLR and do fake FLR and build the whole
>>>>> infrastructure to
>>>> not FLR The device.
>>>>> Above is not passthrough device.
>>>> No, the guest can reset the device, even causing a failed live migration.
>>> Not in the proposal here.
>>> Can you please prove how in the current v1 proposal, device reset will fail the
>> migration?
>>> I would like to fix it.
>> if the device is reset, it forgets everything right?
> Right. This is why all dirty page tracking and device context are lost on device reset.
> Hence, the controlling function and controlled function are two different entities.
So there can be inconsistent migrations and races, right? And if the guest
resets the device, the hypervisor should actually let it be, right?
>
>>>>>>> 4. When VF does the DMA, all dma occurs in the guest address
>>>>>>> space, not in
>>>>>> hypervisor space; any flr and device reset must stop such dma.
>>>>>>> And device reset and flr are controlled by the guest (not mediated
>>>>>>> by
>>>>>> hypervisor).
>>>>>> if the guest reset the device, it is totally reasonable operation,
>>>>>> and the guest own the risk, right?
>>>>> Sure, but the guest still expects its dirty pages and device context
>>>>> to be
>>>> migrated across device_reset.
>>>>> Device_reset will lose all this information within the device if
>>>>> done without
>>>> mediation and special care.
>>>> No, if the guest reset a device, that means the device should be
>>>> RESET, to forget its config, that would be really weird to migrate a
>>>> fresh device at the source side, to be a running device at the destination
>> side.
>>> Device reset not doing the role of reset is just a plain broken spec.
>> why? The reset behavior is well defined in the spec, and works fine for years.
> So any new construct that one adds, it will be reset as well and dirty page track is lost.
Yes and do you want to prevent that? You may surprise the guest.
>
>>>>> So, to avoid that now one needs to have fake device reset too and
>>>>> build that
>>>> infrastructure to not reset.
>>>>> The passthrough proposal fundamental concept is:
>>>>>
>>>>> all the native virtio functionalities are between guest driver and
>>>>> the actual
>>>> device.
>>>> see above.
>>>>>> and still, do you want to audit every PCI features? at least you
>>>>>> didn't do that in your series.
>>>>> Can you please list which PCI features audit you are talking about?
>>>> you audit FLR, then do you want to check everyone?
>>>> If no, how to decide which one should be audited, why others not?
>>> I really find it hard to follow your question.
>>>
>>> I explained in patch 5 and 8 about interactions with the FLR and its support.
>>> Not sure what you want me to check.
>>>
>>> You mentioned that "I didn’t audit every PCI features"? So can you please list
>> which one and in relation to which admin commands?
>> Your job to audit everyone if you talk about FLR. Because FLR is PCI spec, not
>> virtio, you need to explain why other PCI features not need to be audited.
>>
> Sure, but when you point a finger that I didn’t audit, please mention what is not audited.
Well, we are migrating virtio devices, but you keep talking about PCI, so do
you want to take every PCI functionality into consideration?
>
>> We have explained why FLR is not a concern for many times, and I don't want
>> to repeat, please refer to previous discussions.
> You seem to ignore the first paragraph of theory of operation that FLR is not trapped.
This is the guest issuing FLR, right? If so, the guest owns the risks and
the hypervisor should not prevent that.
>
>>>>> Keep in mind, that will all the mediation, one now must equally
>>>>> audit all this
>>>> giant software stack too.
>>>>> So maybe it is fine for those who are ok with it.
>>>> so you agree FLR is not a problem, at least for config space solution?
>>> I don’t know what you mean "FLR is not a problem".
>>>
>>> FLR on the VF must work as it works without live migration for passthrough
>> device as today.
>>> And admin commands have some interactions with it.
>>> And this proposal covers it.
>>> I am missing some text that Michael and Jason pointed out.
>>> I am working on v2 to annotate or better word them.
>> When guest reset the device, the device should be reset for sure. then it forgets
>> everything, how do you expect the reset-ed device still work for live migration?
>> is it a race?
> I don’t expect live migration to work at all with such an approach.
> This is why in my proposal live migration occurs on the owner device, while controlled function (member device) is undergoing the device reset.
see above
>
>>>>>> For migration, you know the hypervisor takes the ownership of the
>>>>>> device in the stop_window.
>>>>> I do not know what stop_window means.
>>>>> Do you mean stop_copy of vfio or it is qemu term?
>>>> when guest freeze.
>>>>>>> 5. Any PASID to separate out admin vq on the VF does not work for
>>>>>>> two
>>>>>> reasons.
>>>>>>> R_1: device flr and device reset must stop all the dmas.
>>>>>>> R_2: PASID by most leading vendors is still not mature enough
>>>>>>> R_3: One also needs to do inversion to not expose PASID capability
>>>>>>> of the member PCI device to not expose
>>>>>> see above and what if guest shutdown? the same answer, right?
>>>>> Not sure, I follow.
>>>>> If the guest shutdown, the guest specific shutdown APIs are called.
>>>>>
>>>>> With passthrough device, R_1 just works as is.
>>>>> R_3 is not needed as they are directly given to the guest.
>>>>> R_2 platform dependency is not needed either.
>>>> I think we already have a concussion for FLR.
>>> I don’t have any concussion.
>>> I wrote what to be supported for the FLR above.
>> OK, again, our discussions has been ignored again, and all start over again.
>>
>> Would you please read our previous discussions?
> You asked the question about why it wont work, I answered.
> I don’t see a point of debating same thing over again.
Is that cut off again?

if still about FLR, so please see above comments.
And I agree if the answers are ignored again, we don't need to repeat.
>
>>>> For PASID, what blocks the solution?
>>> When the device is passthrough, PASID capabilities cannot be emulated.
>>> PASID space is owned fully by the guest.
>>>
>>> There is no single known cpu vendor support splitting pasid between
>> hypervisor and guest.
>>> I can double check, but last I recall that Linux kernel removed such weird
>> support.
>> do you know there is something called vIOMMU?
> Probably yes.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-16  9:00                             ` Michael S. Tsirkin
@ 2023-10-16  9:44                               ` Zhu, Lingshan
  0 siblings, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-16  9:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/16/2023 5:00 PM, Michael S. Tsirkin wrote:
> On Mon, Oct 16, 2023 at 04:41:46PM +0800, Zhu, Lingshan wrote:
>>
>> On 10/13/2023 7:28 PM, Michael S. Tsirkin wrote:
>>> On Fri, Oct 13, 2023 at 05:06:02PM +0800, Zhu, Lingshan wrote:
>>>> Hint: how do you define device context for every device type, e.g,
>>>> virtio-fs.
>>>> Don't say you only migrate virtio-net or blk.
>>> Indeed. I don't think anyone can avoid either defining that or
>>> leaving that completely up to implementations and just hoping
>>> they don't miss anything. Given the choice I'm definitely for
>>> defining. Each device must grow a "migration" section documenting
>>> the context.
>>>
>> Yes, I agree, so let's implement a stateless live migration first.
>>
>> Thanks,
>> Zhu Lingshan
> Not sure what is stateless migration - I don't see how
> you can both agree and say let's not define how to migrate state.
Yes, that is the point



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-13 11:54                           ` Parav Pandit
@ 2023-10-16  9:47                             ` Zhu, Lingshan
  2023-10-18  5:02                               ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-16  9:47 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/13/2023 7:54 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, October 13, 2023 3:14 PM
>>>>>> How do you transfer the ownership?
>>>>> An additional ownership delegation by a new admin command.
>>>> if you think this can work, do you want to cook a patch to implement
>>>> this before you submitting this live migration series?
>>> I answered this already above.
>> talk is cheap, show me your patch
> Huh. We presented the infrastructure that migrates, 30+ device types, covering device context ideas from Oracle.
> Covering P2P, supporting device_reset, FLR, dirty page tracking.
>
> Please have some respect for other members who covered more ground than your series.
>
> What more? Apply the same nested concept on the member device as Michael suggested, it is nested virtualization maintain exact same semantics.
> So a VF is mapped as PF to the L1 guest.
> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
>
> This nested work can be extended in future, once first level nesting is covered.
>
>> Answer all questions above, if you think a management VF can work, please
>> show me your patch.
> The idea evolves from technical debate rather than pointing fingers like your comment.
>
> I think a positive discussion with Michael and a pointer to the paper from Jason gave a good direction of doing _right_ nesting that follows two principles.
> a. efficiency property
> b. equivalence property
>
> (c. resource control is natural already)
>
> Both apply at VMM and at VM level enabling recursive virtualization, by having VF that can act as PF inside the guest.
>
> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
Please just show me your patch resolving these open issues; how about starting
from defining the virtio-fs device context and your management VF?


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-13 13:49                           ` Michael S. Tsirkin
@ 2023-10-16  9:50                             ` Zhu, Lingshan
  0 siblings, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-16  9:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/13/2023 9:49 PM, Michael S. Tsirkin wrote:
> On Fri, Oct 13, 2023 at 05:44:27PM +0800, Zhu, Lingshan wrote:
>>> As Michael said, software based nesting is used..
>>> See if actual hw based devices can implement it or not. Many components of cpu cannot do N level nesting either, but may be virtio can.
>>> I don’t know how yet.
>> two facts:
>> 1. virtio has worked for nested for years
>> 2. your admin vq lm solution does not work for nested
> First virtio works but nested migration doesn't: the way virtio works doesn't
> allow L1 hypervisor to migrate L2 if it passes virtio through to L2,
> except by using full emulation approaches such as shadow vq -
> surely, these are still possible.
hope our live migration basic facilities can help resolve these problems.
>
> Second doesn't work is an overstatement.  It remains to be seen what else can
> be done though.  You hinted at a dependency on PASID.  Have owner be a
> PASID and Parav's approach seems to work with little to no change.
for nested, it requires vIOMMU for sure.
>



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13  6:36                 ` Parav Pandit
@ 2023-10-17  1:41                   ` Jason Wang
  2023-10-18  8:16                     ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-17  1:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Fri, Oct 13, 2023 at 2:36 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Friday, October 13, 2023 6:46 AM
>
> [..]
> > > > > > It's still not clear to me how this is done.
> > > > > >
> > > > > > 1) guest starts FLR
> > > > > > 2) adminq freeze the VF
> > > > > > 3) FLR is done
> > > > > >
> > > > > > If the freezing doesn't wait for the FLR, does it mean we need
> > > > > > to migrate to a state like FLR is pending? If yes, do we need to
> > > > > > migrate the other sub states like this? If not, why?
> > > > > >
> > > > > In most practical cases #2 followed by #1 should not happen as on
> > > > > the source
> > > > side the expected is mode change to stop from active.
> > > >
> > > > How does the hypervisor know if a guest is doing what without trapping?
> > > >
> > > Hypervisor does not know. The device knows being the recipient of #1 and #2.
> >
> > We are discussing the possibility in software/driver side isn't it?
> >
> > 1) is initiated from the guest
> > 2) is initiated from the hypervisor
> >
> > Both are softwares, and you're saying 2) should not happen after 1) since the
> > device knows what is being done by guests? How can devices control software
> > behaviour?
> >
> Device do not control software behavior.
> i.e. either hypervisor can initiate device mode change to stop (not freeze) or guest can initiate FLR.
> Device knows which is initiated first as single recipient of both.
> Therefore, device responds accordingly.
> For example, in the sequence you described,
> A device will delay mode change command response, until the FLR is completed.

Finally but ok.
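The behaviour described above, where the device is the single recipient of both requests and delays the mode change response until the FLR has completed, can be sketched as a toy model. The class and method names here are hypothetical and exist only for illustration; they are not taken from the patch series or the virtio specification.

```python
class MemberDevice:
    """Toy model: the device is the single recipient of FLR and mode-change requests."""

    def __init__(self):
        self.mode = "active"
        self.flr_in_progress = False
        self.deferred_mode = None   # mode change waiting for the FLR to finish
        self.log = []               # order in which events happen

    def start_flr(self):
        """Guest-initiated FLR begins."""
        self.flr_in_progress = True
        self.log.append("flr started")

    def request_mode(self, new_mode):
        """Hypervisor-initiated mode change; returns True if it completed immediately."""
        if self.flr_in_progress:
            # The command is accepted, but its response is delayed.
            self.deferred_mode = new_mode
            self.log.append(f"mode {new_mode} deferred")
            return False
        self._apply_mode(new_mode)
        return True

    def complete_flr(self):
        """FLR finishes; any deferred mode change completes right after it."""
        self.flr_in_progress = False
        self.log.append("flr done")
        if self.deferred_mode is not None:
            mode, self.deferred_mode = self.deferred_mode, None
            self._apply_mode(mode)

    def _apply_mode(self, new_mode):
        self.mode = new_mode
        self.log.append(f"mode {new_mode} done")


if __name__ == "__main__":
    dev = MemberDevice()
    dev.start_flr()
    dev.request_mode("stop")    # deferred, FLR still running
    dev.complete_flr()
    print(dev.log)
```

In this model the hypervisor does not need to know that an FLR is in flight; it simply sees the mode change command complete later than usual.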

>
>
> > This only possible thing is to make sure 3) is done before 2) That is what I'm
> > asking but you are saying freeze doesn't need to wait for FLR...
> >
> I think I responded in previous email further down on synchronization point being fw.
> I meant to say software do not need to wait for initiation of the freeze mode command.

For software, did you mean the hypervisor?

> Just the command will complete at right time.
>
> This is anyway very corner case.
> On source hypervisor as written in the theory of operation, the sequence is active->stop->freeze.
> When mode change is done to stop, the vcpus are already suspended.

The problem here is not the vcpu but when the FLR is being done, since
it may change the device context.

>
> I agree FLR may have been initiated and driver is waiting now for 100msec.

For driver, did you mean the driver in the guest?

>
> So yes, device single entity synchronized it.
>
> > >
> > > > > But ok, since the active to freeze mode change is allowed, let's discuss
> > above.
> > > > >
> > > > > A device is the single synchronization point for any device reset,
> > > > > FLR or admin
> > > > command operation.
> > > >
> > > > So you agree we need synchronization? And I'm not sure I get the
> > > > meaning of synchronization point, do you mean the synchronization
> > > > between freeze/stop and virtio facilities?
> > > >
> > > Synchronization means, handling two events in parallel such as FLR and other.
> >
> > Great. So we have a perfect race:
> >
> > 1) guest initiates FLR
> > 2) device starts FLR
> > 3) hypervisor stops and freezes the device
> > 4) device is frozen
> > 5) hypervisor reads device context A
> > 6) migrate device context A
> > 8) migration is done
> > 9) FLR is done
> > 10) hypervisor read device context B
> >
> > So we end up with inconsistent device context, no? Dest want B or A+B, but you
> > give A.
> >
> Since #1 and #2 is done before #3, the device knows to finish the FLR, hence #9 is completed before #4.

Ok, that's my understanding and that's why I'm asking, but you said
freeze/stop doesn't need to wait for FLR before.
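The race in the numbered sequence above can be reduced to a small sketch: if the freeze completes while the FLR is still pending, the hypervisor reads context A while the source device ends up with context B. The function below is a simplified model under that assumption; the context labels "A" and "B" follow the thread, everything else is hypothetical.

```python
def migrate(wait_for_flr: bool) -> dict:
    """Return which context is migrated and which one the source ends up with.

    Toy model: the guest has already initiated an FLR when the hypervisor
    freezes the device and reads its context.
    """
    device_context = "A"    # context before the FLR takes effect
    flr_pending = True      # steps 1-2: FLR initiated and started

    if wait_for_flr and flr_pending:
        # Device finishes the FLR before completing the freeze.
        device_context = "B"
        flr_pending = False

    snapshot = device_context   # hypervisor reads the context after the freeze

    if flr_pending:
        # FLR finishes only after the context was read and migrated.
        device_context = "B"

    return {"migrated": snapshot, "source_final": device_context}


if __name__ == "__main__":
    print(migrate(wait_for_flr=False))  # {'migrated': 'A', 'source_final': 'B'}
    print(migrate(wait_for_flr=True))   # {'migrated': 'B', 'source_final': 'B'}
```

Only the second ordering gives the destination a context consistent with the source, which is why the completion order of the FLR and the freeze matters here.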

>
> Alternatively, in above sequence when destination sees #10, it can immediately finish the FLR as dest device is not under FLR, treating it as no-op.
>
> Both ways to handle are fine. (and rare in practice, but yes, its possible).
>
> I will write both the options in the device requirements.
>
> > >
> > > > > So, the migration driver do not need to wait for FLR to complete.
> > > >
> > > > I'm confused, you said below that device context could be changed by FLR.
> > > >
> > > Yes.
> > > > If FLR needs to clear device context, we can have a race where
> > > > device context is cleared when we are trying to read it?
> > > >
> > > I didn’t say clear the context.
> > > FLR updates the device context.
> >
> > In what sense?
> >
> Indicating a new device context and discarding the old one.

For example, what will queue_address have after an FLR?

> I am glad you asked this. I wanted to get the basic part captured before adding this optimization.

Ok.

> Probably it is good to add it now in the v2 as we crossed this stage now.
>
> > > Device is serving the device context read write commands, serving FLR,
> > > answering mode change command, So device knows the best how to avoid
> > any race.
> >
> > You want to leave those details for the vendor to figure out? If devices know
> > everything, why do we need device normative?
> >
> Device knows its implementation.
> Implementation guidelines to be in the normative.
> I will add it to the normative.
>
> > I see issues at least for FLR, I'm pretty sure they are others. If a design requires
> > us to audit all the possible conflicts between virtio facilities and transport. It's a
> > strong hint of layer violation and when it happens it for sure may hit a lot of
> > problems that are very hard to find or debug thus we should drop such a design.
> > I suggest using the RFC tag since the next version (if there is one) as I see it is
> > immature in many ways.
> >
> Technical committee audits the required touch points like rest of the industry committees that I participated.
> I disagree to your above point.
> If you do not want to review, that is fine.

I don't want to hold my breath if I see something that is obviously
wrong. Using RFC may help people to know that it is a draft that has
something to be improved before it can be merged.

> We are reviewing with other members and also contributed by them.
>
> > What's more, solving races is much easier if the device functionality is self
> > contained. For example, for a self contained device with the transport as the
> > single interface, we can leverage from transport
> > (PCI) for dealing with races, arbitration, ordering, QOS etc which is probably
> > required in the internal channel between the owner and the member. But all of
> > these were missed in your series and even if you can I'm not sure it's
> > worthwhile to reinvent all of them.
> >
> At the end there is one physical device serving owner and member devices.

This doesn't happen yet. For example, a VF with adminq that can be
isolated with PASID makes some sense.

> So a claim like things are on the VF hence you magically get 200% QoS guarantee is myth.

That's not my point, I'm saying VF could benefit from the e.g QOS
support in PCIE. I'm not saying it's perfect.

>
> Quoting "all of these" is also incorrect.
>
> Things added gradually, first functionally with reasonable performance, followed by notion and extension for QoS.
> By definition of PCI transport for SR-IOV there is internal channel.
>
> It is a reasonably good proposal in its current form.
> The few race conditions that you highlight are extremely rare in nature.

It's not rare since there's no way to know what the guest is doing.
It's actually the critical part for live migration to be correct. You
are proposing migration so it must cover all those cases to make sure
there is no case to make your proposal a dead end.

> Suggestions are welcome to improve.

I have given some and I will give more.

> There were couple of them by Michael too, I am addressing them in the v2.
>
> > For example, for the architecture like owner/member, if the virtio or transport
> > facility could be controlled via device internal channels besides the transport,
> > such a channel may complicate the synchronization a lot.
> Two vendors who actually make the hw sriov devices are authoring these and others are also reviewing.
> So I am more confident that it is solid enough.
> Also, a similar design has been seen with other device for more than a year as GPL integrated with QEMU for a year now and with upstream kernel.
>
> > The device needs to
> > be able to handle or synchronize requests from both PCI and owner in parallel.
> > They are just too many possible races and most of my questions so far come
> > from this viewpoint. I wouldn't go further for other stuff since I believe I've
> > spotted sufficient issues and that's why I must stop at this patch before looking
> > at the rest.
> It is your call to stop or progress.
> I find your reviews useful to improve this proposal, so I will fix them.

My point is to make the theory correct before looking at the others as
I had a lot of questions (as demonstrated in this thread). I think
it's not hard to understand as the rest of the series are based on the
theory.

>
> >
> > Admin commands are fine if it does real administrative jobs such as provisioning
> > since such work is beyond the core virtio functionality.
> >
> > Again, the goal of virtio spec is to have a device with sufficient guidelines that is
> > easy to implement but not leave the vendors to waste their engineering
> > resources in figuring or fuzzing the corner cases.
> I have not seen an industry standard spec or a software that does not have corner cases.

"Corner case" is probably not the accurate term. I meant, for you, it's
probably a corner case, but for me it's kind of obvious.

> The spec proposal is from > 1 device vendors.

That's good but it doesn't mean it doesn't have any (major) issues.
E.g vendors may choose to just implement part of the PCIE capabilities
so they don't do audits for the rest.

>
> I will focus on more practical aspects to progress and improve this spec.
> >
> > >
> > > > > When admin cmd freeze the VF it can expect FLR_completed VF.
> > > >
> > > > We need to explain why and how about the resume? For example, is
> > > > resuming required to wait for the completion of FLR, if not, why?
> >
> > This question is ignored.
> >
> I probably missed. Sorry about it.
> No, the driver does not need to wait for FLR to finish to issue resume command,

Good but I want to know if stop/freeze->active requires to wait for
the completion of FLR. I guess the answer is yes.

> as this typically done on the destination member device which should not be under FLR.
> I will write up the requirements further.
>
> > > > In another thread you are saying that the PCI composition is done by
> > > > hypervisor, so passthrough is really confusing at least for me.
> > > >
> > > I explained there what vPCI composition is done there.
> > > PCI config space and msix side of composition is done.
> > > The whole virtio interface is not composed.
> >
> > You need to describe this somewhere, no? That's what I'm saying.
> >
> Mostly not. What is not done is not written.
>
> > And passthrough is misleading here.
> >
> Passthrough is mentioned in theory of operation.
> It is not present in requirements section.
> So, it is fine.

I suggest documenting or defining the "passthrough" methodology
somewhere. Michael tries to define it in another thread, if it's ok,
let's use that. We can't require people to read VFIO code in order to
know what happens in the virtio spec.

>
> > >
> > > > > Ok. I assume "reset flow" is clear to you now that it points to section 2.4.
> > > > > This section is not normative section, so using an extra word like
> > > > > "flow" does
> > > > not confuse anyone.
> > > > > I will link to the section anyway.
> > > >
> > > > Probably, but you mention FLR flow as well.
> > > As I said, not repeating the PCIe spec here. The reader knows what FLR of the
> > PCIe transport.
> >
> > Ok, I'm not a native speaker, but I really don't know the difference between
> > "FLR" and "FLR flow".
> >
> Lets keep it simple. I will write it as FLR, as pci transport has it as FLR.

Ok.

>
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > > > and may also undergo PCI function level
> > > > > > > > > +reset(FLR) flow.
> > > > > > > >
> > > > > > > > Why is only FLR special here? I've asked FRS but you ignore the
> > question.
> > > > > > > >
> > > > > > > FLR is special to bring clarity that guest owns the VF doing
> > > > > > > FLR, hence
> > > > > > hypervisor cannot mediate any registers of the VF.
> > > > > >
> > > > > > It's not about mediation at all, it's about how the device can
> > > > > > implement what you want here correctly.
> > > > > >
> > > > > > See my above question.
> > > > > >
> > > > > Ok. it is clear that live migration commands cannot stay on the
> > > > > member device
> > > > because the member device can undergo device reset and FLR flows
> > > > owned by the guest.
> > > >
> > > > I disagree, hypervisors can emulate FLR and never send FLR to real devices.
> > > >
> > > That would be some other trap alternative that needs to dissect the device
> > and build infrastructure for such dissection is not desired in the listed use case.
> >
> > Do you need to trap FLR or not? You're saying the hypervisor is in charge of
> > vPCI, how is this differ to what you proposed? If not, how can vPCI be
> > composed?
> >
> Live migration driver do not need to trap FLR.

Maybe I misunderstood your vPCI composition, but it's really helpful
to document how it is expected to be done.

>
> > I believe you need to document how vpci is supposed to be done, since I believe
> > your proposal can only work with such specific types of PCI composition. This is
> > one of the important things that is missed in this series.
> >
> I don’t see a need to describe vpci composition as there may be more than one way to do it.

More than one way for sure, but this contradicts what you say: you
said you don't trap FLR ...

People like me may wonder for example why FLR is mentioned, as FLR can
be trapped and emulated.

Another example, when a device can be saved and restored, the
hypervisor may schedule the device among multiple VMs, in that case,
trapping FLR is a must.

> What I think it is worth to describe is the whole pci device is not stored in device context.
> I will try to add a short description around it.
>
> > >
> > > So your disagreement is fine for non-passthrough devices.
> > >
> > > > > (and hypervisor is not involved in these two flows, hence the
> > > > > admin command
> > > > interface is designed such that it can fullfil above requirements).
> > > > >
> > > > > Theory of operation brings out this clarity. Please notice that it
> > > > > is in
> > > > introductory section with an example.
> > > > > Not normative line.
> > > > >
> > > > > > >
> > > > > > > > > Such flows must comply to the PCI standard and also
> > > > > > > > > +virtio specification;
> > > > > > > >
> > > > > > > > This seems unnecessary and obvious as it applies to all
> > > > > > > > other PCI and virtio functionality.
> > > > > > > >
> > > > > > > Great. But your comment is contradicts.
> > > > > > >
> > > > > > > > What's more, for the things that need to be synchronized, I
> > > > > > > > don't see any descriptions in this patch. And if it doesn't need, why?
> > > > > > > With which operation should it be synchronized and why?
> > > > > > > Can you please be specific?
> > > > > >
> > > > > > See my above question regarding FLR. And it may have others
> > > > > > which I haven't had time to audit.
> > > > > >
> > > > > Ok. when you get chance to audit, lets discuss that time.
> > > >
> > > > Well, I'm not the author of this series, it should be your job
> > > > otherwise it would be too late.
> > > >
> > > As author, what we think, I will cover. If you have specific points to add value,
> > please share, I will look into it.
> >
> > I've pointed out sufficient issues. I have a lot of others but I don't want to have a
> > giant thread once again.
> >
> I see following things to improve in the requirements which I will do in v2.
>
> 1. Document race around FLR and admin commands for really rare corner case.
> 2. Some text around not migrating the pci device registers
> 3. Interaction with PM commands
>
> > >
> > > > For example, how is the power management interaction with the
> > freeze/stop?
> > > >
> > > Power management is owned by the guest, like any other virtio interface.
> > > So freeze/stop do not interfere with it.
> >
> > I don't think this is a good answer. I'm asking how the PM interacts with
> > freeze/stop, you answer it works well.
>
> >
> > I'm not obliged to design hardware for you but figuring out the bad design for
> > virtio. I'm not convinced with a proposal that misses a lot of obvious critical
> > cases and for sure it's not my job to solve them.
> >
> I am not asking you to solve.

My point is that, it's better for you to have some investigation on
the PM instead of me.

>
> > I've demonstrated the possible races with FLR. So did the PM. For example, if VF
> > is in D3cold state, can we still read its device context?
> I think yes, but I will double check.
> > If yes, is it a violation of the PCIE spec? If not, why?

So you are emulating the state instead of a real suspension?

> No, because device context is owned by the owner device and not the VF. SR-PCIM interface has defined it be outside of scope of PCIe spec.
>
> > How about other states? Can the device be freezed
> > in the middle of PM state transitions? If yes, how can it work without migrating
> > PCI states?
> I will double check, but unlikely, it should be similar to FLR case to keep the device to avoid treating it differently.

The reasons why I see it as different from FLR are:

1) D3cold requires the VF to be powered off
2) The state transition might take longer than FLR does; PCI seems to
cover only the minimum delay but not the maximum, which may have
implications for downtime

>
> > Well, I meant we need a more precise definition of each state otherwise it
> > could be ambiguous (as I pointed above).
> Ok. so, few things about read and other messages, I will add.
>
> >
> > >
> > > > >
> > > > > > >
> > > > > > > In "stop" mode, the device wont process descriptors.
> > > > > >
> > > > > > If the device won't process descriptors, why still allow it to
> > > > > > receive
> > > > notifications?
> > > > > Because notification may still arrive and if the device may update
> > > > > any counters as part of
> > > >
> > > > Which counters did you mean here?
> > > >
> > > The counter that Xuan is adding and any other state that device may have to
> > update as result of driver notification.
> > > For example caching the posted avail index in the notification.
> >
> > A link to those proposals?
> [1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00048.html
>

I don't see how this is related to "posted avail index" etc.

> > If the device must depend on those cached features to
> > work it's really fragile. If not, we don't need to care about them.
> It is not dependent.
> It is the infrastructure to enable it.
> Same for other shared memory region accesses.
>
> >
> > >
> > > > > it which needs to be migrated or store the received notification.
> > > > >
> > > > > > Or does it really matter if the device can receive or not here?
> > > > > >
> > > > > From device point of view, the device is given the chance to
> > > > > update its device
> > > > context as part of notifications or access to it.
> > > >
> > > > This is in conflict with what you said above " Device cannot process
> > > > the queue ..."
> > > >
> > > No, it does not.
> > > Device context is updated within the device without accessing the queue
> > memory of the guest.
> >
> > This is not documented or explained anywhere?
> >
> Why should it be explained?
> device is not accessing the guest memory -> this is mentioned in stop mode.

Isn't it hard to see the difference between the following two?

1) In stop mode, device is not accessing guest memory
2) device context is updated without accessing the queue memory of the guest

1) is to define the stop mode, 2) is to define the behaviour of device context

Or are you saying device context can only be fetched after the device
is stopped?
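
To make the distinction concrete, here is a minimal Python sketch of how
I read the three modes as described in this thread. The class and field
names (MemberDevice, cached_avail_idx, guest_reads) are hypothetical and
not taken from the patch, and guest queue memory is modelled as a plain
list.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = 0
    STOP = 1
    FREEZE = 2


class MemberDevice:
    """Toy model; names are illustrative, not taken from the spec."""

    def __init__(self, guest_queue):
        self.mode = Mode.ACTIVE
        self.guest_queue = guest_queue      # stands in for guest queue memory
        self.guest_reads = 0                # how often guest memory was touched
        self.cached_avail_idx = None        # device-internal context

    def notify(self, avail_idx):
        """Driver notification carrying the posted avail index."""
        if self.mode is Mode.FREEZE:
            return False                    # freeze: notification not accepted
        self.cached_avail_idx = avail_idx   # context updated inside the device
        if self.mode is Mode.ACTIVE:
            self.guest_reads += 1           # only active mode reads descriptors
            _ = self.guest_queue[:avail_idx]
        return True


dev = MemberDevice(guest_queue=[10, 11, 12, 13])
dev.notify(2)                  # active: context updated and queue processed
dev.mode = Mode.STOP
dev.notify(3)                  # stop: context updated, guest memory untouched
dev.mode = Mode.FREEZE
accepted = dev.notify(4)       # freeze: dropped, context stays at 3
```

In this sketch 1) is a property of notify() in stop mode, while 2) is a
property of the context update path. They are two separate rules, and I
think both need to be written down.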

> Hence, there is no need to write above.
>
> > >
> > > > Maybe you can give a concrete example.
> > > >
> > > The above one.
> > >
> > > > >
> > > > > > >
> > > > > > > > > + the member device context
> > > > > > > >
> > > > > > > > I don't think we define "device context" anywhere.
> > > > > > > >
> > > > > > > It is defined further in the description.
> > > > > >
> > > > > > Like this?
> > > > > >
> > > > > > """
> > > > > >  +The member device has a device context which the owner driver
> > > > > > can
> > > > > > +either read or write. The member device context consist of any
> > > > > > device  +specific data which is needed by the device to resume
> > > > > > its operation  +when the device mode """
> > > > > >
> > > > > Yes.
> > > > > Further patch-3 adds the device context and also add the link to
> > > > > it in the
> > > > theory of operation section so reader can read more detail about it.
> > > > >
> > > > > > "Any" is probably too hard for vendors to implement. And in
> > > > > > patch 3 I only see virtio device context. Does this mean we
> > > > > > don't need transport
> > > > > > (PCI) context at all? If yes, how can it work?
> > > > > >
> > > > > Right. PCI member device is present at source and destination with
> > > > > its layout,
> > > > only the virtio device context is transferred.
> > > > > Which part cannot work?
> > > >
> > > > It is explained in another thread where you are saying the PCI
> > > > requires mediation. I think any author should not ignore such
> > > > important assumptions in both the change log and the patch.
> > > >
> > > > And again, the more I review the more I see how narrow this series can be
> > used:
> > > >
> > > I explained this before and also covered in the cover letter.
> > >
> > > > 1) Only works for SR-IOV member device like VF
> > > It can be extended to SIOV member device in future.
> > > Today these are the only type of member device virtio has.
> >
> > That is exactly what I want to say, it can only work for the owner/member
> > model. It can't work when the virtio device is not structured like that. And you
> > missed that most of the existing virtio devices are not implemented in this
> > model. It means they can't be migrated with a pure virtio specific extension. For
> > you, SR-IOV is all but this is not true for virtio. PCI is not the only transport and
> > SR-IOV is not the only architecture in PCI.
> >
> Each transport will have its own way to handle it.
> When there is MMIO owner-member relationship arise, one will be able to do so as well.
> In fact other transports will likely miss out as they have not established such pace.
>
> > And I'm pretty sure the owner/member is not the only requirement, there are a
> > lot of other assumptions which are missed in this series.
> >
> One proposal does not do everything.
> It is just impractical.

For other assumptions, I meant:

1) how vpci is composed, if it can be composed as vhost, why do we
need to mention "passthrough"
2) the cap/bar layout, for example if a cap shares BARs with others,
it can't be "passthrough", no?

>
> > >
> > > > 2) Mediate PCI but not virtio which is tricky
> > > > 3) Can only work for a specific BAR/capability register layout
> > > >
> > > > Only 1) is described in the change log.
> > > >
> > > > The other important assumptions like 2) and 3) are not documented
> > anywhere.
> > > > And this patch never explains why 2) and 3) is needed or why it can
> > > > be used for subsystems other than VFIO/Linux.
> > > >
> > > Since I am not mentioning vfio now, I will refrain from mentioning
> > > others as well. :)
> >
> > It's not about VFIO at all. It's about to let people know under which case this
> > proposal could work. Otherwise if a vendor develops a BAR/cap which is not at
> > page boundary. How could you make it work with your proposal here?
> >
> Vendor is a cloud operator which is building the device, so it will always work if it has the matching capabilities on source and destination.

I meant, for example, if common_cfg shares a BAR with others but
doesn't own a page exclusively, you need to trap, no?

>
> > >
> > > > >
> > > > > > >
> > > > > > > > >and device configuration space may change. \\
> > > > > > > > > +\hline
> > > > > > > >
> > > > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > > > >
> > > > > > > All pci devices which belong to a single guest VM are not
> > > > > > > stopped
> > > > atomically.
> > > > > > > Hence, one device which is in freeze mode, may still receive
> > > > > > > driver notifications from other pci device,
> > > > > >
> > > > > > Device may choose to ignore those notifications, no?
> > > > > >
> > > > > > > or it may experience a read from the shared memory and get
> > > > > > > garbage
> > > > data.
> > > > > >
> > > > > > Could you give me an example for this?
> > > > > >
> > > > > Section 2.10 Shared Memory Regions.
> > > >
> > > > How can it experience a read in this case?
> > > >
> > > MMIO read/write can be initiated by the peer device while the device is in
> > stopped state.
> >
> > Ok, but what I want to say is how it can get the garbage data here?
> >
> If the device mode is changed to freeze while it is being read by the peer device, it can get garbage data or last data.
> Which may not be the one that is expected.
> So first all the initiator devices are stopped, ensure that they do not make any requests.
>
> And there are requests, which gets proper answer.

Ok.

>
> > >
> > > > Btw, shared regions are tricky for hardware.
> > > >
> > > > >
> > > > > > > And things can break.
> > > > > > > Hence the stop mode, ensures that all the devices get enough
> > > > > > > chance to stop
> > > > > > themselves, and later when freezed, to not change anything internally.
> > > > > > >
> > > > > > > > > +0x2   & Freeze &
> > > > > > > > > + In this mode, the member device does not accept any
> > > > > > > > > +driver notifications,
> > > > > > > >
> > > > > > > > This is too vague. Is the device allowed to be freezed in
> > > > > > > > the middle of any virtio or PCI operations?
> > > > > > > >
> > > > > > > > For example, in the middle of feature negotiation etc. It
> > > > > > > > may cause implementation specific sub-states which can't be
> > migrated easily.
> > > > > > > >
> > > > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > > > It is passthrough device, hence hypervisor layer do not get to
> > > > > > > see sub-
> > > > state.
> > > > > > >
> > > > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > > > The device context already covers this sub-state.
> > > > > >
> > > > > > 1) driver writes driver_features
> > > > > > 2) driver sets FEATURES_OK
> > > > > >
> > > > > > 3) device receive driver_features
> > > > > > 4) device validating driver_features
> > > > > > 5) device clears FEATURES_OK
> > > > > >
> > > > > > 6) driver reads status and realizes FEATURES_OK has been cleared
> > > > > >
> > > > > > Is it valid to freeze in the middle of the above?
> > > > > No. device mode is frozen when hypervisor is sure that no more
> > > > > access by the
> > > > guest will be done.
> > > >
> > > > How, you don't trap so 1) and 2) are posted, how can hypervisor know
> > > > if there's inflight transactions to any registers?
> > > >
> > > Because hypervisor has stopped the vcpus which are issuing them.
> >
> > MMIO are posted. vCPU is stopped but the transactions are inflight.
> > How could the hypervisor/device know if there's any inflight PCIE transactions
> > here? So I can imagine what happens in fact is the TLP for freezing is ordered
> > with the TLP for posted MMIO. This is probably guaranteed for typical PCIE
> > setup but how about the relaxed ordering?
>
> Vcpus do not generated relaxed ordering MMIOs.
> In pci spec: " If this bit is Set, the Function is permitted to set the Relaxed Ordering bit in
> the Attributes field of transactions it initiates".
>
> Function initiates RO requests, not the vcpu.
> Hence, it is fine.
>

Ok.

> > >
> > > > > What can happen between #2 and #3, is device mode may change to stop.
> > > >
> > > > Why can't be freezed in this case? It's really hard to deduce why it
> > > > can't just from your above descriptions.
> > > >
> > > On the source hypervisor, the mode changes are active->stop->freeze.
> > > Hence when freeze is done, the hypervisor knows that all inflight has been
> > stopped by now.
> >
> > Ok, but how about freezing between 3) and 4). If we allow it, do we need to
> > migrate to this state? If yes, how can it work with your device context? If not,
> > shouldn't we document this?
> >
> May be, some of these are implementation details. I am not sure it belongs to spec.

The point is to make sure that your device context covers this case.
If it can't be covered, it's a design defect.

> Like RSS update while packets are received.. such implementation details are not part of the spec.

This is definitely different, the driver can choose to synchronize or
the end user can tolerate the possible out of order packets in this
case.

This is not the case here, if freezing between 3) and 4) is allowed,
your current device context can't cover this case and guests can't
tolerate such kinds of errors after migration for sure.
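
As a sketch of the concern, assume a hypothetical context layout that
records only driver_features and the device status. The model below
freezes one device right after step 3) and lets another one finish a
successful validation, then compares the two snapshots. All class and
function names are made up for illustration; only the FEATURES_OK bit
value comes from the virtio spec.

```python
from dataclasses import dataclass

FEATURES_OK = 0x8  # device status bit from the virtio spec


@dataclass
class Device:
    offered: int
    driver_features: int = 0
    status: int = 0
    validation_pending: bool = False   # internal sub-state

    def write_features(self, features):        # steps 1) and 3)
        self.driver_features = features

    def set_features_ok(self):                  # step 2)
        self.status |= FEATURES_OK
        self.validation_pending = True

    def validate(self):                         # steps 4) and 5)
        if self.driver_features & ~self.offered:
            self.status &= ~FEATURES_OK
        self.validation_pending = False

    def context(self):
        # Hypothetical context: no record of the pending validation.
        return (self.driver_features, self.status)


def negotiate(offered, wanted, run_validation):
    dev = Device(offered)
    dev.write_features(wanted)
    dev.set_features_ok()
    if run_validation:
        dev.validate()
    return dev


# Frozen between 3) and 4): the device would still have to clear FEATURES_OK.
frozen_early = negotiate(offered=0b011, wanted=0b111, run_validation=False)
# Validation finished and every requested feature was accepted.
accepted = negotiate(offered=0b111, wanted=0b111, run_validation=True)
```

Under this assumption both snapshots are identical, yet only one of the
two devices is done with the negotiation. Whether the real device
context records such a sub-state is exactly what I'm asking.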

>
> > >
> > > > Even if it had, is it even possible to list all the places where
> > > > freezing is prohibited? We don't want to end up with a spec that is
> > > > hard to implement or leave the vendor to figure out those tricky parts.
> > > >
> > > The general idea is not prohibiting the freeze/stop mode.
> > > If the device needs more time, let device take time to do it.
> >
> > Ok, it means:
> >
> > 1) there're conditions from stop to freeze, then what are they?
> No, there isn’t condition.
> May be I didn’t follow the question.

E.g under which condition could the device change the status from
active to stop etc. That's something I keep asking with a concrete
example (e.g FLR).
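
To show what I mean by "condition", here is a small Python sketch. The
transition table and the rule that a mode change is deferred while an
FLR is in progress are my assumptions based on this thread, not
something the patch states; request_mode() is a hypothetical helper.

```python
# Assumed set of legal mode changes (source side: active -> stop -> freeze).
ALLOWED = {
    ("active", "stop"),
    ("stop", "freeze"),
    ("stop", "active"),
    ("freeze", "active"),
}


def request_mode(current, target, flr_in_progress):
    """Return (resulting mode, outcome) for a mode change request."""
    if (current, target) not in ALLOWED:
        return current, "rejected"      # the "if" part: not a legal edge
    if flr_in_progress:
        return current, "deferred"      # the "when" part: wait for the FLR
    return target, "done"


r1 = request_mode("active", "stop", flr_in_progress=False)
r2 = request_mode("active", "freeze", flr_in_progress=False)
r3 = request_mode("active", "stop", flr_in_progress=True)
```

The table answers the "if" question. The flr_in_progress check is the
kind of "when" condition that I think the spec should list explicitly,
for FLR and for any other event with the same effect.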

> > 2) how much time at most? E.g FLR takes at most 100ms.
> From the driver side, it is 100msec for device side it can be less too.
> As soon as FLR is done or enough to record it, is done, stop can continue.
>
> > 3) If it needs more time, can this time satisfy the downtime requirement?
> >
> Guest VM for all practical purposes is not busy in doing FLR, it is a corner case, yet we have to cover it.

Corner case in what sense? A loop in a simple shell script can trigger
this easily.

> And yes, it satisfy the downtime requirements, because VM is already not interested in the packets, it is busy doing the FLR.

Well, it has subtle differences. VM may have more than one interface,
just one of the interfaces is doing FLR.

>
> > >
> > >
> > > > > And in stop mode, device context would capture #5 or #4, depending
> > > > > where is
> > > > device at that point.
> > > > >
> > > > > > >
> > > > > > > > And what's more, the above state machine seems to be virtio
> > > > > > > > specific, but you don't explain the interaction with the
> > > > > > > > device status state
> > > > > > machine.
> > > > > > > First, above is not a state machine.
> > > > > >
> > > > > > So how do readers know if a state can go to another state and when?
> > > > > >
> > > > > Not sure what you mean by reader. Can you please explain.
> > > >
> > > > The people who read virtio spec.
> > > >
> > > So question is "how reader knows if a state can go to another state and
> > when"?
> > > It is described and listed in the table, when a mode can change.
> >
> > It's not only "if" but also "when". Your table partially answers the "if" but not
> > "when". I think you should know now the state transition is conditional. So let's
> > try our best to ease the life of the vendor.
> What do you mean when?
> I do not understand that "mode change is conditional"? it is not based on the condition.
> [..]

See above.

>
> > > > Let's define the synchronization point first. And it demonstrates at
> > > > least devices need to synchronize between the freeze/stop and virtio
> > > > device status machine which is not as easy as what is done in this patch.
> > > >
> > > Synchronization point = device.
> >
> > This is obvious as we can't rule stuff outside virtio, and we are talking about
> > devices not drivers here. But the spec needs sufficient guidance/normative for
> > the vendor to implement. It's more than just saying "device is synchronization
> > point".
> >
> The requirements are already covering what device needs to do.
> Some interaction points are missing, as I acked above, I will add them.
>
> [..]
> > > > Until virtio reset, this is how virtio works now. I've pointed out
> > > > that it may cause extra troubles when trying to resume, but you
> > > > don't tell me what's wrong to keep that?
> > > >
> > > If kept, hypervisor may not be able to decide when to change the mode from
> > active->stop.
> >
> > Why? It is simply done when mgmt requires a migration?
> >
> Mgmt is bit higher level entity. Underneath the software layers may wait until the time is right to migrate.

I don't understand, anyhow the migration request could not be sent to
the device directly without the assistance in hypervisor.

> The fundamental point is, the device context is expected to return the incremental value, that is changed content from last time.
> So once all changed content is read, its empty.

You can't easily define an incremental value for all types of states
or structures:

1) device with complicated states like RAM or other
2) the device state has complicated data structures

>
> > What's more important, PCI allows multiple common_cfgs. So the hypervisor
> > can choose to reserve one common_cfg for live migration. In this case we don't
> > have to read to clear semantics.
> Common_cfg does not serve large device context, nor it serves DMA.

Well, I'd think e.g the address of the descriptor table is part of the
device context, and it can be read from some common_cfg.

>
> >
> > Or, are you saying the value read from common_cfg is not device context?
> The value of common config is part of the device context that represents current common config.
>
> > Isn't this conflict with your vague definition of device context?
> >
> You mentioned you stop at this patch,

Stop means stopping comment.

> so likely you didn’t read device context patch, hence you quote it vague.
> So I don’t know what you mean by vague.

So in this patch you define device context as:

"The member device context consist of any device specific
data which is needed by the device to resume its operation"

So does the address of the descriptor table satisfy this definition? If not, why?

> Please let me know what you additional thing you want to see in device context after you reach that patch.
>
>
> > > We can opt for a mode where full device context is read in each mode
> > without clearing it.
> > > But than it can be very specific to a version of qemu, which we are avoiding it
> > here.
> > >
> > > > > 2. device context returns incremental value from the previous
> > > > > read. So, it
> > > > needs to clear it.
> > > >
> > > > I don't understand here. This is not the case for most of the devices.
> > > >
> > > Not sure which devices you mean here with "most of the devices".
> > > Device context functions like a write record pages (aka dirty pages).
> >
> > It's definitely different. We want to migrate dirty pages lively which can
> > consume a lot of bandwidth. So reporting delta makes a lot of sense here since
> > it would have a lot of rounds of syncing and it doesn't result in blockers
> > resuming.
> >
> Write records are reported as delta from the previous read.
>
> > For device context, how many rounds of syncing did you expect, and if we have
> > N rounds, we need to restore N rounds in order to resume? Do you want to live
> > migrating device states? If it's only 1 or 2 rounds, why bother?
> >
> Live migrate the device context. Typically in current software using it, it is 2 rounds.

If it's just 2 rounds, why bother with deltas? They are only helpful if we
want to live migrate some device with giant states over several
rounds, and in that case we should leave it as a device specific state.
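
For reference, the two-round flow as I understand it can be sketched in
Python as follows. The dictionary representation and the field names
(queue_desc_addr, avail_idx) are assumptions for illustration only; the
patch does not define the context this way.

```python
class DeviceContext:
    """Toy read-to-clear context: a read returns fields changed since the last read."""

    def __init__(self):
        self._fields = {}
        self._dirty = set()

    def update(self, name, value):
        self._fields[name] = value
        self._dirty.add(name)

    def read_delta(self):
        delta = {name: self._fields[name] for name in sorted(self._dirty)}
        self._dirty.clear()
        return delta


def restore(rounds):
    """The destination must replay every round, in order."""
    state = {}
    for delta in rounds:
        state.update(delta)
    return state


src = DeviceContext()
src.update("queue_desc_addr", 0x1000)
src.update("avail_idx", 5)
round1 = src.read_delta()          # read in active/stop mode
src.update("avail_idx", 9)
round2 = src.read_delta()          # read in freeze mode
round3 = src.read_delta()          # nothing changed since round 2
```

Note that the destination can only resume after replaying both rounds;
round 2 alone is not enough. With two rounds, the second read could
return the full context just as well.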

> The interface is generic that if needed more rounds are possible.
>
> Even device for most practical purpose will implement 2 rounds.
>
> > And for the delta, how do you know you can easily define deltas for every type
> > of device, especially the ones with complicated internal states? Defining states
> > has already been demonstrated as a complicated task for some devices like
> > virtio-FS and you want to complicate it furtherly?
> >
> What is your question? If you say virtio-fs is complicated state, may be it should not have existed itself in the virtio spec as first place.

We have just more than FS that can't work for live migration. Crypto
and GPU are two other examples, and I'm pretty sure we have more.

Until we figure out how they can, we can't say a device context work
for all types. No?

1) Trying to define a format that works for all types of devices
2) Leave the states to be defined by individual device types

Which method is easier?

> But I differ to think that.
> Virtio-fs guest side state wont be changed as part of it.
> Virtio-fs is the first device which has considered and listed to migrate the device state.
> So it should be possible.

I won't repeat the discussion of virtio-FS migration here, you can
search the archives for more details.

But the point is obvious, it's really hard to say a simple device
context can work for all types of devices. We should allow a device
specific state definition. This seems to be agreed by Michael and
LingShan.

>
> > What is proposed in this series is an ad-hoc optimization for a specific device
> > type within a specific subsystem (e.g VFIO) in a specific operating system which
> > is not general.
> >
> Oh now you mention vfio. Not me. :)
>
> I am not going to comment on this. It is not ad-hoc.

You need to justify how it is not. Based on the current discussion,
you have demonstrated a lot of assumptions in order to make your
proposal work.

> It uses similar dirty page tracking like technique present in cpu hw and other devices.
>
> > As demonstrated many times, starting from something simple and stupid is the
> > most easy way.
> >
>
> > > Whatever is already returned is/should not be repeated in subsequent reads,
> > though device can choose to do so.
> > >
> > > > >
> > > > > > > And which software stack may find this useful?
> > > > > > > Is there any existing software that can utilize it?
> > > > > >
> > > > > > Libvirt.
> > > > > >
> > > > > Does libvirt restore on migration failure?
> > > >
> > > > Yes.
> > > >
> > > Ok. the device will be able to resume when it is marked active.
> > > The device context returned  is the incremental delta as explained above.
> >
> > I disagree, see my above reply.
> I replied above.
>
> >
> > >
> > > > >
> > > > > > > Why that device context present with the software vanished, in
> > > > > > > your
> > > > > > assumption, if it is?
> > > > > > >
> > > > > > > > > Typically, on
> > > > > > > > > +the source hypervisor, the owner driver reads the device
> > > > > > > > > +context once when the device is in \field{Active} or
> > > > > > > > > +\field{Stop} mode and later once the member device is in
> > > > \field{Freeze} mode.
> > > > > > > >
> > > > > > > > Why need the read while device context could be changed? Or
> > > > > > > > is the dirty page part of the device context?
> > > > > > > >
> > > > > > > It is not part of the dirty page.
> > > > > > > It needs to read in the active/stop mode, so that it can be
> > > > > > > shared with
> > > > > > destination hypervisor, which will pre-setup the complex context
> > > > > > of the device, while it is still running on the source side.
> > > > > >
> > > > > > Is such a method used by any hypervisor?
> > > > > Yes. qemu which uses vfio interface uses it.
> > > >
> > > > Ok, such software technology could be used for all types of devices,
> > > > I don't see any advantages to mention it here unless it's unique to virtio.
> > > >
> > > It is theory of operation that brings the clarity and rationale.
> >
> > I think it's not. Since it's not something that is unique to virtio.
> >
> > > So I will keep it.
> > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > > +
> > > > > > > > > +Typically, the device context is read and written one
> > > > > > > > > +time on the source and the destination hypervisor
> > > > > > > > > +respectively once the device is in \field{Freeze} mode.
> > > > > > > > > +On the destination hypervisor, after writing the device
> > > > > > > > > +context, when the device mode set to \field{Active}, the
> > > > > > > > > +device uses the most recently set device context and
> > > > > > > > > +resumes the device
> > > > > > > > operation.
> > > > > > > >
> > > > > > > > There's no context sequence, so this is obvious. It's the
> > > > > > > > semantic of all other existing interfaces.
> > > > > > > >
> > > > > > > Can you please what which existing interfaces do you mean here?
> > > > > >
> > > > > > For any common cfg member. E.g queue_addr.
> > > > > >
> > > > > > The driver wrote 100 different values to queue_addr and the
> > > > > > device used the value written last time.
> > > > > >
> > > > > o.k. I don’t see any problem in stating what is done, which is
> > > > > less vague. 😊
> > > > >
> > > > > > >
> > > > > > > > > +
> > > > > > > > > +In an alternative flow, on the source hypervisor the
> > > > > > > > > +owner driver may choose to read the device context first
> > > > > > > > > +time while the device is in \field{Active} mode and
> > > > > > > > > +second time once the device is in \field{Freeze}
> > > > > > > > mode.
> > > > > > > >
> > > > > > > > Who is going to synchronize the device context with possible
> > > > > > > > configuration from the driver?
> > > > > > > >
> > > > > > > Not sure I understand the question.
> > > > > > > If I understand you right, do you mean that, When
> > > > > > > configuration change is done by the guest driver, how does device
> > context change?
> > > > > > >
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > > If so, device context reading will reflect the new configuration.
> > > > > >
> > > > > > How do you do that? For example:
> > > > > >
> > > > > > static inline void vp_iowrite64_twopart(u64 val,
> > > > > >                                         __le32 __iomem *lo,
> > > > > >                                         __le32 __iomem *hi)
> > > > > > {
> > > > > >         vp_iowrite32((u32)val, lo);
> > > > > >         vp_iowrite32(val >> 32, hi);
> > > > > > }
> > > > > >
> > > > > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > > > > >
> > > > > Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> > > > section captures the partial value.
> > > >
> > > > There's no way for the device to know whether or not it's a partial value or
> > not.
> > > > No?
> > > >
> > > Device does not need to know, because when the guest vm and the device is
> > resumed on the destination, it the guest vm will continue with writing the 2nd
> > part.
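
The exchange above can be illustrated with a small, self-contained C
simulation. It is only a sketch under stated assumptions: struct
common_cfg_ctx, capture_ctx(), restore_ctx() and migrate_mid_write() are
hypothetical names, not part of the virtio spec, the patch series, or the
Linux driver. It models a freeze that lands between the two 32-bit halves
of a 64-bit write, assuming the captured context simply records whatever
the registers hold and that the guest resumes at the second write.

```c
#include <stdint.h>

/* Hypothetical model of one 64-bit common-config field (for example a
 * queue address) that the driver programs as two 32-bit halves. */
struct common_cfg_ctx {
    uint32_t lo;
    uint32_t hi;
};

/* Each helper models one 32-bit register write reaching the device. */
static void write_lo(struct common_cfg_ctx *dev, uint64_t val)
{
    dev->lo = (uint32_t)val;
}

static void write_hi(struct common_cfg_ctx *dev, uint64_t val)
{
    dev->hi = (uint32_t)(val >> 32);
}

/* Hypothetical context read on the source: the snapshot records the
 * register contents as they are, complete or not. */
static struct common_cfg_ctx capture_ctx(const struct common_cfg_ctx *src)
{
    return *src;
}

/* Hypothetical context write on the destination. */
static void restore_ctx(struct common_cfg_ctx *dst, struct common_cfg_ctx snap)
{
    *dst = snap;
}

static uint64_t field_value(const struct common_cfg_ctx *dev)
{
    return ((uint64_t)dev->hi << 32) | dev->lo;
}

/* The freeze lands between the two halves; the guest finishes the write
 * on the destination. Returns the value the destination ends up with. */
static uint64_t migrate_mid_write(uint64_t old_val, uint64_t new_val)
{
    struct common_cfg_ctx src = { (uint32_t)old_val, (uint32_t)(old_val >> 32) };
    struct common_cfg_ctx dst = { 0, 0 };

    write_lo(&src, new_val);              /* first half written on the source */
    restore_ctx(&dst, capture_ctx(&src)); /* freeze, transfer, resume */
    write_hi(&dst, new_val);              /* guest continues with second half */
    return field_value(&dst);
}
```

In this model the destination ends up with the full new value even though
the snapshot held a mixed one. The sketch does not address the concern
raised above that the device cannot tell a partial value from a complete
one; it only shows the case where nothing consumes the field while the
device is frozen.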
> > >
> > > > >
> > > > > > >
> > > > > > > > > Similarly, on the
> > > > > > > > > +destination hypervisor writes the device context first
> > > > > > > > > +time while the device is still running in \field{Active}
> > > > > > > > > +mode on the source hypervisor and writes the device
> > > > > > > > > +context second time while the device is in
> > > > > > > > \field{Freeze} mode.
> > > > > > > > > +This flow may result in very short setup time as the
> > > > > > > > > +device context likely have minimal changes from the
> > > > > > > > > +previously written device
> > > > > > context.
> > > > > > > >
> > > > > > > > Is the hypervisor who is in charge of doing the comparison
> > > > > > > > and writing only the delta?
> > > > > > > >
> > > > > > > The spec commands allow to do so. So possibility exists from spec
> > wise.
> > > > > >
> > > > > > There are various optimizations for migration for sure, I don't
> > > > > > think mentioning any specific one is good.
> > > > > >
> > > > > The text is informative text similar to,
> > > > >
> > > > > " However, some devices benefit from the ability to find out the
> > > > > amount of available data in the queue without accessing the
> > > > > virtqueue in
> > > > memory"
> > > > >
> > > > > " To help with these optimizations, when
> > > > > VIRTIO_F_NOTIFICATION_DATA has
> > > > been negotiated".
> > > > >
> > > > > Is this the only optimization in virtio? No, but we still mention
> > > > > the rationale of
> > > > why it exists.
> > > >
> > > > The above is a good example as it explain VIRTIO_F_NOTIFICATION_DATA
> > > > is the only way without accessing the virtqueue. But this is not the case of
> > migration.
> > > > You said it's just a possibility but not a must which is not the
> > > > case for VIRTIO_F_NOTIFICATION_DATA.
> > > >
> > > It is one of the optimization apart. The comparison is of one_of_example or
> > not.
> >
> > I don't get this.
> Theory of operation is describing a flow how things are done and how the constructs are helpful to achieve it.

Immature optimization doesn't belong in the theory of operation, for sure.
I see your delta reporting as immature in many ways. That's the point.

Thanks

> And it is not the end of the list.
> That does not mean one should not write those.



This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13 11:26                         ` Michael S. Tsirkin
  2023-10-13 11:41                           ` Parav Pandit
@ 2023-10-17  1:42                           ` Jason Wang
  1 sibling, 0 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-17  1:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Oct 13, 2023 at 7:26 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Oct 13, 2023 at 09:16:43AM +0800, Jason Wang wrote:
> > On Thu, Oct 12, 2023 at 6:58 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > Sent: Thursday, October 12, 2023 3:51 PM
> > > >
> > > > On 10/11/2023 7:43 PM, Parav Pandit wrote:
> > > > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > >> Sent: Wednesday, October 11, 2023 3:55 PM
> > > > >>>>>>> I don’t have any strong opinion to keep it or remove it as most
> > > > >>>>>>> stakeholders
> > > > >>>>>> has the clear view of requirements now.
> > > > >>>>>>> Let me know.
> > > > >>>>>> So some people use VFs with VFIO. Hence the module name.  This
> > > > >>>>>> sentence by itself seems to have zero value for the spec. Just drop it.
> > > > >>>>> Ok. Will drop.
> > > > >>>> So why not build your admin vq live migration on our config space
> > > > >>>> solution, get out of the troubles, to make your life easier?
> > > > >>>>
> > > > >>> Your this question is completely unrelated to this reply or you
> > > > >>> misunderstood
> > > > >> what dropping commit log means.
> > > > >> if you can rebase admin vq LM on our basic facilities, I think you
> > > > >> dont need to talk about vfio in the first place, so I ask you to re-consider
> > > > Jason's proposal.
> > > > > I don’t really know why you are upset with the vfio term.
> > > > > It is the use case of the cloud operator and it is listed to indicate how proposal
> > > > fits in a such use case.
> > > > > If for some reason, you don’t like vfio, fine. Ignore it and move on.
> > > > >
> > > > > I already answered that I will remove from the commit log, because the
> > > > requirements are well understood now by the committee.
> > > > >
> > > > > Your comment is again unrelated (repeated) to your past two questions.
> > > > >
> > > > > I explained you the technical problem that admin command (not admin VQ)
> > > > of basic facilities cannot be done using config registers without any mediation
> > > > layer.
> > > > OK, I pop-ed Jason's proposal to make everything easier, and I see it is refused.
> > > Because it does not work for passthrough mode.
> >
> > How and why? What's wrong with just passing through the newly
> > introduced 2 or 3 registers to guests?
> >
> > This is the question you never answer even if I keep asking.
>
> It is, fundamentally, a question of supporting as many architectures
> as we can as opposed to being opinionated.
>
> On the one end of the spectrum, device is completely under guest control
> and anything external has to trap to hypervisor.
> None of existing implementations are there, at least pci config space
> is typically under hypervisor control.
> What Parav calls "passthrough" is built I think along these lines:
> memory and interrupts go straight to guest, config space
> is trapped and emulated.
> On the other side of the spectrum is trapping everything in hypervisor.
> Your "2 to 3 registers" is also not there, but is I think closer to that end
> of the arc.

Those simple registers could be used with both trapping and
"passthrough". Depending on the viewpoint, they could be treated as a
simple extension of the existing common cfg, and they can be used beyond
just migration. Nothing prevents those registers from coexisting with
things like admin virtqueue or commands.
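
The two ends of the spectrum described in the quoted text can be written
down as a tiny policy function. This is a sketch only: the enum names and
hypervisor_traps() are hypothetical and do not come from any hypervisor or
from the spec. It encodes the quoted description that in the "passthrough"
model memory and interrupts go straight to the guest while PCI config
space is trapped and emulated, and that at the other end everything is
trapped.

```c
#include <stdbool.h>

/* Hypothetical classification of what a guest access can touch. */
enum region {
    REGION_PCI_CONFIG,   /* PCI configuration space */
    REGION_COMMON_CFG,   /* virtio common cfg, located in BAR memory */
    REGION_VQ_MEMORY,    /* virtqueue memory */
    REGION_INTERRUPT     /* interrupt delivery */
};

/* The two ends of the spectrum discussed in the thread. */
enum model {
    MODEL_PASSTHROUGH,   /* only PCI config space is trapped */
    MODEL_FULL_TRAP      /* every access goes through the hypervisor */
};

/* Returns true when the hypervisor intercepts the access. */
static bool hypervisor_traps(enum model m, enum region r)
{
    if (m == MODEL_FULL_TRAP)
        return true;
    return r == REGION_PCI_CONFIG;
}
```

Under this model, new registers placed in the common cfg are seen by the
hypervisor only in the full-trap case, which is where the disagreement in
this thread starts.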

>
> Any new feature should ideally be a building block supporting as many
> approaches as possible. Fundamentally that requires a level of
> indirection, as usual :)

Exactly. So a transport/interface independent section in the basic
facility makes a lot of sense.

Thanks


> Having two completely distinct interfaces for
> that straight off the bat?  Gimme a break.
> --
> MST
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-13  6:36                         ` Parav Pandit
@ 2023-10-17  1:53                           ` Jason Wang
  2023-10-17  2:02                             ` Jason Wang
  2023-10-17  3:26                             ` Parav Pandit
  0 siblings, 2 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-17  1:53 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Oct 13, 2023 at 2:36 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Friday, October 13, 2023 6:47 AM
> >
> > On Thu, Oct 12, 2023 at 6:58 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > Sent: Thursday, October 12, 2023 3:51 PM
> > > >
> > > > On 10/11/2023 7:43 PM, Parav Pandit wrote:
> > > > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > >> Sent: Wednesday, October 11, 2023 3:55 PM
> > > > >>>>>>> I don’t have any strong opinion to keep it or remove it as
> > > > >>>>>>> most stakeholders
> > > > >>>>>> has the clear view of requirements now.
> > > > >>>>>>> Let me know.
> > > > >>>>>> So some people use VFs with VFIO. Hence the module name.
> > > > >>>>>> This sentence by itself seems to have zero value for the spec. Just
> > drop it.
> > > > >>>>> Ok. Will drop.
> > > > >>>> So why not build your admin vq live migration on our config
> > > > >>>> space solution, get out of the troubles, to make your life easier?
> > > > >>>>
> > > > >>> Your this question is completely unrelated to this reply or you
> > > > >>> misunderstood
> > > > >> what dropping commit log means.
> > > > >> if you can rebase admin vq LM on our basic facilities, I think
> > > > >> you dont need to talk about vfio in the first place, so I ask you
> > > > >> to re-consider
> > > > Jason's proposal.
> > > > > I don’t really know why you are upset with the vfio term.
> > > > > It is the use case of the cloud operator and it is listed to
> > > > > indicate how proposal
> > > > fits in a such use case.
> > > > > If for some reason, you don’t like vfio, fine. Ignore it and move on.
> > > > >
> > > > > I already answered that I will remove from the commit log, because
> > > > > the
> > > > requirements are well understood now by the committee.
> > > > >
> > > > > Your comment is again unrelated (repeated) to your past two questions.
> > > > >
> > > > > I explained you the technical problem that admin command (not
> > > > > admin VQ)
> > > > of basic facilities cannot be done using config registers without
> > > > any mediation layer.
> > > > OK, I pop-ed Jason's proposal to make everything easier, and I see it is
> > refused.
> > > Because it does not work for passthrough mode.
> >
> > How and why? What's wrong with just passing through the newly introduced 2
> > or 3 registers to guests?
> >
> If passed to the guest who is not involved in the live migration flow, cannot operate the device.
> VF = member device = controlled function
> PF = owner device = controlling function
> Device migration commands from the hypervisor are not forwarded inside the guest.

I don't see why.

For example, can you explain how a virtio reset in guests can survive
with your proposal but not a virtio suspend?

>
> > This is the question you never answer even if I keep asking.
> >
>
> > > Sure, as I explained the config register method do not work for passthrough
> > mode, and does not scale.
> >
> > We need to make sure your migration proposal can work for 1 VF which is still
> > questionable then we can talk about others like scaling. No?
> Sure but in making sure that, the interface is built so that it can work for N VFs too.
>
> >
> > And most of your concern regarding scalability seems more like a limitation of a
> > transport. Let's not mix the scalability for a specific transport with the one for
> > core virtio devices.
> Virtio device is for the defined transport.
> So it needs to work for the defined transport.
> Therefore, scalability cannot be ignored.

I'm not convinced that the scalability is broken by just having 2 or 3
more registers. We all know MSIX requires much more than this.

You can't solve all the issues in one series. As stated many times, if
you really care about the scalability, the correct way is to behave
like a real transport through virtqueue instead of trying to duplicate
the functionality of transport virtqueue slowly.

>
> It is not a question anymore as for any bulk transfer virtqueue is the specification choice for obvious technical gains.

See above.

Thanks

>
> >
> > >
> > > > >
> > > > >> inflight descriptor tracking will be implemented by Eugenio in V2.
> > > > > When we have near complete proposal from two device vendors, you
> > > > > want to push something to unknown future without reviewing the
> > > > > work; does not
> > > > make sense.
> > > > Didn't I ever provide feedback to you? Really?
> > > No. I didn’t see why you need to post a new patch for dirty page tracking,
> > when it is already present in this series.
> >
> > You know there are various ways to do dirty paging? For example, the well
> > known bitmap and its variants. I think we've discussed several times in many
> > places in the past. I don't see where you explain why you choose one of them
> > but not the others but you want to forbid other types of dirty page logging?
> > Why?
> The well known bitmap simply do not work for the pci transport in atomic way, effectively.
> You tend to derive many conclusions to oppose the work frankly. :)
> Other types of dirty page logging is not forbidden.
>
> I don’t see the need to explain every single word why a given scheme is chosen in the spec language.
> There is line drawn to avoid writing a book and a spec.
>
> > > Secondly I don’t see how one can read 1M flows using config registers.
> >
> > Why can't we trap them? vIOMMU even migrates internal translation tables.
>
> Because they are added and removed over the virtqueues almost as data path operations.
> Virtio queues are not mediated/trapped when the native device is virtio member device itself.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-17  1:53                           ` Jason Wang
@ 2023-10-17  2:02                             ` Jason Wang
  2023-10-17  3:19                               ` Parav Pandit
  2023-10-17  3:26                             ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-17  2:02 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Oct 17, 2023 at 9:53 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Oct 13, 2023 at 2:36 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Friday, October 13, 2023 6:47 AM
> > >
> > > On Thu, Oct 12, 2023 at 6:58 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Thursday, October 12, 2023 3:51 PM
> > > > >
> > > > > On 10/11/2023 7:43 PM, Parav Pandit wrote:
> > > > > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > >> Sent: Wednesday, October 11, 2023 3:55 PM
> > > > > >>>>>>> I don’t have any strong opinion to keep it or remove it as
> > > > > >>>>>>> most stakeholders
> > > > > >>>>>> has the clear view of requirements now.
> > > > > >>>>>>> Let me know.
> > > > > >>>>>> So some people use VFs with VFIO. Hence the module name.
> > > > > >>>>>> This sentence by itself seems to have zero value for the spec. Just
> > > drop it.
> > > > > >>>>> Ok. Will drop.
> > > > > >>>> So why not build your admin vq live migration on our config
> > > > > >>>> space solution, get out of the troubles, to make your life easier?
> > > > > >>>>
> > > > > >>> Your this question is completely unrelated to this reply or you
> > > > > >>> misunderstood
> > > > > >> what dropping commit log means.
> > > > > >> if you can rebase admin vq LM on our basic facilities, I think
> > > > > >> you dont need to talk about vfio in the first place, so I ask you
> > > > > >> to re-consider
> > > > > Jason's proposal.
> > > > > > I don’t really know why you are upset with the vfio term.
> > > > > > It is the use case of the cloud operator and it is listed to
> > > > > > indicate how proposal
> > > > > fits in a such use case.
> > > > > > If for some reason, you don’t like vfio, fine. Ignore it and move on.
> > > > > >
> > > > > > I already answered that I will remove from the commit log, because
> > > > > > the
> > > > > requirements are well understood now by the committee.
> > > > > >
> > > > > > Your comment is again unrelated (repeated) to your past two questions.
> > > > > >
> > > > > > I explained you the technical problem that admin command (not
> > > > > > admin VQ)
> > > > > of basic facilities cannot be done using config registers without
> > > > > any mediation layer.
> > > > > OK, I pop-ed Jason's proposal to make everything easier, and I see it is
> > > refused.
> > > > Because it does not work for passthrough mode.
> > >
> > > How and why? What's wrong with just passing through the newly introduced 2
> > > or 3 registers to guests?
> > >
> > If passed to the guest who is not involved in the live migration flow, cannot operate the device.
> > VF = member device = controlled function
> > PF = owner device = controlling function
> > Device migration commands from the hypervisor are not forwarded inside the guest.
>
> I don't see why.
>
> For example, can you explain how a virtio reset in guests can survive
> with your proposal but not a virtio suspend?
>
> >
> > > This is the question you never answer even if I keep asking.
> > >
> >
> > > > Sure, as I explained the config register method do not work for passthrough
> > > mode, and does not scale.
> > >
> > > We need to make sure your migration proposal can work for 1 VF which is still
> > > questionable then we can talk about others like scaling. No?
> > Sure but in making sure that, the interface is built so that it can work for N VFs too.
> >
> > >
> > > And most of your concern regarding scalability seems more like a limitation of a
> > > transport. Let's not mix the scalability for a specific transport with the one for
> > > core virtio devices.
> > Virtio device is for the defined transport.
> > So it needs to work for the defined transport.
> > Therefore, scalability cannot be ignored.
>
> I'm not convinced that the scalability is broken by just having 2 or 3
> more registers. We all know MSIX requires much more than this.
>
> You can't solve all the issues in one series. As stated many times, if
> you really care about the scalability, the correct way is to behave
> like a real transport through virtqueue instead of trying to duplicate
> the functionality of transport virtqueue slowly.
>
> >
> > It is not a question anymore as for any bulk transfer virtqueue is the specification choice for obvious technical gains.
>
> See above.
>
> Thanks
>
> >
> > >
> > > >
> > > > > >
> > > > > >> inflight descriptor tracking will be implemented by Eugenio in V2.
> > > > > > When we have near complete proposal from two device vendors, you
> > > > > > want to push something to unknown future without reviewing the
> > > > > > work; does not
> > > > > make sense.
> > > > > Didn't I ever provide feedback to you? Really?
> > > > No. I didn’t see why you need to post a new patch for dirty page tracking,
> > > when it is already present in this series.
> > >
> > > You know there are various ways to do dirty paging? For example, the well
> > > known bitmap and its variants. I think we've discussed several times in many
> > > places in the past. I don't see where you explain why you choose one of them
> > > but not the others but you want to forbid other types of dirty page logging?
> > > Why?
> > The well known bitmap simply do not work for the pci transport in atomic way, effectively.
> > You tend to derive many conclusions to oppose the work frankly. :)
> > Other types of dirty page logging is not forbidden.
> >
> > I don’t see the need to explain every single word why a given scheme is chosen in the spec language.
> > There is line drawn to avoid writing a book and a spec.
> >
> > > > Secondly I don’t see how one can read 1M flows using config registers.
> > >
> > > Why can't we trap them? vIOMMU even migrates internal translation tables.
> >
> > Because they are added and removed over the virtqueues almost as data path operations.

I don't understand, we trap RSS commands already.

Thanks

> > Virtio queues are not mediated/trapped when the native device is virtio member device itself.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-13  6:40                           ` Parav Pandit
@ 2023-10-17  2:10                             ` Jason Wang
  2023-10-17  3:45                               ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-17  2:10 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Oct 13, 2023 at 2:40 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Friday, October 13, 2023 6:48 AM
> >
> > On Thu, Oct 12, 2023 at 7:37 PM Parav Pandit <parav@nvidia.com> wrote:>
> > > As Michael said, software based nesting is used..
> >
> > I've pointed out in another thread when hardware has less abstraction level
> > than nesting, trap/emulation is a must.
> >
> > > See if actual hw based devices can implement it or not. Many components of
> > cpu cannot do N level nesting either, but may be virtio can.
> > > I don’t know how yet.
> >
> > I would not repeat the lessons given by Gerald J. Popek and Robert P.
> > Goldberg[1] in 1976, but I think you miss a lot of fundamental things in the
> > methodology of virtualization.
> Weekend is coming. I will read it.
>
> > For example, nesting is a very important criteria
> > to examine whether an architecture is well designed for virtualization.
> >
>
> In my reading of a leading OS vendor documentation, I learnt that OS vendor do not recommend nested virtualization for production at [1].
> Snippet:
> "In addition, Red Hat does not recommend using nested virtualization in production user environments, due to various limitations in functionality. Instead, nested virtualization is primarily intended for development and testing scenarios."
>
> [1] https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_virtualization/creating-nested-virtual-machines_configuring-and-managing-virtualization
>
> 2nd leading hypervisor listed nested virtualization to be not used for "performance sensitive applications".

Another concept shift.

I'm not going to comment on the choices of individual distros. The
point is whether we can deploy nested virtualization easily on
a specific hardware architecture. In this regard, the above is a good
example.

Again, a simple Google search will tell you that instances supporting
nesting have been available from almost all the major cloud vendors for
a while.

>
> I want to repeat and emphasize that I am not ignoring the nested case.
>
> An extension for nesting would be the VF presented to the guest itself with SR-IOV capability can work as_is as proposed here.

How can a VF have the SR-IOV capability?

> Michael presented the idea of the dummy PF, which is to represent the VF as dummy PF which can do the SR-IOV with one VF.

Why do we need the complicated SR-IOV emulation at the nesting level?
How can you make sure such a design allows live migration to
be done at any level?

E.g. in LN, you have a PF and a VF. How do you migrate the PF to this level?
Do you want two PFs in the L(N-1) level?
> You need the support from the platform too, I guess TC can extend it.
> May be a different interface more suitable for nested case which do not have performance needs.

I disagree; it's about whether the performance can satisfy the
requirement at level N.

>
> How about a nested user to have AQ located on the VF so that mediation sw can operate admin commands over self?

I would go with such complicated architecture.

> Device mode commands will not be applicable there, instead some other things to be done.
> So non passthrough mode software possibly can make use of it?

It would be a great burden if you

1) use passthrough in L0
2) use trap/emulation in L(N+1)

>
> > That is to say for any CPU/hypervisor vendors, the architecture should be
> > designed to run any levels of nesting instead of just an awkward 2 levels (but
> > what you proposed can not work for even 2).
> Huh, some missing text for corner case as making claim, _not_working in not a healthy discussion.
>
> > For x86 and KVM, any level of
> > nesting has been done for about 10 years ago.
> >
> I didn’t find hw for PML support in x86 for N or 3 level nesting. Did I miss?
> I didn’t find hw for nested page tables upto N level walking on the PCIe read/writes in any cpu. Did I miss?

You first need to ask why that is a must for achieving nested
virtualization. All of those obstacles arise only if you want to use
"passthrough" at every level.

> Have you seen nesting in hw works at N level?

Again, hardware can't have endless resources for endless levels. Trap
and emulation is a must for achieving nested virtualization. If you
try to invent a passthrough method that can work for any level, you
will probably fail.

Thanks

>
> > For virtio, it can do any level. So did for vhost/vDPA. For example, I usually
> > develop and test virtio/vDPA/vhost in a nesting environment.
> >
> Great.
> Can you share the performance test results relative number with 2 and 3 level nesting covering the cpu utilization, latency?
>
> > Thanks
> >
> > [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-17  2:02                             ` Jason Wang
@ 2023-10-17  3:19                               ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-17  3:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Jason Wang
> Sent: Tuesday, October 17, 2023 7:33 AM

> I don't understand, we trap RSS commands already.
The definition of "we" is subjective depending on which stack you see. So let's stay away from that. :)

For passthrough member devices, RSS commands are not trapped.
And there is no desire to trap more commands for different types of vqs.
And the user does not want to keep adding new types of software to the hypervisor for trapping each new device type.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-17  1:53                           ` Jason Wang
  2023-10-17  2:02                             ` Jason Wang
@ 2023-10-17  3:26                             ` Parav Pandit
  2023-10-18  0:52                               ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-17  3:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 17, 2023 7:23 AM
> 
> On Fri, Oct 13, 2023 at 2:36 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Friday, October 13, 2023 6:47 AM
> > >
> > > On Thu, Oct 12, 2023 at 6:58 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Thursday, October 12, 2023 3:51 PM
> > > > >
> > > > > On 10/11/2023 7:43 PM, Parav Pandit wrote:
> > > > > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > >> Sent: Wednesday, October 11, 2023 3:55 PM
> > > > > >>>>>>> I don’t have any strong opinion to keep it or remove it
> > > > > >>>>>>> as most stakeholders
> > > > > >>>>>> has the clear view of requirements now.
> > > > > >>>>>>> Let me know.
> > > > > >>>>>> So some people use VFs with VFIO. Hence the module name.
> > > > > >>>>>> This sentence by itself seems to have zero value for the
> > > > > >>>>>> spec. Just
> > > drop it.
> > > > > >>>>> Ok. Will drop.
> > > > > >>>> So why not build your admin vq live migration on our config
> > > > > >>>> space solution, get out of the troubles, to make your life easier?
> > > > > >>>>
> > > > > >>> Your this question is completely unrelated to this reply or
> > > > > >>> you misunderstood
> > > > > >> what dropping commit log means.
> > > > > >> if you can rebase admin vq LM on our basic facilities, I
> > > > > >> think you dont need to talk about vfio in the first place, so
> > > > > >> I ask you to re-consider
> > > > > Jason's proposal.
> > > > > > I don’t really know why you are upset with the vfio term.
> > > > > > It is the use case of the cloud operator and it is listed to
> > > > > > indicate how proposal
> > > > > fits in a such use case.
> > > > > > If for some reason, you don’t like vfio, fine. Ignore it and move on.
> > > > > >
> > > > > > I already answered that I will remove from the commit log,
> > > > > > because the
> > > > > requirements are well understood now by the committee.
> > > > > >
> > > > > > Your comment is again unrelated (repeated) to your past two
> questions.
> > > > > >
> > > > > > I explained you the technical problem that admin command (not
> > > > > > admin VQ)
> > > > > of basic facilities cannot be done using config registers
> > > > > without any mediation layer.
> > > > > OK, I pop-ed Jason's proposal to make everything easier, and I
> > > > > see it is
> > > refused.
> > > > Because it does not work for passthrough mode.
> > >
> > > How and why? What's wrong with just passing through the newly
> > > introduced 2 or 3 registers to guests?
> > >
> > If passed to the guest who is not involved in the live migration flow, cannot
> operate the device.
> > VF = member device = controlled function PF = owner device =
> > controlling function Device migration commands from the hypervisor are
> > not forwarded inside the guest.
> 
> I don't see why.
> 
> For example, can you explain how a virtio reset in guests can survive with your
> proposal but not a virtio suspend?
>
In this proposal, virtio reset goes directly from the guest to the device without hypervisor intervention.
When such a device reset is done, it neither resets the device context nor clears the dirty page records, because those are managed by the controlling function.
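
The separation described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class and field names (MemberDevice, OwnerHeldState, vq0_avail_idx) are hypothetical and are not taken from the patch series or the virtio specification. It shows only that a guest-initiated reset touches the member device's guest-visible state, while the state tracked through the controlling function is left alone.

```python
from dataclasses import dataclass, field


@dataclass
class MemberDevice:
    """Guest-visible state of a passthrough member device (hypothetical model)."""
    status: int = 0
    queues_enabled: bool = False


@dataclass
class OwnerHeldState:
    """Migration state managed via the controlling function (hypothetical model)."""
    device_context: dict = field(default_factory=dict)
    dirty_pages: set = field(default_factory=set)


def guest_virtio_reset(member: MemberDevice) -> None:
    # The guest resets the device directly; no hypervisor mediation is modeled.
    # Only guest-visible state is cleared here.
    member.status = 0
    member.queues_enabled = False


def demo():
    member = MemberDevice(status=0x0F, queues_enabled=True)
    owner_view = OwnerHeldState()
    # Assumed example values recorded while migration is in progress.
    owner_view.device_context["vq0_avail_idx"] = 42
    owner_view.dirty_pages.update({0x1000, 0x2000})

    guest_virtio_reset(member)
    # The owner-held records are untouched by the guest reset.
    return member, owner_view
```

In this sketch the reset function never receives the owner-held object at all, which mirrors the claim that those records belong to the controlling function rather than to the guest-driven reset path.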

> I'm not convinced that the scalability is broken by just having 2 or 3 more
> registers. We all know MSIX requires much more than this.
>
Saying that some other high-resource-consuming method exists, so let's consume more in the new interface we define now, is not a good approach.
Work on MSI-X v2 is underway. Hopefully it will be finished this year; it is already cutting down the O(N) resources.

> You can't solve all the issues in one series. As stated many times, 
I am not solving all issues in one series. This series builds the infrastructure.

> if you really
> care about the scalability, the correct way is to behave like a real transport
> through virtqueue instead of trying to duplicate the functionality of transport
> virtqueue slowly.
> 
This will be impossible because no device will transport driver notifications using a virtqueue.
Therefore, a virtqueue is not some generic transport that does everything; it is as simple as that. Hence there is no transport virtqueue.

And a virtqueue for bulk data transfer already exists, so there is no need to invent yet another thing without a good reason.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-17  2:10                             ` Jason Wang
@ 2023-10-17  3:45                               ` Parav Pandit
  2023-10-18  0:52                                 ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-17  3:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 17, 2023 7:41 AM
> 
> On Fri, Oct 13, 2023 at 2:40 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Friday, October 13, 2023 6:48 AM
> > >
> > > On Thu, Oct 12, 2023 at 7:37 PM Parav Pandit <parav@nvidia.com>
> > > wrote:>
> > > > As Michael said, software based nesting is used..
> > >
> > > I've pointed out in another thread when hardware has less
> > > abstraction level than nesting, trap/emulation is a must.
> > >
> > > > See if actual hw based devices can implement it or not. Many
> > > > components of
> > > cpu cannot do N level nesting either, but may be virtio can.
> > > > I don’t know how yet.
> > >
> > > I would not repeat the lessons given by Gerald J. Popek and Robert P.
> > > Goldberg[1] in 1976, but I think you miss a lot of fundamental
> > > things in the methodology of virtualization.
> > Weekend is coming. I will read it.
> >
> > > For example, nesting is a very important criteria to examine whether
> > > an architecture is well designed for virtualization.
> > >
> >
> > In my reading of a leading OS vendor documentation, I leant that OS vendor
> do not recommend nested virtualization for production at [1].
> > Snippet:
> > "In addition, Red Hat does not recommend using nested virtualization in
> production user environments, due to various limitations in functionality.
> Instead, nested virtualization is primarily intended for development and testing
> scenarios."
> >
> > [1]
> > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux
> > /8/html/configuring_and_managing_virtualization/creating-nested-virtua
> > l-machines_configuring-and-managing-virtualization
> >
> > 2nd leading hypervisor listed nested virtualization to be not used for
> "performance sensitive applications".
> 
> Another concept shift.
> 
> I'm not going to comment on the choice for individual distros. But the points are
> whether we can deploy a nesting virtualization easily under a specific hardware
> architecture. In this regard, the above is a good example.
>
And most such nesting seems to be for non-production use, helpful for debugging and more.

And as far as I understand, outside of virtio, nesting does not work without trap + emulation for > 2 levels of nesting.
Take Intel PML, for example. How many levels of nesting does the hw do for PML?
 
> Again, just a simple google will tell you the instances that support nesting have
> been available for almost all the major cloud vendors for a while.
> 
From cpu data sheets, it does not appear that hw is able to do such nesting.

> >
> > I want to repeat and emphasize that I am not ignoring the nested case.
> >
> > An extension for nesting would be the VF presented to the guest itself with
> SR-IOV capability can work as_is as proposed here.
> 
> How can a VF have the SR-IOV capability?
>
One option is trap + emulation.
The second is to have it actually on the VF, which would follow the true definition of nesting.
 
> > Michael presented the idea of the dummy PF, which is to represent the VF as
> dummy PF which can do the SR-IOV with one VF.
> 
> Why do we need the complicated SR-IOV emulation at the nesting level?
You have to complicate it one way or the other.
And here it does not look complicated because it uses all the existing constructs already defined at the VMM and GVM level.
It follows both of the principles you listed from the paper, i.e. (a) efficiency and (b) the equivalence property.

> How can you make sure such a design can result in a live migration to be done at
> any levels?
>
I will propose a design that is practical and has some use case.
I will not propose theoretical work that no one will implement.
 
> E.g in LN, you had a PF and a VF. How to migrate the PF to this level?
> You want two PFs in the L(N-1) level?
> 
Likely yes, as a dummy PF with emulated caps.

> > You need the support from the platform too, I guess TC can extend it.
> > May be a different interface more suitable for nested case which do not have
> performance needs.
> 
> I disagree, it's about if the performance can satisfy the requirement at N level.
> 
> >
> > How about a nested user to have AQ located on the VF so that mediation sw
> can operate admin commands over self?
> 
> I would go with such complicated architecture.
>
You meant you wouldn't, right?

Also, your paper clearly highlights the case where "execution of privileged instruction in vm occurs, which would have effect of changing machine resources".
In the passthrough case it is not a privileged instruction, because the resource is not composed by the machine; it is already done by the device.
Hence for such a cvq operation, a trap is not needed for a member virtio device.

It would make sense to trap the cvq for a non virtio device, where the cvq is composed as part of the machine resource.
 
> > Device mode commands will not be applicable there, instead some other
> things to be done.
> > So non passthrough mode software possibly can make use of it?
> 
> It would be a great burden if you
> 
> 1) use passthrough in L0
> 2) use trap/emulation in L(N+1)
>
How is this different than Intel PML hw?
 
> >
> > > That is to say for any CPU/hypervisor vendors, the architecture
> > > should be designed to run any levels of nesting instead of just an
> > > awkward 2 levels (but what you proposed can not work for even 2).
> > Huh, some missing text for corner case as making claim, _not_working in not a
> healthy discussion.
> >
> > > For x86 and KVM, any level of
> > > nesting has been done for about 10 years ago.
> > >
> > I didn’t find hw for PML support in x86 for N or 3 level nesting. Did I miss?
> > I didn’t find hw for nested page tables upto N level walking on the PCIe
> read/writes in any cpu. Did I miss?
> 
> You need first asking why it is a must to achieve nested virtualization. All of
> those obstacles come only if you want to use "passthrough" for any levels.
> 
> > Have you seen nesting in hw works at N level?
> 
> Again, hardware can't have endless resources for endless levels. 
Can you please list two or three features that are implemented in hw for > 2 levels?

> Trap and
> emulation is a must for achieving nesting virtualization. If you try to invent a
> passthrough method that can work for any level, you will probably fail

It at least follows the design principle of the paper you suggested.
I don’t see the point of designing something for N-level nesting in the first go when the rest of the ecosystem is not there to support it at the hw level.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-17  3:26                             ` Parav Pandit
@ 2023-10-18  0:52                               ` Jason Wang
  2023-10-18  4:30                                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-18  0:52 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Oct 17, 2023 at 11:26 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 17, 2023 7:23 AM
> >
> > On Fri, Oct 13, 2023 at 2:36 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Friday, October 13, 2023 6:47 AM
> > > >
> > > > On Thu, Oct 12, 2023 at 6:58 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > Sent: Thursday, October 12, 2023 3:51 PM
> > > > > >
> > > > > > On 10/11/2023 7:43 PM, Parav Pandit wrote:
> > > > > > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > >> Sent: Wednesday, October 11, 2023 3:55 PM
> > > > > > >>>>>>> I don’t have any strong opinion to keep it or remove it
> > > > > > >>>>>>> as most stakeholders
> > > > > > >>>>>> has the clear view of requirements now.
> > > > > > >>>>>>> Let me know.
> > > > > > >>>>>> So some people use VFs with VFIO. Hence the module name.
> > > > > > >>>>>> This sentence by itself seems to have zero value for the
> > > > > > >>>>>> spec. Just
> > > > drop it.
> > > > > > >>>>> Ok. Will drop.
> > > > > > >>>> So why not build your admin vq live migration on our config
> > > > > > >>>> space solution, get out of the troubles, to make your life easier?
> > > > > > >>>>
> > > > > > >>> Your this question is completely unrelated to this reply or
> > > > > > >>> you misunderstood
> > > > > > >> what dropping commit log means.
> > > > > > >> if you can rebase admin vq LM on our basic facilities, I
> > > > > > >> think you dont need to talk about vfio in the first place, so
> > > > > > >> I ask you to re-consider
> > > > > > Jason's proposal.
> > > > > > > I don’t really know why you are upset with the vfio term.
> > > > > > > It is the use case of the cloud operator and it is listed to
> > > > > > > indicate how proposal
> > > > > > fits in a such use case.
> > > > > > > If for some reason, you don’t like vfio, fine. Ignore it and move on.
> > > > > > >
> > > > > > > I already answered that I will remove from the commit log,
> > > > > > > because the
> > > > > > requirements are well understood now by the committee.
> > > > > > >
> > > > > > > Your comment is again unrelated (repeated) to your past two
> > questions.
> > > > > > >
> > > > > > > I explained you the technical problem that admin command (not
> > > > > > > admin VQ)
> > > > > > of basic facilities cannot be done using config registers
> > > > > > without any mediation layer.
> > > > > > OK, I pop-ed Jason's proposal to make everything easier, and I
> > > > > > see it is
> > > > refused.
> > > > > Because it does not work for passthrough mode.
> > > >
> > > > How and why? What's wrong with just passing through the newly
> > > > introduced 2 or 3 registers to guests?
> > > >
> > > If passed to the guest who is not involved in the live migration flow, cannot
> > operate the device.
> > > VF = member device = controlled function PF = owner device =
> > > controlling function Device migration commands from the hypervisor are
> > > not forwarded inside the guest.
> >
> > I don't see why.
> >
> > For example, can you explain how a virtio reset in guests can survive with your
> > proposal but not a virtio suspend?
> >
> In this proposal virtio reset goes directly from guest to the device without intervention of hypervisor.

Why can suspend go directly from guest to device then?

> When such device reset is done, it does not reset the device context, nor it clears the dirty page records, because they are done by the controlling function.
>
> > I'm not convinced that the scalability is broken by just having 2 or 3 more
> > registers. We all know MSIX requires much more than this.
> >
> Saying there is some other high resource consumer method exists, so lets consume more in new interface we do now, is not good approach.

This is self-contradictory and a double standard. You allow #MSI-X
vectors to grow but not config?

> MSI-X on its v2 is underway. Hopefully it will be finished this year, which is already cutting down O(N) resources.

Why couldn't such a method be applied to config registers and others?

>
> > You can't solve all the issues in one series. As stated many times,
> I am not solving all issues in one series. This series builds the infrastructure.

It's you that is raising the scalability issue, and what you've
ignored is that such an "issue" has existed for many years and various
hardware has been built on top of that.

Again, we should make sure the function is correct before we can talk
about others, otherwise it would be an endless discussion.

>
> > if you really
> > care about the scalability, the correct way is to behave like a real transport
> > through virtqueue instead of trying to duplicate the functionality of transport
> > virtqueue slowly.
> >
> This will be impossible because no device will transport driver notifications using a virtqueue.
> Therefore, virtqueue is not some generic transport that does everything - as simple as that. hence there is no transport virtqueue.

You wouldn't reach such a wrong conclusion if you read that proposal.

>
> And virtqueue for bulk data transfer exists so no need to invent yet another thing without a good reason.

I don't understand why this is related to transport virtqueue anyhow,
it's also a queue interface, no?

Thanks


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-17  3:45                               ` Parav Pandit
@ 2023-10-18  0:52                                 ` Jason Wang
  2023-10-18  5:28                                   ` Parav Pandit
  2023-10-18  6:13                                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-18  0:52 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Oct 17, 2023 at 11:46 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 17, 2023 7:41 AM
> >
> > On Fri, Oct 13, 2023 at 2:40 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Friday, October 13, 2023 6:48 AM
> > > >
> > > > On Thu, Oct 12, 2023 at 7:37 PM Parav Pandit <parav@nvidia.com>
> > > > wrote:>
> > > > > As Michael said, software based nesting is used..
> > > >
> > > > I've pointed out in another thread when hardware has less
> > > > abstraction level than nesting, trap/emulation is a must.
> > > >
> > > > > See if actual hw based devices can implement it or not. Many
> > > > > components of
> > > > cpu cannot do N level nesting either, but may be virtio can.
> > > > > I don’t know how yet.
> > > >
> > > > I would not repeat the lessons given by Gerald J. Popek and Robert P.
> > > > Goldberg[1] in 1976, but I think you miss a lot of fundamental
> > > > things in the methodology of virtualization.
> > > Weekend is coming. I will read it.
> > >
> > > > For example, nesting is a very important criteria to examine whether
> > > > an architecture is well designed for virtualization.
> > > >
> > >
> > > In my reading of a leading OS vendor documentation, I leant that OS vendor
> > do not recommend nested virtualization for production at [1].
> > > Snippet:
> > > "In addition, Red Hat does not recommend using nested virtualization in
> > production user environments, due to various limitations in functionality.
> > Instead, nested virtualization is primarily intended for development and testing
> > scenarios."
> > >
> > > [1]
> > > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux
> > > /8/html/configuring_and_managing_virtualization/creating-nested-virtua
> > > l-machines_configuring-and-managing-virtualization
> > >
> > > 2nd leading hypervisor listed nested virtualization to be not used for
> > "performance sensitive applications".
> >
> > Another concept shift.
> >
> > I'm not going to comment on the choice for individual distros. But the points are
> > whether we can deploy a nesting virtualization easily under a specific hardware
> > architecture. In this regard, the above is a good example.
> >
> And most of such nesting seems for non production use, helpful for debugging and more.

I asked you to google, but you refuse to spend one minute doing that
while spending several days debating this fact:

https://cloud.google.com/compute/docs/instances/nested-virtualization/overview

Please don't waste the time of both of us.

>
> And the nesting is not working without trap + emulation for > 2 level of nesting outside of virtio as far as I understand.

Read the above link.

> Like Intel PML. How many levels of nesting is done by hw for PML?
>
> > Again, just a simple google will tell you the instances that support nesting have
> > been available for almost all the major cloud vendors for a while.
> >
> From cpu data sheets, it does not appear that hw is able to do such nesting.

For PML, it's up to the CPU vendor to consider a good way to be self
virtualized. If it's not, it's a design defect. This is not the place
to discuss the design choices of a specific CPU vendor. If you are
really interested in this, you can go back in the archives to figure
out why AMD nesting was done much earlier than Intel's.

>
> > >
> > > I want to repeat and emphasize that I am not ignoring the nested case.
> > >
> > > An extension for nesting would be the VF presented to the guest itself with
> > SR-IOV capability can work as_is as proposed here.
> >
> > How can a VF have the SR-IOV capability?
> >
> One option is by trap + emulation.

Great.

> Second is having it actually on the VF, which will follow the true definition of nesting.

How is a VF allowed to have the SR-IOV capability by the spec?

>
> > > Michael presented the idea of the dummy PF, which is to represent the VF as
> > dummy PF which can do the SR-IOV with one VF.
> >
> > Why do we need the complicated SR-IOV emulation at the nesting level?
> You have to complicate one way or the other.

How? I've demonstrated that you won't end up with such complications
if everything is self contained.

> And here it does not look complicated because it uses all existing defined constructs available at VMM and GVM level.
> It follows both the principles you listed in the paper, i.e. (a) efficiency and (b) equivalence property.

In order to achieve (b), you need to have many PFs and many levels,
which is an obviously unnecessary complication.

>
> > How can you make sure such a design can result in a live migration to be done at
> > any levels?
> >
> I will propose design that is practical and has some use case.
> I will not propose theoretical work that no one will implement.

Again, it only matters if you want to do everything in passthrough
mode, which is not the methodology proven by [1]. It does not matter
if you stick to trapping.

>
> > E.g in LN, you had a PF and a VF. How to migrate the PF to this level?
> > You want two PFs in the L(N-1) level?
> >
> Likely yes as dummy PF with emulated caps.

Ok, so you will have N PFs in L0 which is unrealistic. Not only
because of the limitation of the resources but also because there's no
way for the hypervisor to know how many levels of nesting are being
used.

>
> > > You need the support from the platform too, I guess TC can extend it.
> > > May be a different interface more suitable for nested case which do not have
> > performance needs.
> >
> > I disagree, it's about if the performance can satisfy the requirement at N level.
> >
> > >
> > > How about a nested user to have AQ located on the VF so that mediation sw
> > can operate admin commands over self?
> >
> > I would go with such complicated architecture.
> >
> You like meant, you wouldn't, Right?

Right.

>
> Also, following your paper which clearly highlights, "execution of privileged instruction in vm occurs, which would have effect of changing machine resources".
> In the passthrough case it is not the privileged instruction because the resource is not composed by the the machine, it is already done by the device".

How do you know that? With save/load of a device state, you can
schedule/share a VF among multiple VMs. Then you still want to pass
through everything? Let's just not invent a mechanism that can only
work for a very limited use case.

> Hence for such cvq operation trap is not to be done for member virtio device.
>
> It would make sense to trap cvq for non virtio device, where cvq is composed as part of the machine resource.
>
> > > Device mode commands will not be applicable there, instead some other
> > things to be done.
> > > So non passthrough mode software possibly can make use of it?
> >
> > It would be a great burden if you
> >
> > 1) use passthrough in L0
> > 2) use trap/emulation in L(N+1)
> >
> How is this different than Intel PML hw?

Let me clarify my point. I meant:

You can't simply use passthrough in order to live migrate at any
level. So what you can do is:

1) use passthrough for the VF in L0
2) use trap/emulation for the PF/VF in L1 and LN

Isn't this much more complicated than simply having a self-contained
device for the VF? Then you don't need the composition of a PF at any
level. No?

>
> > >
> > > > That is to say for any CPU/hypervisor vendors, the architecture
> > > > should be designed to run any levels of nesting instead of just an
> > > > awkward 2 levels (but what you proposed can not work for even 2).
> > > Huh, some missing text for corner case as making claim, _not_working in not a
> > healthy discussion.
> > >
> > > > For x86 and KVM, any level of
> > > > nesting has been done for about 10 years ago.
> > > >
> > > I didn’t find hw for PML support in x86 for N or 3 level nesting. Did I miss?
> > > I didn’t find hw for nested page tables upto N level walking on the PCIe
> > read/writes in any cpu. Did I miss?
> >
> > You need first asking why it is a must to achieve nested virtualization. All of
> > those obstacles come only if you want to use "passthrough" for any levels.
> >
> > > Have you seen nesting in hw works at N level?
> >
> > Again, hardware can't have endless resources for endless levels.
> Can you please list two or 3 hw features that are in hw, for > 2 levels?

Why do I need to do this? What I'm saying is that hardware doesn't
need to be designed for N levels. What it needs is to make sure it
satisfies the requirements proved by [1].

>
> > Trap and
> > emulation is a must for achieving nesting virtualization. If you try to invent a
> > passthrough method that can work for any level, you will probably fail
>
> It at least follows the design principle of the paper you suggested.

I don't see it this way, see the above reply. The paper is for trap
and emulation for sure but you propose to pass through everything.

> I don’t see a point of designing something for N level nesting in first go when rest eco system is not there to support it at hw level.

Your design complicates the nesting a lot. We have a hands-on
methodology, well studied since the 1970s, which you
refuse to start with. Then you may end up with a lot of issues.

What's more, your design is incomplete as it can't be used for migrating:

1) owner
2) virtio devices that aren't structured as owner/member

That's why I see this as incomplete and immature.

Thanks




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  0:52                               ` Jason Wang
@ 2023-10-18  4:30                                 ` Parav Pandit
  2023-10-18  6:14                                   ` Michael S. Tsirkin
  2023-10-19  2:41                                   ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  4:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, October 18, 2023 6:23 AM

> Why can suspend go directly from guest to device then?
>
Because all the virtio registers are treated equally by the live migration driver.
So why not?
As explained, the device synchronizes all the operations which are not mediated by the VMM.

If somehow you claim that all the synchronization is possible _only_ in software in various mediation layers,
and that it is impossible in a single place in the device, then I do not agree.
V2 listed most of the synchronization points of the device.
 
> > When such device reset is done, it does not reset the device context, nor it
> clears the dirty page records, because they are done by the controlling function.
> >
> > > I'm not convinced that the scalability is broken by just having 2 or
> > > 3 more registers. We all know MSIX requires much more than this.
> > >
> > Saying there is some other high resource consumer method exists, so lets
> consume more in new interface we do now, is not good approach.
> 
> This is self-contradictory and double standard. You allow #MSI-X vectors to
> grow but not config?
>

 
> > MSI-X on its v2 is underway. Hopefully it will be finished this year, which is
> already cutting down O(N) resources.
> 
> Why couldn't such a method be applied to config registers and others?
>
Because config registers are not of a de-duplicating type.
Meaning, each config register is unique in nature. For such work a queue/dma approach is taken.
So it is applied to config registers and others, so that they are not placed as _always_available registers.
We just need to do that in virtio too.
All the recent work of flow filters, counters no longer rely on the config registers as we both agreed in discussion [1] with your comment

"Adding cvq is much easier than inventing(duplicating) the work of a transport."

[1] https://lore.kernel.org/virtio-comment/CACGkMEseZeT4VX8Ut-7KraxLKNOMKOgFDNxqKofXSFT8yHfg-w@mail.gmail.com/#t
> >
> > > You can't solve all the issues in one series. As stated many times,
> > I am not solving all issues in one series. This series builds the infrastructure.
> 
> It's you that is raising the scalability issue, and what you've ignored is that such
> an "issue" has existed for many years and various hardware has been built on
> top of that.
>
That does not mean one should continue with such an issue.
And that hardware consumes more power and memory, which results in overall device inefficiency.
You should have objected to the IMS patches in the Linux kernel; you should also object to the new MSI-X proposal and say just use registers.

> Again, we should make sure the function is correct before we can talk about
> others, otherwise it would be an endless discussion.
>
And using registers is not the way to make it correct.
Let's make sure that the basic function for the member device at the first level is correct.

 
> >
> > > if you really
> > > care about the scalability, the correct way is to behave like a real
> > > transport through virtqueue instead of trying to duplicate the
> > > functionality of transport virtqueue slowly.
> > >
> > This will be impossible because no device will transport driver notifications
> using a virtqueue.
> > Therefore, virtqueue is not some generic transport that does everything - as
> simple as that. hence there is no transport virtqueue.
> 
> You won't get such a wrong conclusion if you read that proposal.
I have read those 4 or 5 patches posted by Lingshan and showed you at that time that driver notifications are not coming via virtqueue.
And if I missed it, and if they are coming via virtqueue, it does not meet the performance and "efficiency principle" from the paper you pointed to.

> 
> >
> > And virtqueue for bulk data transfer exists so no need to invent yet another
> thing without a good reason.
> 
> I don't understand why this is related to transport virtqueue anyhow, it's also a
> queue interface, no?
Transport virtqueue is a diversion to an unrelated topic here.

A guest VM driver must be able to talk to the member device for all queue configuration etc. through its own channel, not mediated by the hypervisor.
Otherwise such plumbing does not work for any confidential compute workload. Hence, I wouldn’t discuss transport virtqueue for now.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-16  9:44                           ` Zhu, Lingshan
@ 2023-10-18  5:00                             ` Parav Pandit
  2023-10-18  6:32                               ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  5:00 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, October 16, 2023 3:14 PM
> 
> On 10/13/2023 7:28 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Friday, October 13, 2023 2:36 PM
> > [..]
> >>> Because it does not work for passthrough mode.
> >> what are you talking about?
> >> Config space does not work passthrough?
> > Once the register space of the VF that is supposed to be used by the live
> migration is passed to the guest, it is under guest control.
> > Hence, live migration driver won't be able to use it.
> Does guest control device status to reset itself? harmful?
No. It is not harmful.
Is the owner device resetting itself harmful? No.
Is the member device resetting itself harmful? No.
Should member device reset 
> These facilities can be trapped and emulated, even the feature bits, right?
> You know the guest actually don't direct access the device config space, there is
> a vfio/vdpa driver, right?
You can practically trap and emulate everything.
If you continue to ignore the passthrough requirements and keep repeating "do trap and emulate", this discussion does not go anywhere.


> >
> >> Have you ever tried pass through a virtio device to a guest?
> > :)
> > Please explain how the question is relevant to this discussion in separate
> thread, so that one can keep technical focus.
> > (Please keep your discussion technical, instead of derogatory to other
> members).
> if you want me to answer your question, at least you SHOULD NOT cut off the
> context, or you are trying to confuse everyone.
> Or did you try to avoid or hide anything? I am not sure this is a good practice.
> 
> The context in last discussion is:
> 
> me: OK, I pop-ed Jason's proposal to make everything easier, and I see it is
> refused.
> you: Because it does not work for passthrough mode.
> me: what are you talking about?
>      Config space does not work passthrough?
>      Have you ever tried pass through a virtio device to a guest?
> 
> So I ask you try to pass through a virito-pci device to a guest, then check
> whether the config space work for pass-through mode.
> 
> again, don't cut off threads before the discussion is closed.
> >
> >> Let me repeat again, these live migration facilities are
> >> per-device(per-VF) facility, so it only migrates itself.
> >>
> > Since they are per device (per VF), they reside in the guest VM. Hence, VMM
> cannot live migrate it.
> you know the config space can be trapped and emulated, and the hypervisor
> takes the ownership of the device once the guest freeze in the stop window.
When you say config space, do you mean PCI config space of 4K size?

> >
> >> And for pass through, you can try passthrough a virito device to a
> >> guest, see how the guest initialize the device through the config space.
> >>
> >> That is really basic virtualization, not hard to test.
> > Repeated points, I am omitting.
> ok, if you get it, let's close it.
> >
> >>>>>> inflight descriptor tracking will be implemented by Eugenio in V2.
> >>>>> When we have near complete proposal from two device vendors, you
> >>>>> want to push something to unknown future without reviewing the
> >>>>> work; does not
> >>>> make sense.
> >>>> Didn't I ever provide feedback to you? Really?
> >>> No. I didn’t see why you need to post a new patch for dirty page
> >>> tracking,
> >> when it is already present in this series.
> > This is plain ignorance and shows non_cooperative mode of working in
> technical committee.
> you have cut off the tread again, so I can't read the context.
Enjoy long threads. 😊
> >
> >>> I would like to understand and review this aspects.
> >>> Same for the device context.
> >> you will see dirty page tracking in my V2, as I repeated for many times.
> > Since you are not co-operative, I have less sympathy to see V2.
> > I don’t see a reason to see when, it is fully presented here.
> Again, please don't take it personal and please be professional.
> 
> Speaking of collaboration, please at least respect others' time and answers.
> Both Jason and I have responded to you multiple times on the same
> questions(for example, FLR, nested, passthrough).
> If our answers are ignored again and again, and then after a few days or hours
> you come back asking the same question again, what's the point?
> 
I didn’t ask questions in area of FLR and passthrough, please check again.

> And please don't cut off any threads before we close the discussion.
> >
> >> For device context, we have discussed this in other threads, did you
> >> ignored that again?
> > No. I didn’t. I replied that the generic infrastructure is built the enables every
> device type to migrate by defining their device context.
> don't we have a conclusion there or did you miss anything? Since you refuse to
> define device context for every device type, how do you migrate stateful
> devices?
> 
> So we should implement a stateless live migration solution, right?
No. Device context is a basic facility that is intended to cover most virtio devices.
I did not refuse to define the context.
I said the device context will be defined incrementally.
Like Michael said, I expect every device to define a device context section in the coming months, in the 1.4 time frame.

> >
> >> Hint: how do you define device context for every device type, e.g, virtio-fs.
> >> Don't say you only migrate virito-net or blk.
> > I didn’t say it. I said to migrate all 30+ device types.
> > And infrastructure is presented here.
> so please define device context for all the devices.
> how about starting from virtio-fs?
Should be done incrementally.

> >
> >>>>> You are still in the mode of _take_ what we did with near zero
> explanation.
> >>>>> You asked question of why passthrough proposal cannot advantage of
> >>>>> in_band
> >>>> config registers.
> >>>>> I explained technical reason listed here.
> >>>> I have answered the questions, and asked questions for many times.
> >>>> What do you mean by "why passthrough proposal cannot advantage of
> >>>> in_band config registers."?
> >>>> Config space work for passthrough for sure.
> >>> Config space registers are passthrough the guest VM.
> >>> Hence hypervisor messing it with, programming some address would
> >>> result in
> >> either security issue.
> >>> Or functionally broken, to sustain the functionality, each nested
> >>> layer needs
> >> one copy of these registers for each nest level.
> >>> So they must be trapped somehow.
> >> trap and emulated are basic virtualization.
> > Not for passthrough devices, sorry.
> > See the paper that Jason pointed out.
> > Control program/vmm is trap is involved only on the privileged operation of
> the VMM.
> > Virtio cvqs, virtio registers are not the privileged operation of the VMM,
> because they are of the native virtio device itself.
> > Period.
> since the context is cut of again, I failed to read the context.
> 
> But config space can be trapped and emulated, right?
Answered above.

> When guest accessing device config space, actually it access the hypervisor-
> presented config space.
> >
> >>> Secondly I don’t see how one can read 1M flows using config registers.
> >> Not sure what you are talking about, beyond the spec?
> > The spec which is under works for few months by multiple technical
> members.
> > Please subscribe to virtio-comment mailing list.
> > How come you changed your point from cvq to different argument of out
> > of spec? :)
> I mean, what is your 1M flows? is it beyond spec?

No. It is not beyond the spec.
It is spec work that has been in progress for several months among multiple device, OS and cloud operators.
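
For a sense of scale, the figures quoted elsewhere in this thread (1 million flow filter entries at 64 bytes each) can be worked through as below. This is only a rough sketch: the 4-byte register width is an assumption made for illustration and is not taken from the spec or from this series.

```python
# Rough sizing of the flow filter state discussed in this thread.
FLOWS = 1_000_000        # number of flows, as quoted in the thread
BYTES_PER_FLOW = 64      # per-flow size, as quoted in the thread
REG_WIDTH_BYTES = 4      # assumption: one 32-bit register read at a time

total_bytes = FLOWS * BYTES_PER_FLOW
register_reads = total_bytes // REG_WIDTH_BYTES

print("total bytes:", total_bytes)
print("total MB (decimal):", total_bytes / 10**6)
print("32-bit register reads needed:", register_reads)
```

Under that assumption, the device context for the flows alone would take millions of individual register accesses, which is the concern raised about a register-based interface.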

> >
> >>>>> So please don’t jump to conclusions before finishing the
> >>>>> discussion on how
> >>>> both side can take advantage of each other.
> >>>>> Lets please do that.
> >>>> We have proposed a solution, right?
> >>>>
> >>> Which one? To do something in future?
> >>> I don’t see a suggestion on how one can use device context and dirty
> >>> page
> >> tracking for nested and passthrough uniformly.
> >>> I see a technical difficulty in making both work with uniform interface.
> >> Please don't ignore previous answers, don't force us repeat again and again.
> >>
> > You didn’t answer, how.
> > Your answer was "you will post dirty page tracking without reviewing current"
> and Eugenio will post v2....
> Yes, will do. and you can check the patch when it posted.
>
Does not make sense to me at all.
 
> Eugenio will cook a patch for in-flight descriptors, not dirty page, that is mine.
> >
> >> It is Jason's proposal. Please refer to previous threads, also for
> >> device context and dirty pages.
> >>>> I still need to point out: admin vq LM does not work, one example is
> nested.
> >>> As Michael said, please don’t confuse between admin commands and
> >>> admin
> >> vq.
> >> anyway, admin vq live migration don't work for nested.
> > I am convicned with the paper that Jason pointed out.
> >
> > A nested solution involves a member device supporting the nesting without
> trap and emulation so that it follows the two properties:
> > The efficiency property and equivalence property.
> >
> > Hence a member device which wants to support nested case, should present
> itself with attributes to support nesting.
> failed to process the sentence, but I am glad you are convinced by the paper.
> >
> >
> >>>>>> There are no scale problem as I repeated for many time, they are
> >>>>>> per-device basic facilities, just migrate the VF by its own
> >>>>>> facility, so there are no 40000 member devices, this is not per PF.
> >>>>>>
> >>>>> I explained that device reset, flr etc flow cannot work when
> >>>>> controlling and
> >>>> controlled functions are single entity for passthrough mode.
> >>>>> The scale problem is, one needs to duplicate the registers on each VF.
> >>>>> The industry is moving away from the register interface in many
> >>>>> _real_ hw
> >>>> devices implementation.
> >>>>> Some of the examples are IMS, SIOV, NVMe and more.
> >>>> we have discussed this for many times, please refer to previous
> >>>> threads, even with Jason.
> >>> I do not agree for any registers to add to the VF which are reset on
> >> device_reset and FLR.
> >>> As it does not work for passthrough mode.
> >> Jason has answered your these FLR questions for many times, I don't
> >> want to repeat his words, even myself have answered many times. If
> >> you keep ignoring the answers, and ask again and again, what is the point?
> >>
> >> So please refer to the previous threads.
> > I don’t think I asked the question above. Please re-read.
> you cut if off again, what question? if about FLR, I believe Jason has answered
> for many times.
> >
Again, please read. I didn’t ask the question for FLR.
You keep saying "what question".

> >>>>>> The device context can be read from config space or trapped, like
> >>>>>> shadow
> >>>>> There are 1 million flows of the net device flow filters in progress.
> >>>>> Each flow is 64B in size.
> >>>>> Total size is 64MB.
> >>>>> I don’t see how one can read such amount of memory using config
> >> registers.
> >>>> control vq?
> >>> The control vq and flow filter vqs are owned by the guest driver,
> >>> not the
> >> hypervisor.
> >>> So no, cvq cannot be used.
> >> first, don't cut off the threads, don't delete words, that really confusing
> readers.
> >>
> > Your comments are so long that it is hard to follow such a long thread.
> > Hence only the related comments are kept.
> > But I understand, will try to avoid.
> >
> >> And I think you misunderstand a lot of virtualization fundamentals,
> >> at least have a look at how shadow control vq works.
> >>
> > In case if you don’t know, the shadow cvq acceleration for Nvidia ConnectX6-
> DX is done jointly with Dragos and me, with recent patches from Sie-Wei.
> >
> > I don’t think so I missed.
> >
> > Shadow vq is great when you don’t have underlying support from the device.
> >
> > When you have passthrough member devices, they are not trapped or
> emulated.
> > The future hypervisor must not be able to see things of cvq, datavq or
> addressed programmed by the guest.
> > And hence the infrastructure is geared towards such approach.
> I failed to read the full context as you cut off them. I can't even read your
> original questions, they are truncated.
> 
> Anyway, lets migrate device without device-context first.

Passthrough device cannot migrate without device-context as listed.

> >
> >> And the parameters set to config vq are also device context as we
> >> discussed for many times.
> >>>> Or do you want to migrate non-virtio context?
> >>> Every thing is virtio device context.
> >> see above
> >>>>>> control vq which is already done, that is basic virtualization.
> >>>>> There is nothing like "basic virtualization".
> >>>>> What is proposed here is fulfilling the requirement of passthrough mode.
> >>>>>
> >>>>> Your comment is implying, "I don’t care for passthrough
> >>>>> requirements, do
> >>>> non_passthrough".
> >>>> that is your understanding, and you misunderstood it. Config space
> >>>> servers passthrough for many years.
> >>> "Config space servers" ?
> >>> I do not understand it, can you please explain what does that mean?
> >>>
> >>> I do not see your suggestion on how one can implement passthrough
> >>> member
> >> device when passthrough device does the dma and migration framework
> >> also need to do the dma.
> >> Try pass through a virtio device to a guest and learn how the guest
> >> take advantage the config space before you comment.
> > Right. It does not work. The guest is doing the device_reset and flr.
> > Hence, it is resetting everything. All the dirty page log is lost.
> > All the device context is lost.
> > Hypervisor didn’t see any of this happening, because it didn’t do the trap.
> >
> > Look, if you are going to continue to argue that you must do trap +
> > emulation and don’t talk about passthrough, Please stop here, because
> discussion won't go anywhere.
> >
> > I made my best to answer the limitations in very first email where you asked.
> OK, I see the gap, and I am sure we can help you here.
> Try consider a question:
> how do you define pass-through? 
As defined in the cover letter and theory of operation.
To repeat here:
A device whose virtio interfaces are not intercepted by the VMM.
In future, maybe even MSI-X and MSI-X_v2 or a newer interrupt method will be passthrough at device level too
(only cpu level interrupt remapping will be a hypercall at interrupt controller level).

The PCI spec defined config space stays emulated, as it is generic and is not supposed to have any virtio specific things in it, as directed by the PCI-SIG.

> Can a guest access the device without a host driver helper?
Yes, for all the virtio interfaces, which include the virtio device common and device config space, cvq, data vq, flow filter vqs, shared memory and anything new in the future.

> >
> >>> That basic facility is missing dirty page tracking, P2P support,
> >>> device context,
> >> FLR, device reset support.
> >>> Hence, it is unusable right now for passthough member device.
> >>> And 6th problemetic thing in it is, it does not scale with member devices.
> >> Please refer to previous discussions, it is meaningless if you keep
> >> ignoring our answers and keep asking the same questions.
> > Again, please re-read, I didn’t ask the question.
> > I replied 6 problems that are not solved.
> I believe we have answered for many times. The questions are cut off again, but
> how about search for previous answers?
> >
> >>>>>> If you want to migrate device context, you need to specify device
> >>>>>> context for every type of device, net maybe easy, how do you see virtio-
> fs?
> >>>>> Virtio-fs will have its on device context too.
> >>>>> Every device has some sort of backend in varied degree.
> >>>>> Net being widely used and moderate complex device.
> >>>>> Fs being slightly stateful but less complex than net, as it has
> >>>>> far less control
> >>>> operations.
> >>>> so, do you say you have implement a live migration solution which
> >>>> can migrate device context, but only work for net or block?
> >>> I don’t think this question about implementation has any relevance.
> >>> Frankly feels like a court to me. :( No. I dint say that.
> >>> We have implemented net, fs, block devices and single framework
> >>> proposed
> >> here can support all 3 and rest 28+.
> >>> The device context part in this series do not cover special/optional
> >>> things of
> >> all the device type.
> >>> This is something I promised to do gradually, once the framework looks
> good.
> >> If you don't define them, only talking about "migrate the device
> >> context" but don't tell us what do migrate, does this make sense to anybody?
> >>>> Then you should call it virtio net/blk migration and implement in
> >>>> net/block section.
> >>> No. you misunderstood. My point was showing orthogonal complexities
> >>> of net
> >> vs fs.
> >>> I likely failed to explain that.
> >> see above, anyway you need to define them, how about starting form virito
> FS?
> >>>>> In fact virtio-fs device already discusses the migrating the
> >>>>> device side state, as
> >>>> listed in device context.
> >>>>> So virtio-fs device will have its own device-context defined.
> >>>> if you want to migrate it, you need to define it
> >>> Sure.
> >>> Only device specific things to be defined in future.
> >> Now, not future if you want to migrate device context.
> > It is not mandatory, and it is impractical do everything in one series.
> > It is planned for 1.4.
> really, you want to define device context for every device time?
>
Yes.
 
> Remember don't migrate device-context before you define them or how can
> the HW implementions know how to do.
I disagree. The infrastructure is defined, and the device context will be defined incrementally.
See an example in the work from Michael, i.e. the admin command and aq generic facility is defined,
and device migration is able to utilize it incrementally. The lower layer fulfills the requirements.
This is exactly what is done here.

The device context framework is defined, and many device spec owners will be able to easily define their device context, making it migratable.
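
One way to picture the incremental approach is a type-length-value layout, which this thread mentions for the device context. The following is a minimal sketch only: the 2-byte type, 4-byte length, little-endian framing and the type numbers 1 and 99 are all hypothetical and chosen for illustration; the real layout is whatever the series defines.

```python
import struct

# Hypothetical framing: u16 type, u32 length, little-endian, then the value.
HDR = struct.Struct("<HI")

def pack_tlv(tlv_type: int, value: bytes) -> bytes:
    """Encode one device context section as type, length, value."""
    return HDR.pack(tlv_type, len(value)) + value

def parse_tlvs(buf: bytes, known_types: set):
    """Walk a buffer of sections; collect known types, list the others."""
    parsed, skipped = {}, []
    off = 0
    while off < len(buf):
        tlv_type, length = HDR.unpack_from(buf, off)
        off += HDR.size
        value = buf[off:off + length]
        off += length
        if tlv_type in known_types:
            parsed[tlv_type] = value
        else:
            skipped.append(tlv_type)  # section added later by some device type
    return parsed, skipped

# A context with one section this reader knows and one it does not.
ctx = pack_tlv(1, b"common") + pack_tlv(99, b"fs-specific")
parsed, skipped = parse_tlvs(ctx, {1})
print(parsed, skipped)
```

The sketch only shows that such framing lets new sections be added per device type without changing the framing itself. Whether a destination may skip a section it does not recognize is a policy question for the spec, not something this example settles.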

> >
> >>> Rest is already present.
> >>> We are not going to define all the device context in one patch
> >>> series that no
> >> one can review reliably.
> >>> It will be done incrementally.
> >> so you agree at least for now we should migrate stateless devices, right?
> >>> But the feedback, I am taking is, we need to add a command that
> >>> indicates
> >> which TLVs are supported in the device migration.
> >>> So virtio-fs or other device migration capabilities can be discovered.
> >>> I will cover this in v2.
> >> so you propose a solution as "virtio migration", but only migrate
> >> selective types of devices?
> >> You should rename it to be "virtio-net live migration".
> > Sorry, I wont. Because infrastructure is for majority device types.
> >
> > Which field did you observe which is net specific?
> > We want to cover all the device types.
> > Don’t need to cook their context in one series.
> so, not work for all device types? limited to some specific types?
> you still need to rename it what ever.
No. The framework works for all device types.

> >
> >>> Thanks a lot for this thoughts.
> >>>
> >>>>> The infrastructure and basic facilities are setup in this series,
> >>>>> that one can
> >>>> easily extend for all the current and new device types.
> >>>> really? how?
> >>>>>> And we are migrating stateless devices, or no? How do you migrate
> >>>>>> virtio-
> >> fs?
> >>>>>>> 2. sharing such large context and write addresses in parallel
> >>>>>>> for multiple devices cannot be done using single register file
> >>>>>> see above
> >>>>>>> 3. These registers cannot be residing in the VF because VF can
> >>>>>>> undergo FLR, and device reset which must clear these registers
> >>>>>> do you mean you want to audit all PCI features? When FLR, the
> >>>>>> device is rested, do you expect a device remember anything after FLR?
> >>>>> Not at all. VF member device will not remember anything after FLR.
> >>>>>> Do you want to trap FLR? Why?
> >>>>> This proposal does _not_ want to trap the FLR in the hypervisor virtio
> driver.
> >>>>>
> >>>>> When one does the mediation-based design, it must
> >>>>> trap/emulate/fake the
> >>>> FLR.
> >>>>> It helps to address the case of nested as you mentioned.
> >>>> once passthrough, the guest driver can access the config space to
> >>>> reset the device, right?
> >>>>>> Why FLR block or conflict with live migration?
> >>>>> It does not block or conflict.
> >>>> OK, cool, so let's make this a conclusion
> >>>>> The whole point is, when you put live migration functionality on
> >>>>> the VF itself,
> >>>> you just cannot FLR this device.
> >>>>> One must trap the FLR and do fake FLR and build the whole
> >>>>> infrastructure to
> >>>> not FLR The device.
> >>>>> Above is not passthrough device.
> >>>> No, the guest can reset the device, even causing a failed live migration.
> >>> Not in the proposal here.
> >>> Can you please prove how in the current v1 proposal, device reset
> >>> will fail the
> >> migration?
> >>> I would like to fix it.
> >> if the device is reset, it forgets everything right?
> > Right. This is why all dirty page track; device context is lost on device reset.
> > Hence, the controlling function and controlled function are two different
> entities.
> so there can be inconsistent migrations and races, right? And if the guest reset
> the device, actually the hypervisor should let it be, right?
No. it should not be in because hypervisor has not composed the member device. It is in the hw controlled function itself.

> >
> >>>>>>> 4. When VF does the DMA, all dma occurs in the guest address
> >>>>>>> space, not in
> >>>>>> hypervisor space; any flr and device reset must stop such dma.
> >>>>>>> And device reset and flr are controlled by the guest (not
> >>>>>>> mediated by
> >>>>>> hypervisor).
> >>>>>> if the guest reset the device, it is totally reasonable
> >>>>>> operation, and the guest own the risk, right?
> >>>>> Sure, but the guest still expects its dirty pages and device
> >>>>> context to be
> >>>> migrated across device_reset.
> >>>>> Device_reset will lose all this information within the device if
> >>>>> done without
> >>>> mediation and special care.
> >>>> No, if the guest reset a device, that means the device should be
> >>>> RESET, to forget its config, that would be really wired to migrate
> >>>> a fresh device at the source side, to be a running device at the
> >>>> destination
> >> side.
> >>> Device reset not doing the role of reset is just a plain broken spec.
> >> why? The reset behavior is well defined in the spec, and works fine for years.
> > So any new construct that one adds, it will be reset as well and dirty page
> track is lost.
> Yes and do you want to prevent that? You may surprise the guest.
Yes, I want to prevent that.
Not sure what you mean by surprising the guest. Unlikely.
Why? Because the guest did the reset, so it knows what it is doing.
(Keep in mind that the guest does not expect to lose its dirty pages).

> >
> >>>>> So, to avoid that now one needs to have fake device reset too and
> >>>>> build that
> >>>> infrastructure to not reset.
> >>>>> The passthrough proposal fundamental concept is:
> >>>>>
> >>>>> all the native virtio functionalities are between guest driver and
> >>>>> the actual
> >>>> device.
> >>>> see above.
> >>>>>> and still, do you want to audit every PCI features? at least you
> >>>>>> didn't do that in your series.
> >>>>> Can you please list which PCI features audit you are talking about?
> >>>> you audit FLR, then do you want to check everyone?
> >>>> If no, how to decide which one should be audited, why others not?
> >>> I really find it hard to follow your question.
> >>>
> >>> I explained in patch 5 and 8 about interactions with the FLR and its support.
> >>> Not sure what you want me to check.
> >>>
> >>> You mentioned that "I didn’t audit every PCI features"? So can you
> >>> please list
> >> which one and in relation to which admin commands?
> >> Your job to audit everyone if you talk about FLR. Because FLR is PCI
> >> spec, not virtio, you need to explain why other PCI features not need to be
> audited.
> >>
> > Sure, but when you point figure as I didn’t audit, please mention what is not
> audited.
> well, we are migrating virtio devices, but you keep talking PCI, so do you want to
> take every PCI functionality into consideration?
For the PCI transport, yes.

> >
> >> We have explained why FLR is not a concern for many times, and I
> >> don't want to repeat, please refer to previous discussions.
> > You seem to ignore the first paragraph of theory of operation that FLR is not
> trapped.
> this is the guest issue FLR, right? If so the guest owns the risks and the
> hypervisor should not prevent that.
Exactly, the hypervisor does not prevent it.
The owner device is still responsible for not losing the previously logged dirty page addresses.
And the device still needs to report that a device reset occurred, so that the destination side can wipe its state and start fresh.
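
To illustrate the behaviour described above, here is a minimal sketch, not taken from the patches. It assumes the owner device keeps a per-member record; the names OwnerDevice, MemberRecord, member_reset and consume_reset_event are hypothetical. A guest-initiated reset clears the member's context, the dirty page log held on the owner side survives, and the reset is recorded so the destination can start fresh.

```python
from dataclasses import dataclass, field


@dataclass
class MemberRecord:
    """Hypothetical per-member bookkeeping held on the owner side."""
    dirty_pages: set = field(default_factory=set)   # page addresses logged so far
    context: dict = field(default_factory=dict)     # stand-in for the device context
    reset_pending: bool = False                     # a reset/FLR happened and was not yet reported


class OwnerDevice:
    """Hypothetical model of the owner (PF) tracking its member devices (VFs)."""

    def __init__(self):
        self.members = {}

    def add_member(self, vf_id):
        self.members[vf_id] = MemberRecord()

    def log_dirty_page(self, vf_id, page_addr):
        self.members[vf_id].dirty_pages.add(page_addr)

    def member_reset(self, vf_id):
        # The guest resets its own device without the hypervisor trapping it.
        rec = self.members[vf_id]
        rec.context.clear()        # the member forgets its configuration
        rec.reset_pending = True   # remembered so the destination can wipe its copy
        # rec.dirty_pages is deliberately left untouched.

    def read_dirty_pages(self, vf_id):
        # Return the pages logged so far and start a new logging interval.
        rec = self.members[vf_id]
        pages = sorted(rec.dirty_pages)
        rec.dirty_pages.clear()
        return pages

    def consume_reset_event(self, vf_id):
        # Report whether a reset occurred since the last query, then clear the flag.
        rec = self.members[vf_id]
        seen, rec.reset_pending = rec.reset_pending, False
        return seen
```

In this model the migration driver talks only to the owner, so nothing the guest does to the member can discard the log.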

> >
> >>>>> Keep in mind, that will all the mediation, one now must equally
> >>>>> audit all this
> >>>> giant software stack too.
> >>>>> So maybe it is fine for those who are ok with it.
> >>>> so you agree FLR is not a problem, at least for config space solution?
> >>> I don’t know what you mean "FLR is not a problem".
> >>>
> >>> FLR on the VF must work as it works without live migration for
> >>> passthrough
> >> device as today.
> >>> And admin commands have some interactions with it.
> >>> And this proposal covers it.
> >>> I am missing some text that Michael and Jason pointed out.
> >>> I am working on v2 to annotate or better word them.
> >> When guest reset the device, the device should be reset for sure.
> >> then it forgets everything, how do you expect the reset-ed device still work
> for live migration?
> >> is it a race?
> > I don’t expect live migration to work at all with such an approach.
> > This is why in my proposal live migration occurs on the owner device, while
> controlled function (member device) is undergoing the device reset.
> see above
> >
> >>>>>> For migration, you know the hypervisor takes the ownership of the
> >>>>>> device in the stop_window.
> >>>>> I do not know what stop_window means.
> >>>>> Do you mean stop_copy of vfio or it is qemu term?
> >>>> when guest freeze.
> >>>>>>> 5. Any PASID to separate out admin vq on the VF does not work
> >>>>>>> for two
> >>>>>> reasons.
> >>>>>>> R_1: device flr and device reset must stop all the dmas.
> >>>>>>> R_2: PASID by most leading vendors is still not mature enough
> >>>>>>> R_3: One also needs to do inversion to not expose PASID
> >>>>>>> capability of the member PCI device to not expose
> >>>>>> see above and what if guest shutdown? the same answer, right?
> >>>>> Not sure, I follow.
> >>>>> If the guest shutdown, the guest specific shutdown APIs are called.
> >>>>>
> >>>>> With passthrough device, R_1 just works as is.
> >>>>> R_3 is not needed as they are directly given to the guest.
> >>>>> R_2 platform dependency is not needed either.
> >>>> I think we already have a concussion for FLR.
> >>> I don’t have any concussion.
> >>> I wrote what to be supported for the FLR above.
> >> OK, again, our discussions has been ignored again, and all start over again.
> >>
> >> Would you please read our previous discussions?
> > You asked the question about why it wont work, I answered.
> > I don’t see a point of debating same thing over again.
> Is that cut off again?
> 
No, it is not cut off here.

> if still about FLR, so please see above comments.
> And I agree if the answers are ignored again, we don't need to repeat.
I didn’t ask questions. Please re-read.

> >
> >>>> For PASID, what blocks the solution?
> >>> When the device is passthrough, PASID capabilities cannot be emulated.
> >>> PASID space is owned fully by the guest.
> >>>
> >>> There is no single known cpu vendor support splitting pasid between
> >> hypervisor and guest.
> >>> I can double check, but last I recall that Linux kernel removed such
> >>> weird
> >> support.
> >> do you know there is something called vIOMMU?
> > Probably yes.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-16  9:47                             ` Zhu, Lingshan
@ 2023-10-18  5:02                               ` Parav Pandit
  2023-10-18  6:20                                 ` Michael S. Tsirkin
  2023-10-18  6:35                                 ` Zhu, Lingshan
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  5:02 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Zhu, Lingshan
> Sent: Monday, October 16, 2023 3:18 PM
> 
> On 10/13/2023 7:54 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Friday, October 13, 2023 3:14 PM
> >>>>>> How do you transfer the ownership?
> >>>>> An additional ownership delegation by a new admin command.
> >>>> if you think this can work, do you want to cook a patch to
> >>>> implement this before you submitting this live migration series?
> >>> I answered this already above.
> >> talk is cheap, show me your patch
> > Huh. We presented the infrastructure that migrates, 30+ device types,
> covering device context ideas from Oracle.
> > Covering P2P, supporting device_reset, FLR, dirty page tracking.
> >
> > Please have some respect for other members who covered more ground than
> your series.
> >
> > What more? Apply the same nested concept on the member device as
> Michael suggested, it is nested virtualization maintain exact same semantics.
> > So a VF is mapped as PF to the L1 guest.
> > L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
> >
> > This nested work can be extended in future, once first level nesting is
> covered.
> >
> >> Answer all questions above, if you think a management VF can work,
> >> please show me your patch.
> > The idea evolves from technical debate rather than from pointing fingers like your
> comment.
> >
> > I think a positive discussion with Michael and a pointer to the paper from
> Jason gave a good direction of doing _right_ nesting that follows two principles.
> > a. efficiency property
> > b. equivalence property
> >
> > (c. resource control is natural already)
> >
> > Both apply at VMM and at VM level enabling recursive virtualization, by
> having VF that can act as PF inside the guest.
> >
> > [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
> Please just show me your patch resolving these opens, how about start from
> defining virtio-fs device context and your management VF?
As answered, the device context infrastructure is done; device-specific device contexts will be defined incrementally.
I will not be including virtio-fs in this series. It will be done incrementally in the future, utilizing the infrastructure built in this series.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  0:52                                 ` Jason Wang
@ 2023-10-18  5:28                                   ` Parav Pandit
  2023-10-19  2:41                                     ` Jason Wang
  2023-10-18  6:13                                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  5:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, October 18, 2023 6:23 AM
> 
> On Tue, Oct 17, 2023 at 11:46 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, October 17, 2023 7:41 AM
> > >
> > > On Fri, Oct 13, 2023 at 2:40 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Friday, October 13, 2023 6:48 AM
> > > > >
> > > > > On Thu, Oct 12, 2023 at 7:37 PM Parav Pandit <parav@nvidia.com>
> > > > > wrote:>
> > > > > > As Michael said, software based nesting is used..
> > > > >
> > > > > I've pointed out in another thread when hardware has less
> > > > > abstraction level than nesting, trap/emulation is a must.
> > > > >
> > > > > > See if actual hw based devices can implement it or not. Many
> > > > > > components of
> > > > > cpu cannot do N level nesting either, but may be virtio can.
> > > > > > I don’t know how yet.
> > > > >
> > > > > I would not repeat the lessons given by Gerald J. Popek and Robert P.
> > > > > Goldberg[1] in 1976, but I think you miss a lot of fundamental
> > > > > things in the methodology of virtualization.
> > > > Weekend is coming. I will read it.
> > > >
> > > > > For example, nesting is a very important criteria to examine
> > > > > whether an architecture is well designed for virtualization.
> > > > >
> > > >
> > > > In my reading of a leading OS vendor's documentation [1], I learnt that
> > > > the OS vendor does not recommend nested virtualization for production.
> > > > Snippet:
> > > > "In addition, Red Hat does not recommend using nested
> > > > virtualization in
> > > production user environments, due to various limitations in functionality.
> > > Instead, nested virtualization is primarily intended for development
> > > and testing scenarios."
> > > >
> > > > [1]
> > > > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_virtualization/creating-nested-virtual-machines_configuring-and-managing-virtualization
> > > >
> > > > A second leading hypervisor lists nested virtualization as not to be used for
> > > > "performance sensitive applications".
> > >
> > > Another concept shift.
> > >
> > > I'm not going to comment on the choice for individual distros. But
> > > the points are whether we can deploy a nesting virtualization easily
> > > under a specific hardware architecture. In this regard, the above is a good
> example.
> > >
> > And most of such nesting seems for non production use, helpful for
> debugging and more.
> 
> I'm asking you to google, but you refuse to spend 1 minute doing that while
> spending several days debating this fact:
> 
> https://cloud.google.com/compute/docs/instances/nested-virtualization/overview
> 
> Please don't waste the time of both of us.

I showed the link from Red Hat, and the other one is Hyper-V.
You showed a link from Google Cloud.

There are no representatives from Google or Microsoft in this discussion to support the nested case.

I assume that you, as part of Red Hat, see some production use, but Red Hat's public documentation says non-production.

Regardless, I want to emphasize that I am not against the nested use case.

I am highlighting that any L2 nesting in the ecosystem today involves hw emulation.
If this is incorrect, please point to the datasheet (not high-level user documentation).

For L2 nesting, virtio doing hw emulation is fine with me.
If one wants to improve that too, let's have the proper nested VF.
Let's discuss it in the other thread, where you have many questions.


> 
> >
> > And the nesting is not working without trap + emulation for > 2 level of
> nesting outside of virtio as far as I understand.
> 
> Read the above link.
> 
> > Like Intel PML. How many levels of nesting is done by hw for PML?
> >
> > > Again, just a simple google will tell you the instances that support
> > > nesting have been available for almost all the major cloud vendors for a
> while.
> > >
> > From cpu data sheets, it does not appear that hw is able to do such nesting.
> 
> For PML, it's up to the CPU vendor to consider a good way to be self virtualized.
> If it's not, it's a design defect. This is not the place to discuss the design choice
> of a specific CPU vendor, if you are really interested in this, you can go back in
> the archive to figure out why AMD nesting is done much earlier than Intel.

In the Google link you posted, I read "VMs powered by AMD processors are not supported".
I wish they had been able to utilize it.

> 
> >
> > > >
> > > > I want to repeat and emphasize that I am not ignoring the nested case.
> > > >
> > > > An extension for nesting would be the VF presented to the guest
> > > > itself with
> > > SR-IOV capability can work as_is as proposed here.
> > >
> > > How can a VF have the SR-IOV capability?
> > >
> > One option is by trap + emulation.
> 
> Great.
> 
> > Second is having it actually on the VF, which will follow the true definition of
> nesting.
> 
> How is VF allowed to have SR-IOV capability by the spec?
>
To support nesting, PCI-SIG can extend it.
 
> >
> > > > Michael presented the idea of the dummy PF, which is to represent
> > > > the VF as
> > > dummy PF which can do the SR-IOV with one VF.
> > >
> > > Why do we need the complicated SR-IOV emulation at the nesting level?
> > You have to complicate one way or the other.
> 
> How? I've demonstrated that you won't end up with such complications if
> everything is self contained.
The primary problem with the self-contained approach is that it does not fit the requirements of passthrough.
How can we have a self-contained interface without mediation, when the device context and dirty pages are lost whenever a device reset/FLR occurs?
Also, the DMA occurs in the guest.
We need a facility like PML, where PML logs the pages at the VMM level; in the virtio case, they are logged to the owner PF.

> 
> > And here it does not look complicated because it uses all existing defined
> constructs available at VMM and GVM level.
> > It follows both the principles you listed in the paper, i.e. (a) efficiency and (b)
> equivalence property.
> 
> In order to achieve (b), you need to have many PFs and many levels which is an
> obvious unnecessary complication.
> 
This is what you wanted, in order to follow the paper.
It does not need many PFs; at L0 there is one PF and N VFs.
At L1, one VF is presented with an emulated config space that contains the SR-IOV capability.
This L1 VF allows creating new VFs, one of which is passed to L2.
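
As an illustration only, the topology described above can be modelled in a few lines. This is a sketch under stated assumptions: the class Fn, the helper build_nesting and the function names such as "L0-PF" are hypothetical and appear nowhere in the series. It shows one real PF at L0 with N VFs, one of those VFs presented to L1 with an emulated SR-IOV capability, and one VF created under it and handed to L2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fn:
    """Hypothetical model of a PCI function in the nesting hierarchy."""
    name: str
    real_pf: bool = False          # a physical function backed by hardware at L0
    emulated_sriov: bool = False   # SR-IOV capability emulated in its config space
    vfs: List["Fn"] = field(default_factory=list)


def build_nesting(n_l0_vfs: int) -> Fn:
    # L0: a single real PF with N VFs.
    pf = Fn("L0-PF", real_pf=True)
    pf.vfs = [Fn(f"L0-VF{i}") for i in range(n_l0_vfs)]

    # L1: one VF is given to the L1 guest with an emulated SR-IOV capability,
    # so inside the guest it can act as a PF.
    l1_pf = pf.vfs[0]
    l1_pf.emulated_sriov = True

    # The L1 guest creates a VF under it; that VF is passed to the L2 guest.
    l1_pf.vfs = [Fn("L1-VF0")]
    return pf


def count_real_pfs(fn: Fn) -> int:
    # Walk the hierarchy and count functions that are real PFs.
    return int(fn.real_pf) + sum(count_real_pfs(vf) for vf in fn.vfs)
```

Calling count_real_pfs(build_nesting(4)) yields one real PF regardless of how many VFs exist at L0, which is the point being argued: the extra "PF" seen by L1 is emulated, not an additional hardware PF.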

> >
> > > How can you make sure such a design can result in a live migration
> > > to be done at any levels?
> > >
> > I will propose design that is practical and has some use case.
> > I will not propose theoretical work that no one will implement.
> 
> Again, it's only a matter if you want to do everything in a passthrough mode,
> this is not to the methodology proven by [1]. It's not a matter if you stick to
> trapping.
>
I didn’t understand, but I don’t see a point of discussing passthrough vs non_passthrough.

 
> >
> > > E.g in LN, you had a PF and a VF. How to migrate the PF to this level?
> > > You want two PFs in the L(N-1) level?
> > >
> > Likely yes as dummy PF with emulated caps.
> 
> Ok, so you will have N PFs in L0 which is unrealistic. Not only because of the
> limitation of the resources but also because there's no way for the hypervisor to
> know how many levels of nesting are being used.
>
Only one PF in L0, and an emulated PF in L1, similar to how the rest of the platform ecosystem components do it.
When the whole platform commits to N-level nesting, it makes sense for virtio to align.
For example, CPU vendors would commit to N-level nested page table traversal on PCI reads/writes, posted interrupts at N levels, and PML logging at N levels.
At that point, virtio for N-level nesting makes sense.

> >
> > > > You need the support from the platform too, I guess TC can extend it.
> > > > May be a different interface more suitable for nested case which
> > > > do not have
> > > performance needs.
> > >
> > > I disagree, it's about if the performance can satisfy the requirement at N
> level.
> > >
> > > >
> > > > How about a nested user to have AQ located on the VF so that
> > > > mediation sw
> > > can operate admin commands over self?
> > >
> > > I would go with such complicated architecture.
> > >
> > You likely meant you wouldn't, right?
> 
> Right.
> 
> >
> > Also, following your paper which clearly highlights, "execution of privileged
> instruction in vm occurs, which would have effect of changing machine
> resources".
> > In the passthrough case it is not the privileged instruction because the
> resource is not composed by the the machine, it is already done by the device".
> 
> How do you know that? With save/load of a device state, you can
> schedule/share a VF among multiple VMs. Then you still want to pass through
> everything? 
You cannot share a VF among multiple VMs, as each VM has its own memory boundary, isolated by the IOMMU and MMU.
Incoming PCI requests of a specific RID cannot be split between two different guest VMs.

> Let's just not invent a mechanism that can only work for a very
> limited use case.
> 
The use case you are describing as limited is a common one for passthrough users.

> > Hence for such cvq operation trap is not to be done for member virtio device.
> >
> > It would make sense to trap cvq for non virtio device, where cvq is composed
> as part of the machine resource.
> >
> > > > Device mode commands will not be applicable there, instead some
> > > > other
> > > things to be done.
> > > > So non passthrough mode software possibly can make use of it?
> > >
> > > It would be a great burden if you
> > >
> > > 1) use passthrough in L0
> > > 2) use trap/emulation in L(N+1)
> > >
> > How is this different than Intel PML hw?
> 
> Let me clarify my points, I meant.
> 
> You can't simply use pass through in order to live migrate at any level. So what
> you can do is:
> 
> 1) using passthrough to VF in L0
> 2) using trap/emulation for PF/VF in L1 and LN
> 
> Isn't this much more complicated than simply having a self contained device for
> VF, then you don't need the composition of PF in any level.
> No?
>
The problem with the self-contained approach is that it cannot do even #1.
 
> >
> > > >
> > > > > That is to say for any CPU/hypervisor vendors, the architecture
> > > > > should be designed to run any levels of nesting instead of just
> > > > > an awkward 2 levels (but what you proposed can not work for even 2).
> > > > Huh, treating some missing text for a corner case as grounds to claim
> > > > _not_working is not a healthy discussion.
> > > >
> > > > > For x86 and KVM, any level of
> > > > > nesting has been done for about 10 years ago.
> > > > >
> > > > I didn’t find hw for PML support in x86 for N or 3 level nesting. Did I miss?
> > > > I didn’t find hw for nested page tables upto N level walking on
> > > > the PCIe
> > > read/writes in any cpu. Did I miss?
> > >
> > > You need first asking why it is a must to achieve nested
> > > virtualization. All of those obstacles come only if you want to use
> "passthrough" for any levels.
> > >
> > > > Have you seen nesting in hw works at N level?
> > >
> > > Again, hardware can't have endless resources for endless levels.
> > Can you please list two or 3 hw features that are in hw, for > 2 levels?
> 
> Why do I need to do this? What I'm saying is that hardware doesn't need to be
> designed for N levels. What it needs to make sure to satisfy the requirement
> proved by [1].
>
You need it because you want to follow the 3 principles listed in the paper, i.e. efficiency, equivalency and resource control.
 
> >
> > > Trap and
> > > emulation is a must for achieving nesting virtualization. If you try
> > > to invent a passthrough method that can work for any level, you will
> > > probably fail
> >
> > It at least follows the design principle of the paper you suggested.
> 
> I don't see it this way, see the above reply. The paper is for trap and emulation
> for sure but you propose to pass through everything.
> 
> > I don’t see a point of designing something for N level nesting in first go when
> rest eco system is not there to support it at hw level.
> 
> Your design complicates the nesting a lot. We have hands-on methodology
> which has been well studied since the 1970s where you refuse to start with.
> Then you may end up with a lot of issues.
> 
I don’t think so. When the hw ecosystem is built for nesting, it makes sense for virtio to do nesting acceleration.
Otherwise, the method used for nesting elsewhere is enough for virtio.

> What's more, your design is incomplete as it can't be used for migrating:
> 
> 1) owner
Owner migration is not a requirement. That is just silly.
If one wants to migrate the owner, an admin virtio device can be present outside of the owner to migrate it.

> 2) virtio devices that doesn't structure as owner/member
> 
As of spec 1.3, they are structured for the PCI SR-IOV group type.
The MMIO transport is just missing out on the advancement happening on the PCI transport.
If there is user interest, someone will do it for MMIO too.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  0:52                                 ` Jason Wang
  2023-10-18  5:28                                   ` Parav Pandit
@ 2023-10-18  6:13                                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18  6:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 08:52:51AM +0800, Jason Wang wrote:
> What's more, your design is incomplete as it can't be used for migrating:
> 
> 1) owner
> 2) virtio devices that doesn't structure as owner/member

Fundamentally, the alternative proposed seems to be using PASID for
sub-function partitioning. All well and good and forward looking but
there are vendors who want to actually ship hardware right now and that
means at least not ignoring SRIOV.


-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  4:30                                 ` Parav Pandit
@ 2023-10-18  6:14                                   ` Michael S. Tsirkin
  2023-10-18  6:26                                     ` Parav Pandit
  2023-10-19  2:41                                   ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18  6:14 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 04:30:39AM +0000, Parav Pandit wrote:
> You should have objected the IMS patches in Linux kernel, you should also object new MSI-X proposal and say just use registers.

What does "new MSI-X proposal" refer to?


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  5:02                               ` Parav Pandit
@ 2023-10-18  6:20                                 ` Michael S. Tsirkin
  2023-10-18  6:28                                   ` Parav Pandit
  2023-10-18  6:35                                 ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18  6:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 05:02:25AM +0000, Parav Pandit wrote:
> > Please just show me your patch resolving these opens, how about start from
> > defining virtio-fs device context and your management VF?
> As answered, device context infrastructure is done, per device specific device-context will be defined incrementally.
> I will not be including virtio-fs in this series. It will be done
> incrementally in future utilizing the infrastructure build in this
> series.

virtio fs has a lot of context :) I think we do need some way to
document which devices do and which don't support migration.  Maybe for
each device type where it's not defined yet, we should add text along
the lines of "Device context TBD. Devices of this type SHOULD NOT
support VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET and
VIRTIO_ADMIN_CMD_DEV_CTX_READ commands".
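
A rough sketch of how that rule could look on the device side follows. It is illustrative only: the set CONTEXT_DEFINED and the helper supported_commands are hypothetical, and which device types actually have a device context defined is a placeholder assumption here, not a statement about the series. Only the two command names come from the text above.

```python
# The two device context commands named in the suggested spec text.
DEV_CTX_COMMANDS = frozenset({
    "VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET",
    "VIRTIO_ADMIN_CMD_DEV_CTX_READ",
})

# Assumption: placeholder list of device types whose device context is defined.
CONTEXT_DEFINED = frozenset({"net", "blk"})


def supported_commands(device_type, all_commands):
    """Return the admin commands a device of this type should advertise.

    Device types whose context is still TBD drop the device context commands
    and keep everything else.
    """
    commands = set(all_commands)
    if device_type in CONTEXT_DEFINED:
        return commands
    return commands - DEV_CTX_COMMANDS
```

With this rule, a driver can tell from the advertised command list alone that a device type such as virtio-fs is not yet migratable.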


-- 
MST


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  6:14                                   ` Michael S. Tsirkin
@ 2023-10-18  6:26                                     ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  6:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 18, 2023 11:45 AM
> 
> On Wed, Oct 18, 2023 at 04:30:39AM +0000, Parav Pandit wrote:
> > You should have objected the IMS patches in Linux kernel, you should also
> object new MSI-X proposal and say just use registers.
> 
> What does "new MSI-X proposal" refer to?
It is hard to discuss in this forum due to a non-disclosure agreement in another consortium.
But we can expect something like IMS, which is in the GPL-licensed Linux kernel and which promotes staying away from on-chip MSI-X registers.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  6:20                                 ` Michael S. Tsirkin
@ 2023-10-18  6:28                                   ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  6:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 18, 2023 11:50 AM
> 
> On Wed, Oct 18, 2023 at 05:02:25AM +0000, Parav Pandit wrote:
> > > Please just show me your patch resolving these opens, how about
> > > start from defining virtio-fs device context and your management VF?
> > As answered, device context infrastructure is done, per device specific device-
> context will be defined incrementally.
> > I will not be including virtio-fs in this series. It will be done
> > incrementally in future utilizing the infrastructure build in this
> > series.
> 
> virtio fs has a lot of context :) I think we do need some way to document which
> devices do and which don't support migration.  Maybe for each device type
> where it's not defined yet, we should add text along the lines of "Device context
> TBD. Devices of this type SHOULD NOT support
> VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET and
> VIRTIO_ADMIN_CMD_DEV_CTX_READ commands".
It is really hard to conclude that device Foo will not be able to migrate when the device migration work began just two months ago.
So I won't go overboard to stop them now.

But the requirement you listed for the commands is covered in the device requirements section in patch_5.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  5:00                             ` Parav Pandit
@ 2023-10-18  6:32                               ` Zhu, Lingshan
  2023-10-18  6:34                                 ` Parav Pandit
  2023-10-18  6:39                                 ` Zhu, Lingshan
  0 siblings, 2 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-18  6:32 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas




On 10/18/2023 1:00 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan<lingshan.zhu@intel.com>
>> Sent: Monday, October 16, 2023 3:14 PM
>>
>> On 10/13/2023 7:28 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan<lingshan.zhu@intel.com>
>>>> Sent: Friday, October 13, 2023 2:36 PM
>>> [..]
>>>>> Because it does not work for passthrough mode.
>>>> what are you talking about?
>>>> Config space does not work passthrough?
>>> Once the register space of the VF that is supposed to be used by the live
>> migration is passed to the guest, it is under guest control.
>>> Hence, live migration driver won't be able to use it.
>> Does guest control device status to reset itself? harmful?
> No. it is not harmful.
> Is owner device resetting itself, harmful? No.
> Is member device resetting itself, harmful? No.
> Should member device reset
good
>> These facilities can be trapped and emulated, even the feature bits, right?
>> You know the guest actually don't direct access the device config space, there is
>> a vfio/vdpa driver, right?
> You can practically trap and emulate everything.
> If you continue to ignore passthrough requirements and keep repeating "do trap and emulate", this discussion will not go anywhere.
Clearly I did not ignore passthrough; I have answered your question many
times.
Maybe you didn't get it, so I would ask: how do you define your "passthrough"?

You may find that the guest vCPUs (guest vRC) are not actually privileged
to access the host PCI device (host CPU RC); that is why a pass-through
driver like vfio is a must.

Therefore the device config space can be trapped. Is that clear now?
>
>
>>>> Have you ever tried pass through a virtio device to a guest?
>>> :)
>>> Please explain how the question is relevant to this discussion in separate
>> thread, so that one can keep technical focus.
>>> (Please keep your discussion technical, instead of derogatory to other
>> members).
>> if you want me to answer your question, at least you SHOULD NOT cut off the
>> context, or you are trying to confuse everyone.
>> Or did you try to avoid or hide anything? I am not sure this is a good practice.
>>
>> The context in last discussion is:
>>
>> me: OK, I pop-ed Jason's proposal to make everything easier, and I see it is
>> refused.
>> you: Because it does not work for passthrough mode.
>> me: what are you talking about?
>>       Config space does not work passthrough?
>>       Have you ever tried pass through a virtio device to a guest?
>>
>> So I ask you try to pass through a virito-pci device to a guest, then check
>> whether the config space work for pass-through mode.
>>
>> again, don't cut off threads before the discussion is closed.
>>>> Let me repeat again, these live migration facilities are
>>>> per-device(per-VF) facility, so it only migrates itself.
>>>>
>>> Since they are per device (per VF), they reside in the guest VM. Hence, VMM
>> cannot live migrate it.
>> you know the config space can be trapped and emulated, and the hypervisor
>> takes the ownership of the device once the guest freeze in the stop window.
> When you say config space, do you mean PCI config space of 4K size?
you can take the virtio common config cap as an example.
>
>>>> And for pass through, you can try passthrough a virito device to a
>>>> guest, see how the guest initialize the device through the config space.
>>>>
>>>> That is really basic virtualization, not hard to test.
>>> Repeated points, I am omitting.
>> ok, if you get it, let's close it.
>>>>>>>> inflight descriptor tracking will be implemented by Eugenio in V2.
>>>>>>> When we have near complete proposal from two device vendors, you
>>>>>>> want to push something to unknown future without reviewing the
>>>>>>> work; does not
>>>>>> make sense.
>>>>>> Didn't I ever provide feedback to you? Really?
>>>>> No. I didn’t see why you need to post a new patch for dirty page
>>>>> tracking,
>>>> when it is already present in this series.
>>> This is plain ignorance and shows non_cooperative mode of working in
>> technical committee.
>> you have cut off the tread again, so I can't read the context.
> Enjoy long threads. 😊
You can skip finished discussions for sure, but don't do that to ongoing
discussions.
>>>>> I would like to understand and review this aspects.
>>>>> Same for the device context.
>>>> you will see dirty page tracking in my V2, as I repeated for many times.
>>> Since you are not co-operative, I have less sympathy to see V2.
>>> I don’t see a reason to see when, it is fully presented here.
>> Again, please don't take it personal and please be professional.
>>
>> Speaking of collaboration, please at least respect others' time and answers.
>> Both Jason and I have responded to you multiple times on the same
>> questions(for example, FLR, nested, passthrough).
>> If our answers are ignored again and again, and then after a few days or hours
>> you come back asking the same question again, what's the point?
>>
> I didn’t ask questions in area of FLR and passthrough, please check again.
OK, then please don't force us to answer the same questions anymore, for
example no more FLR questions.
>
>> And please don't cut off any threads before we close the discussion.
>>>> For device context, we have discussed this in other threads, did you
>>>> ignored that again?
>>> No. I didn’t. I replied that the generic infrastructure is built the enables every
>> device type to migrate by defining their device context.
>> don't we have a conclusion there or did you miss anything? Since you refuse to
>> define device context for every device type, how do you migrate stateful
>> devices?
>>
>> So we should implement a stateless live migration solution, right?
> No. device context is basic facility that intent to cover most virtio devices.
Not "most"; it should be "all" if you are implementing virtio
live migration.
> I didn’t not refuse to define context.
> I said, device context will be incrementally defined subsequently.
just define what we have now, for example, define virtio-fs as it is now.
> Like Michael said, I expect every device to define device context section in coming months for 1.4 time frame.
As MST said, do you expect implementations to figure out the device
context by themselves?
If you want to migrate device context, you should define it.
>
>>>> Hint: how do you define device context for every device type, e.g, virtio-fs.
>>>> Don't say you only migrate virito-net or blk.
>>> I didn’t say it. I said to migrate all 30+ device types.
>>> And infrastructure is presented here.
>> so please define device context for all the devices.
>> how about starting from virtio-fs?
> Should be done incrementally.
show me your patch
>
>>>>>>> You are still in the mode of _take_ what we did with near zero
>> explanation.
>>>>>>> You asked question of why passthrough proposal cannot advantage of
>>>>>>> in_band
>>>>>> config registers.
>>>>>>> I explained technical reason listed here.
>>>>>> I have answered the questions, and asked questions for many times.
>>>>>> What do you mean by "why passthrough proposal cannot advantage of
>>>>>> in_band config registers."?
>>>>>> Config space work for passthrough for sure.
>>>>> Config space registers are passthrough the guest VM.
>>>>> Hence hypervisor messing it with, programming some address would
>>>>> result in
>>>> either security issue.
>>>>> Or functionally broken, to sustain the functionality, each nested
>>>>> layer needs
>>>> one copy of these registers for each nest level.
>>>>> So they must be trapped somehow.
>>>> trap and emulated are basic virtualization.
>>> Not for passthrough devices, sorry.
>>> See the paper that Jason pointed out.
>>> Control program/vmm is trap is involved only on the privileged operation of
>> the VMM.
>>> Virtio cvqs, virtio registers are not the privileged operation of the VMM,
>> because they are of the native virtio device itself.
>>> Period.
>> since the context is cut of again, I failed to read the context.
>>
>> But config space can be trapped and emulated, right?
> Answered above.
>
>> When guest accessing device config space, actually it access the hypervisor-
>> presented config space.
>>>>> Secondly I don’t see how one can read 1M flows using config registers.
>>>> Not sure what you are talking about, beyond the spec?
>>> The spec which is under works for few months by multiple technical
>> members.
>>> Please subscribe to virtio-comment mailing list.
>>> How come you changed your point from cvq to different argument of out
>>> of spec? :)
>> I mean, what is your 1M flows? is it beyond spec?
> No. it is not beyond the spec.
> It is the spec in work for several months by multiple device, OS and cloud operators.
then, again, what are your 1M flows? If they are not defined in the spec, then
they are beyond the spec.
>
>>>>>>> So please don’t jump to conclusions before finishing the
>>>>>>> discussion on how
>>>>>> both side can take advantage of each other.
>>>>>>> Lets please do that.
>>>>>> We have proposed a solution, right?
>>>>>>
>>>>> Which one? To do something in future?
>>>>> I don’t see a suggestion on how one can use device context and dirty
>>>>> page
>>>> tracking for nested and passthrough uniformly.
>>>>> I see a technical difficulty in making both work with uniform interface.
>>>> Please don't ignore previous answers, don't force us repeat again and again.
>>>>
>>> You didn’t answer, how.
>>> Your answer was "you will post dirty page tracking without reviewing current"
>> and Eugenio will post v2....
>> Yes, will do. and you can check the patch when it posted.
>>
> Does not make sense to me at all.
tracking dirty pages does not make sense to you?
>   
>> Eugenio will cook a patch for in-flight descriptors, not dirty page, that is mine.
>>>> It is Jason's proposal. Please refer to previous threads, also for
>>>> device context and dirty pages.
>>>>>> I still need to point out: admin vq LM does not work, one example is
>> nested.
>>>>> As Michael said, please don’t confuse between admin commands and
>>>>> admin
>>>> vq.
>>>> anyway, admin vq live migration don't work for nested.
>>> I am convicned with the paper that Jason pointed out.
>>>
>>> A nested solution involves a member device supporting the nesting without
>> trap and emulation so that it follows the two properties:
>>> The efficiency property and equivalence property.
>>>
>>> Hence a member device which wants to support nested case, should present
>> itself with attributes to support nesting.
>> failed to process the sentence, but I am glad you are convinced by the paper.
>>>
>>>>>>>> There are no scale problem as I repeated for many time, they are
>>>>>>>> per-device basic facilities, just migrate the VF by its own
>>>>>>>> facility, so there are no 40000 member devices, this is not per PF.
>>>>>>>>
>>>>>>> I explained that device reset, flr etc flow cannot work when
>>>>>>> controlling and
>>>>>> controlled functions are single entity for passthrough mode.
>>>>>>> The scale problem is, one needs to duplicate the registers on each VF.
>>>>>>> The industry is moving away from the register interface in many
>>>>>>> _real_ hw
>>>>>> devices implementation.
>>>>>>> Some of the examples are IMS, SIOV, NVMe and more.
>>>>>> we have discussed this for many times, please refer to previous
>>>>>> threads, even with Jason.
>>>>> I do not agree for any registers to add to the VF which are reset on
>>>> device_reset and FLR.
>>>>> As it does not work for passthrough mode.
>>>> Jason has answered your these FLR questions for many times, I don't
>>>> want to repeat his words, even myself have answered many times. If
>>>> you keep ignoring the answers, and ask again and again, what is the point?
>>>>
>>>> So please refer to the previous threads.
>>> I don’t think I asked the question above. Please re-read.
>> you cut if off again, what question? if about FLR, I believe Jason has answered
>> for many times.
> Again, please read. I didn’t ask the question for FLR.
> You keep saying "what question".
I failed to read the context because you have cut them off.
>
>>>>>>>> The device context can be read from config space or trapped, like
>>>>>>>> shadow
>>>>>>> There are 1 million flows of the net device flow filters in progress.
>>>>>>> Each flow is 64B in size.
>>>>>>> Total size is 64MB.
>>>>>>> I don’t see how one can read such amount of memory using config
>>>> registers.
>>>>>> control vq?
>>>>> The control vq and flow filter vqs are owned by the guest driver,
>>>>> not the
>>>> hypervisor.
>>>>> So no, cvq cannot be used.
>>>> first, don't cut off the threads, don't delete words, that really confusing
>> readers.
>>> Your comments are so long that it is hard to follow such a long thread.
>>> Hence only the related comments are kept.
>>> But I understand, will try to avoid.
>>>
>>>> And I think you misunderstand a lot of virtualization fundamentals,
>>>> at least have a look at how shadow control vq works.
>>>>
>>> In case if you don’t know, the shadow cvq acceleration for Nvidia ConnectX6-
>> DX is done jointly with Dragos and me, with recent patches from Sie-Wei.
>>> I don’t think so I missed.
>>>
>>> Shadow vq is great when you don’t have underlying support from the device.
>>>
>>> When you have passthrough member devices, they are not trapped or
>> emulated.
>>> The future hypervisor must not be able to see things of cvq, datavq or
>> addressed programmed by the guest.
>>> And hence the infrastructure is geared towards such approach.
>> I failed to read the full context as you cut off them. I can't even read your
>> original questions, they are truncated.
>>
>> Anyway, lets migrate device without device-context first.
> Passthrough device cannot migrate without device-context as listed.
so please define the device context.
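
For reference, the flow filter figure quoted above (1 million flows at 64 bytes each) can be checked with a small sketch. The 4-byte register access width is an assumption for illustration only; nothing in this thread fixes it.

```python
# Back-of-the-envelope check of the flow filter numbers quoted in this thread.
FLOWS = 1_000_000        # "1 million flows" (a decimal million is assumed)
BYTES_PER_FLOW = 64      # "Each flow is 64B in size"
REG_WIDTH = 4            # assumed: one 32-bit register access per read

total_bytes = FLOWS * BYTES_PER_FLOW
register_reads = total_bytes // REG_WIDTH

print(f"total: {total_bytes} bytes ({total_bytes / 10**6:.0f} MB)")
print(f"32-bit reads needed: {register_reads}")
```

Under these assumptions the context is 64 MB, and reading it through a single 32-bit register would take 16 million accesses.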
>
>>>> And the parameters set to config vq are also device context as we
>>>> discussed for many times.
>>>>>> Or do you want to migrate non-virtio context?
>>>>> Every thing is virtio device context.
>>>> see above
>>>>>>>> control vq which is already done, that is basic virtualization.
>>>>>>> There is nothing like "basic virtualization".
>>>>>>> What is proposed here is fulfilling the requirement of passthrough mode.
>>>>>>>
>>>>>>> Your comment is implying, "I don’t care for passthrough
>>>>>>> requirements, do
>>>>>> non_passthrough".
>>>>>> that is your understanding, and you misunderstood it. Config space
>>>>>> servers passthrough for many years.
>>>>> "Config space servers" ?
>>>>> I do not understand it, can you please explain what does that mean?
>>>>>
>>>>> I do not see your suggestion on how one can implement passthrough
>>>>> member
>>>> device when passthrough device does the dma and migration framework
>>>> also need to do the dma.
>>>> Try pass through a virtio device to a guest and learn how the guest
>>>> take advantage the config space before you comment.
>>> Right. It does not work. The guest is doing the device_reset and flr.
>>> Hence, it is resetting everything. All the dirty page log is lost.
>>> All the device context is lost.
>>> Hypervisor didn’t see any of this happening, because it didn’t do the trap.
>>>
>>> Look, if you are going to continue to argue that you must do trap +
>>> emulation and don’t talk about passthrough, Please stop here, because
>> discussion won't go anywhere.
>>> I made my best to answer the limitations in very first email where you asked.
>> OK, I see the gap, and I am sure we can help you here.
>> Try consider a question:
>> how do you define pass-through?
> As defined in the cover letter and theory of operation.
> Repeat here:
> A device whose virtio interfaces are not intercepted by VMM.
> In future, may be even MSI-X and MSI-X_v2 or newer interrupt method will be passthrough at device level too.
> (only cpu level interrupt remapping will be hypercall at interrupt controller level).
>
> A PCI spec defined config space to stay as emulated as it is generic and not supposed to have any virtio specific things in it as directed by the PCI-SIG.
how guest access device config space in your "passthrough"?
>
>> Can a guest access the device without a host driver helper?
> Yes for all the virtio interfaces which includes, virtio device common and device config space, cvq, data vq, flow filter vqs, shared memory and anything new of the future.
interesting, if so, let me ask you a question, Is a guest privileged to 
access any devices on the host?
>
>>>>> That basic facility is missing dirty page tracking, P2P support,
>>>>> device context,
>>>> FLR, device reset support.
>>>>> Hence, it is unusable right now for passthough member device.
>>>>> And 6th problemetic thing in it is, it does not scale with member devices.
>>>> Please refer to previous discussions, it is meaningless if you keep
>>>> ignoring our answers and keep asking the same questions.
>>> Again, please re-read, I didn’t ask the question.
>>> I replied 6 problems that are not solved.
>> I believe we have answered for many times. The questions are cut off again, but
>> how about search for previous answers?
>>>>>>>> If you want to migrate device context, you need to specify device
>>>>>>>> context for every type of device, net maybe easy, how do you see virtio-
>> fs?
>>>>>>> Virtio-fs will have its on device context too.
>>>>>>> Every device has some sort of backend in varied degree.
>>>>>>> Net being widely used and moderate complex device.
>>>>>>> Fs being slightly stateful but less complex than net, as it has
>>>>>>> far less control
>>>>>> operations.
>>>>>> so, do you say you have implement a live migration solution which
>>>>>> can migrate device context, but only work for net or block?
>>>>> I don’t think this question about implementation has any relevance.
>>>>> Frankly feels like a court to me. :( No. I dint say that.
>>>>> We have implemented net, fs, block devices and single framework
>>>>> proposed
>>>> here can support all 3 and rest 28+.
>>>>> The device context part in this series do not cover special/optional
>>>>> things of
>>>> all the device type.
>>>>> This is something I promised to do gradually, once the framework looks
>> good.
>>>> If you don't define them, only talking about "migrate the device
>>>> context" but don't tell us what do migrate, does this make sense to anybody?
>>>>>> Then you should call it virtio net/blk migration and implement in
>>>>>> net/block section.
>>>>> No. you misunderstood. My point was showing orthogonal complexities
>>>>> of net
>>>> vs fs.
>>>>> I likely failed to explain that.
>>>> see above, anyway you need to define them, how about starting form virito
>> FS?
>>>>>>> In fact virtio-fs device already discusses the migrating the
>>>>>>> device side state, as
>>>>>> listed in device context.
>>>>>>> So virtio-fs device will have its own device-context defined.
>>>>>> if you want to migrate it, you need to define it
>>>>> Sure.
>>>>> Only device specific things to be defined in future.
>>>> Now, not future if you want to migrate device context.
>>> It is not mandatory, and it is impractical do everything in one series.
>>> It is planned for 1.4.
>> really, you want to define device context for every device time?
>>
> Yes.
>   
>> Remember don't migrate device-context before you define them or how can
>> the HW implementions know how to do.
> I disagree. The infrastructure is defined. And incrementally device context will also be defined.
> See an example work from Michael, i.e. admin command and aq generic facility is defined.
> And device migration is able to utilize it incrementally. The lower layer fulfill the requirements.
> This is exactly what is done here.
>
> Device context framework is defined and many device spec owners will be easily define their device context making it migratable.
see above answers for device context.
>
>>>>> Rest is already present.
>>>>> We are not going to define all the device context in one patch
>>>>> series that no
>>>> one can review reliably.
>>>>> It will be done incrementally.
>>>> so you agree at least for now we should migrate stateless devices, right?
>>>>> But the feedback, I am taking is, we need to add a command that
>>>>> indicates
>>>> which TLVs are supported in the device migration.
>>>>> So virtio-fs or other device migration capabilities can be discovered.
>>>>> I will cover this in v2.
>>>> so you propose a solution as "virtio migration", but only migrate
>>>> selective types of devices?
>>>> You should rename it to be "virtio-net live migration".
>>> Sorry, I wont. Because infrastructure is for majority device types.
>>>
>>> Which field did you observe which is net specific?
>>> We want to cover all the device types.
>>> Don’t need to cook their context in one series.
>> so, not work for all device types? limited to some specific types?
>> you still need to rename it what ever.
> No. framework works for all device types.
without defining them?
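
As a minimal sketch of what a TLV-based device context could look like to a consumer: the layout below (little-endian 16-bit type, 32-bit length, then value) and the type numbers are assumptions for illustration, not the encoding defined in the series. It shows how a reader can walk the fields and skip types it does not know, which is one way device-specific fields could be added incrementally.

```python
import struct

HDR = struct.Struct("<HI")  # assumed header: u16 type, u32 length (little-endian)

def pack_field(ftype, value):
    """Encode one field as header + value."""
    return HDR.pack(ftype, len(value)) + value

def walk_context(buf):
    """Yield (type, value) pairs from a TLV-encoded buffer."""
    off = 0
    while off < len(buf):
        if off + HDR.size > len(buf):
            raise ValueError("truncated TLV header")
        ftype, length = HDR.unpack_from(buf, off)
        off += HDR.size
        if off + length > len(buf):
            raise ValueError("truncated TLV value")
        yield ftype, buf[off:off + length]
        off += length

KNOWN = {1: "device_status", 2: "features"}  # hypothetical type numbers

ctx = pack_field(1, b"\x0f") + pack_field(99, b"opaque") + pack_field(2, bytes(8))
parsed = [(KNOWN.get(t, "unknown"), len(v)) for t, v in walk_context(ctx)]
print(parsed)
```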
>
>>>>> Thanks a lot for this thoughts.
>>>>>
>>>>>>> The infrastructure and basic facilities are setup in this series,
>>>>>>> that one can
>>>>>> easily extend for all the current and new device types.
>>>>>> really? how?
>>>>>>>> And we are migrating stateless devices, or no? How do you migrate
>>>>>>>> virtio-
>>>> fs?
>>>>>>>>> 2. sharing such large context and write addresses in parallel
>>>>>>>>> for multiple devices cannot be done using single register file
>>>>>>>> see above
>>>>>>>>> 3. These registers cannot be residing in the VF because VF can
>>>>>>>>> undergo FLR, and device reset which must clear these registers
>>>>>>>> do you mean you want to audit all PCI features? When FLR, the
>>>>>>>> device is rested, do you expect a device remember anything after FLR?
>>>>>>> Not at all. VF member device will not remember anything after FLR.
>>>>>>>> Do you want to trap FLR? Why?
>>>>>>> This proposal does _not_ want to trap the FLR in the hypervisor virtio
>> driver.
>>>>>>> When one does the mediation-based design, it must
>>>>>>> trap/emulate/fake the
>>>>>> FLR.
>>>>>>> It helps to address the case of nested as you mentioned.
>>>>>> once passthrough, the guest driver can access the config space to
>>>>>> reset the device, right?
>>>>>>>> Why FLR block or conflict with live migration?
>>>>>>> It does not block or conflict.
>>>>>> OK, cool, so let's make this a conclusion
>>>>>>> The whole point is, when you put live migration functionality on
>>>>>>> the VF itself,
>>>>>> you just cannot FLR this device.
>>>>>>> One must trap the FLR and do fake FLR and build the whole
>>>>>>> infrastructure to
>>>>>> not FLR The device.
>>>>>>> Above is not passthrough device.
>>>>>> No, the guest can reset the device, even causing a failed live migration.
>>>>> Not in the proposal here.
>>>>> Can you please prove how in the current v1 proposal, device reset
>>>>> will fail the
>>>> migration?
>>>>> I would like to fix it.
>>>> if the device is reset, it forgets everything right?
>>> Right. This is why all dirty page track; device context is lost on device reset.
>>> Hence, the controlling function and controlled function are two different
>> entities.
>> so there can be inconsistent migrations and races, right? And if the guest reset
>> the device, actually the hypervisor should let it be, right?
> No. it should not be in because hypervisor has not composed the member device. It is in the hw controlled function itself.
Interesting, do you mean that when the guest resets the device, the hypervisor
should refuse?

This actually conflicts with your own definition of "passthrough" as
"not intercepted by VMM". So you actually do understand trap-and-emulate and
passthrough.
>
>>>>>>>>> 4. When VF does the DMA, all dma occurs in the guest address
>>>>>>>>> space, not in
>>>>>>>> hypervisor space; any flr and device reset must stop such dma.
>>>>>>>>> And device reset and flr are controlled by the guest (not
>>>>>>>>> mediated by
>>>>>>>> hypervisor).
>>>>>>>> if the guest reset the device, it is totally reasonable
>>>>>>>> operation, and the guest own the risk, right?
>>>>>>> Sure, but the guest still expects its dirty pages and device
>>>>>>> context to be
>>>>>> migrated across device_reset.
>>>>>>> Device_reset will lose all this information within the device if
>>>>>>> done without
>>>>>> mediation and special care.
>>>>>> No, if the guest reset a device, that means the device should be
>>>>>> RESET, to forget its config, that would be really wired to migrate
>>>>>> a fresh device at the source side, to be a running device at the
>>>>>> destination
>>>> side.
>>>>> Device reset not doing the role of reset is just a plain broken spec.
>>>> why? The reset behavior is well defined in the spec, and works fine for years.
>>> So any new construct that one adds, it will be reset as well and dirty page
>> track is lost.
>> Yes and do you want to prevent that? You may surprise the guest.
> Yes, want to prevent that.
> Not sure what you mean by surprise the guest. Unlikely.
> Why because guest did the reset, it knows what it is doing.
> (Keep in mind that guest does not expect to lose its dirty pages).
Shocked...

This statement conflicts with basic virtualization.
>
>>>>>>> So, to avoid that now one needs to have fake device reset too and
>>>>>>> build that
>>>>>> infrastructure to not reset.
>>>>>>> The passthrough proposal fundamental concept is:
>>>>>>>
>>>>>>> all the native virtio functionalities are between guest driver and
>>>>>>> the actual
>>>>>> device.
>>>>>> see above.
>>>>>>>> and still, do you want to audit every PCI features? at least you
>>>>>>>> didn't do that in your series.
>>>>>>> Can you please list which PCI features audit you are talking about?
>>>>>> you audit FLR, then do you want to check everyone?
>>>>>> If no, how to decide which one should be audited, why others not?
>>>>> I really find it hard to follow your question.
>>>>>
>>>>> I explained in patch 5 and 8 about interactions with the FLR and its support.
>>>>> Not sure what you want me to check.
>>>>>
>>>>> You mentioned that "I didn’t audit every PCI features"? So can you
>>>>> please list
>>>> which one and in relation to which admin commands?
>>>> Your job to audit everyone if you talk about FLR. Because FLR is PCI
>>>> spec, not virtio, you need to explain why other PCI features not need to be
>> audited.
>>> Sure, but when you point figure as I didn’t audit, please mention what is not
>> audited.
>> well, we are migrating virtio devices, but you keep talking PCI, so do you want to
>> take every PCI functionalities into considerations>
> For pci transport, yes.
First, that is outside the virtio spec.
Second, if so, you should audit every PCI feature and state.
Don't say you want me to define them; this is your statement.
>
>>>> We have explained why FLR is not a concern for many times, and I
>>>> don't want to repeat, please refer to previous discussions.
>>> You seem to ignore the first paragraph of theory of operation that FLR is not
>> trapped.
>> this is the guest issue FLR, right? If so the guest owns the risks and the
>> hypervisor should not prevent that.
> Exactly, hypervisor do not prevent it.
> The owner device still has the ownership to not lose previously logged dirty pages addresses.
> And device still need to report device reset occurred, so that destination side can wipe off and start fresh.
OK, so you know the answer now. This answers your own question above.
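
The split described in the quoted text (the guest resets the member device, while the owner device keeps the dirty page log and records that a reset occurred) can be modelled with a toy sketch. All class and method names here are hypothetical; this is not an API from the series.

```python
class OwnerDevice:
    """Toy owner (e.g. PF) that tracks migration state for member devices."""
    def __init__(self):
        self.dirty_log = {}     # member id -> set of dirty page addresses
        self.reset_seen = {}    # member id -> True once a reset was observed

    def log_dirty(self, member_id, page_addr):
        self.dirty_log.setdefault(member_id, set()).add(page_addr)

    def note_reset(self, member_id):
        self.reset_seen[member_id] = True


class MemberDevice:
    """Toy member (e.g. VF) passed through to the guest."""
    def __init__(self, member_id, owner):
        self.member_id = member_id
        self.owner = owner
        self.config = {}

    def dma_write(self, page_addr):
        # Dirty tracking is recorded on the owner side, not in the member.
        self.owner.log_dirty(self.member_id, page_addr)

    def reset(self):
        # Guest-initiated reset: the member forgets its own state...
        self.config.clear()
        # ...but the owner keeps the log and records the reset event.
        self.owner.note_reset(self.member_id)


owner = OwnerDevice()
vf = MemberDevice(7, owner)
vf.config["status"] = 0x0F
vf.dma_write(0x1000)
vf.dma_write(0x2000)
vf.reset()
print(sorted(owner.dirty_log[7]), owner.reset_seen[7], vf.config)
```

In this model the reset wipes the member's own state, yet the pages logged before the reset remain available on the owner side.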
>
>>>>>>> Keep in mind, that will all the mediation, one now must equally
>>>>>>> audit all this
>>>>>> giant software stack too.
>>>>>>> So maybe it is fine for those who are ok with it.
>>>>>> so you agree FLR is not a problem, at least for config space solution?
>>>>> I don’t know what you mean "FLR is not a problem".
>>>>>
>>>>> FLR on the VF must work as it works without live migration for
>>>>> passthrough
>>>> device as today.
>>>>> And admin commands have some interactions with it.
>>>>> And this proposal covers it.
>>>>> I am missing some text that Michael and Jason pointed out.
>>>>> I am working on v2 to annotate or better word them.
>>>> When guest reset the device, the device should be reset for sure.
>>>> then it forgets everything, how do you expect the reset-ed device still work
>> for live migration?
>>>> is it a race?
>>> I don’t expect it live migration to work at all with such a approach.
>>> This is why in my proposal live migration occurs on the owner device, while
>> controlled function (member device) is undergoing the device reset.
>> see above
>>>>>>>> For migration, you know the hypervisor takes the ownership of the
>>>>>>>> device in the stop_window.
>>>>>>> I do not know what stop_window means.
>>>>>>> Do you mean stop_copy of vfio or it is qemu term?
>>>>>> when guest freeze.
>>>>>>>>> 5. Any PASID to separate out admin vq on the VF does not work
>>>>>>>>> for two
>>>>>>>> reasons.
>>>>>>>>> R_1: device flr and device reset must stop all the dmas.
>>>>>>>>> R_2: PASID by most leading vendors is still not mature enough
>>>>>>>>> R_3: One also needs to do inversion to not expose PASID
>>>>>>>>> capability of the member PCI device to not expose
>>>>>>>> see above and what if guest shutdown? the same answer, right?
>>>>>>> Not sure, I follow.
>>>>>>> If the guest shutdown, the guest specific shutdown APIs are called.
>>>>>>>
>>>>>>> With passthrough device, R_1 just works as is.
>>>>>>> R_3 is not needed as they are directly given to the guest.
>>>>>>> R_2 platform dependency is not needed either.
>>>>>> I think we already have a concussion for FLR.
>>>>> I don’t have any concussion.
>>>>> I wrote what to be supported for the FLR above.
>>>> OK, again, our discussions has been ignored again, and all start over again.
>>>>
>>>> Would you please read our previous discussions?
>>> You asked the question about why it wont work, I answered.
>>> I don’t see a point of debating same thing over again.
>> Is that cut off again?
>>
> No it is not cut off here.
>
>> if still about FLR, so please see above comments.
>> And I agree if the answers are ignored again, we don't need to repeat.
> I didn’t ask questions. Please re-read.
>
>>>>>> For PASID, what blocks the solution?
>>>>> When the device is passthrough, PASID capabilities cannot be emulated.
>>>>> PASID space is owned fully by the guest.
>>>>>
>>>>> There is no single known cpu vendor support splitting pasid between
>>>> hypervisor and guest.
>>>>> I can double check, but last I recall that Linux kernel removed such
>>>>> weird
>>>> support.
>>>> do you know there is something called vIOMMU?
>>> Probably yes.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  6:32                               ` Zhu, Lingshan
@ 2023-10-18  6:34                                 ` Parav Pandit
  2023-10-18  6:39                                 ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  6:34 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

Your reply is not in plain-text format. Please resend it as plain text to continue the discussion.

From: Zhu, Lingshan <lingshan.zhu@intel.com> 
Sent: Wednesday, October 18, 2023 12:02 PM
To: Parav Pandit <parav@nvidia.com>; Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>; virtio-comment@lists.oasis-open.org; cohuck@redhat.com; sburla@marvell.com; Shahaf Shuler <shahafs@nvidia.com>; Maor Gottlieb <maorg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>
Subject: Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration

On 10/18/2023 1:00 PM, Parav Pandit wrote:

From: Zhu, Lingshan mailto:lingshan.zhu@intel.com
Sent: Monday, October 16, 2023 3:14 PM

On 10/13/2023 7:28 PM, Parav Pandit wrote:
From: Zhu, Lingshan mailto:lingshan.zhu@intel.com
Sent: Friday, October 13, 2023 2:36 PM
[..]
Because it does not work for passthrough mode.
what are you talking about?
Config space does not work passthrough?
Once the register space of the VF that is supposed to be used by the live
migration is passed to the guest, it is under guest control.
Hence, live migration driver won't be able to use it.
Does guest control device status to reset itself? harmful?
No. it is not harmful.
Is owner device resetting itself, harmful? No.
Is member device resetting itself, harmful? No.
Should member device reset
good

These facilities can be trapped and emulated, even the feature bits, right?
You know the guest actually doesn't directly access the device config space, there is
a vfio/vdpa driver, right?
You can practically trap and emulate everything.
If you continue to ignore passthrough requirements and keep repeating "do trap and emulate", this discussion does not go anywhere.
Clearly I did not ignore passthrough and keep answering your question for many times.
Maybe you didn't get it, so I would ask how you define your "passthrough"?

You may find that the guest vCPUs (guest vRC) actually are not privileged to access the
host PCI device (host CPU RC), that's why a pass-through driver like vfio is a must.

Therefore the device config space can be trapped. Is that clear now?

Have you ever tried pass through a virtio device to a guest?
:)
Please explain how the question is relevant to this discussion in separate
thread, so that one can keep technical focus.
(Please keep your discussion technical, instead of derogatory to other
members).
if you want me to answer your question, at least you SHOULD NOT cut off the
context, or you are trying to confuse everyone.
Or did you try to avoid or hide anything? I am not sure this is a good practice.

The context in last discussion is:

me: OK, I pop-ed Jason's proposal to make everything easier, and I see it is
refused.
you: Because it does not work for passthrough mode.
me: what are you talking about?
     Config space does not work passthrough?
     Have you ever tried pass through a virtio device to a guest?

So I ask you to try passing through a virtio-pci device to a guest, then check
whether the config space works for pass-through mode.

again, don't cut off threads before the discussion is closed.

Let me repeat again, these live migration facilities are
per-device(per-VF) facility, so it only migrates itself.

Since they are per device (per VF), they reside in the guest VM. Hence, VMM
cannot live migrate it.
you know the config space can be trapped and emulated, and the hypervisor
takes the ownership of the device once the guest freeze in the stop window.
When you say config space, do you mean PCI config space of 4K size?
you can take an example of virtio common config cap.

And for pass through, you can try passing through a virtio device to a
guest, see how the guest initializes the device through the config space.

That is really basic virtualization, not hard to test.
Repeated points, I am omitting.
ok, if you get it, let's close it.

inflight descriptor tracking will be implemented by Eugenio in V2.
When we have near complete proposal from two device vendors, you
want to push something to unknown future without reviewing the
work; does not
make sense.
Didn't I ever provide feedback to you? Really?
No. I didn’t see why you need to post a new patch for dirty page
tracking,
when it is already present in this series.
This is plain ignorance and shows non_cooperative mode of working in
technical committee.
you have cut off the thread again, so I can't read the context.
Enjoy long threads. 😊
you skip finished discussions for sure, but don't do that to on-going discussions.

I would like to understand and review these aspects.
Same for the device context.
you will see dirty page tracking in my V2, as I repeated for many times.
Since you are not co-operative, I have less sympathy to see V2.
I don’t see a reason to see when, it is fully presented here.
Again, please don't take it personal and please be professional.

Speaking of collaboration, please at least respect others' time and answers.
Both Jason and I have responded to you multiple times on the same
questions(for example, FLR, nested, passthrough).
If our answers are ignored again and again, and then after a few days or hours
you come back asking the same question again, what's the point?

I didn’t ask questions in area of FLR and passthrough, please check again.
OK, then please don't force us answer the same questions anymore, for example no FLR anymore.

And please don't cut off any threads before we close the discussion.

For device context, we have discussed this in other threads, did you
ignore that again?
No. I didn’t. I replied that the generic infrastructure is built that enables every
device type to migrate by defining their device context.
don't we have a conclusion there or did you miss anything? Since you refuse to
define device context for every device type, how do you migrate stateful
devices?

So we should implement a stateless live migration solution, right?
No. device context is a basic facility that intends to cover most virtio devices.
not most, instead it should be "all", if you are implementing a virtio live migration.

I did not refuse to define context.
I said, device context will be incrementally defined subsequently.
just define what we have now, for example, define virtio-fs as it is now.

Like Michael said, I expect every device to define device context section in coming months for 1.4 time frame.
as MST said, do you expect the implementation to figure out the device context by themselves?
If you want to migrate device context, you should define them.

Hint: how do you define device context for every device type, e.g, virtio-fs.
Don't say you only migrate virtio-net or blk.
I didn’t say it. I said to migrate all 30+ device types.
And infrastructure is presented here.
so please define device context for all the devices.
how about starting from virtio-fs?
Should be done incrementally.
show me your patch

You are still in the mode of _take_ what we did with near zero
explanation.
You asked question of why passthrough proposal cannot advantage of
in_band
config registers.
I explained technical reason listed here.
I have answered the questions, and asked questions for many times.
What do you mean by "why passthrough proposal cannot advantage of
in_band config registers."?
Config space work for passthrough for sure.
Config space registers are passed through to the guest VM.
Hence the hypervisor messing with them, programming some address, would
result in either a security issue,
or functionally broken behavior; to sustain the functionality, each nested
layer needs one copy of these registers for each nest level.
So they must be trapped somehow.
trap and emulated are basic virtualization.
Not for passthrough devices, sorry.
See the paper that Jason pointed out.
For the control program/VMM, a trap is involved only on the privileged operations of
the VMM.
Virtio cvqs, virtio registers are not the privileged operation of the VMM,
because they are of the native virtio device itself.
Period.
since the context is cut off again, I failed to read the context.

But config space can be trapped and emulated, right?
Answered above.

When the guest accesses the device config space, it actually accesses the
hypervisor-presented config space.

Secondly I don’t see how one can read 1M flows using config registers.
Not sure what you are talking about, beyond the spec?
The spec which has been in the works for a few months by multiple technical
members.
Please subscribe to virtio-comment mailing list.
How come you changed your point from cvq to different argument of out
of spec? :)
I mean, what is your 1M flows? is it beyond spec?

No. it is not beyond the spec.
It is the spec in work for several months by multiple device, OS and cloud operators.
then, again, what is your 1M flow? if not defined in the spec, then it is beyond spec.

So please don’t jump to conclusions before finishing the
discussion on how
both side can take advantage of each other.
Lets please do that.
We have proposed a solution, right?

Which one? To do something in future?
I don’t see a suggestion on how one can use device context and dirty
page
tracking for nested and passthrough uniformly.
I see a technical difficulty in making both work with uniform interface.
Please don't ignore previous answers, don't force us repeat again and again.

You didn’t answer, how.
Your answer was "you will post dirty page tracking without reviewing current"
and Eugenio will post v2....
Yes, will do. and you can check the patch when it posted.

Does not make sense to me at all.
tracking dirty pages does not make sense to you?

Eugenio will cook a patch for in-flight descriptors, not dirty page, that is mine.

It is Jason's proposal. Please refer to previous threads, also for
device context and dirty pages.
I still need to point out: admin vq LM does not work, one example is
nested.
As Michael said, please don’t confuse between admin commands and
admin
vq.
anyway, admin vq live migration don't work for nested.
I am convinced by the paper that Jason pointed out.

A nested solution involves a member device supporting the nesting without
trap and emulation so that it follows the two properties:
The efficiency property and equivalence property.

Hence a member device which wants to support nested case, should present
itself with attributes to support nesting.
failed to process the sentence, but I am glad you are convinced by the paper.

There is no scale problem, as I have repeated many times; they are
per-device basic facilities, just migrate the VF by its own
facility, so there are no 40000 member devices, this is not per PF.

I explained that device reset, flr etc flow cannot work when
controlling and
controlled functions are single entity for passthrough mode.
The scale problem is, one needs to duplicate the registers on each VF.
The industry is moving away from the register interface in many
_real_ hw
devices implementation.
Some of the examples are IMS, SIOV, NVMe and more.
we have discussed this for many times, please refer to previous
threads, even with Jason.
I do not agree for any registers to add to the VF which are reset on
device_reset and FLR.
As it does not work for passthrough mode.
Jason has answered these FLR questions of yours many times, I don't
want to repeat his words, even myself have answered many times. If
you keep ignoring the answers, and ask again and again, what is the point?

So please refer to the previous threads.
I don’t think I asked the question above. Please re-read.
you cut it off again, what question? if about FLR, I believe Jason has answered
for many times.

Again, please read. I didn’t ask the question for FLR.
You keep saying "what question".
I failed to read the context because you have cut them off.

The device context can be read from config space or trapped, like
shadow
There are 1 million flows of the net device flow filters in progress.
Each flow is 64B in size.
Total size is 64MB.
I don’t see how one can read such amount of memory using config
registers.
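
As a rough illustration of the scale in the flow filter example above, the arithmetic can be sketched as follows. The flow count and per-flow size come from the text; the 4-byte register width is an assumption for illustration only, since no register width is given here.

```python
# Figures quoted in the discussion above.
FLOW_COUNT = 1_000_000      # "1 million flows"
FLOW_SIZE_BYTES = 64        # "Each flow is 64B in size"

# Assumption for illustration: one config register read returns 4 bytes.
REGISTER_WIDTH_BYTES = 4


def context_size_bytes(flows: int, flow_size: int) -> int:
    """Total bytes of flow state that would have to be transferred."""
    return flows * flow_size


def register_reads_needed(total_bytes: int, reg_width: int) -> int:
    """Number of register reads to move total_bytes (ceiling division)."""
    return -(-total_bytes // reg_width)


total = context_size_bytes(FLOW_COUNT, FLOW_SIZE_BYTES)
reads = register_reads_needed(total, REGISTER_WIDTH_BYTES)
print(f"{total} bytes (~{total // 1_000_000} MB), {reads} register reads")
```

Under that assumed width, the 64 MB of flow state corresponds to 16 million individual register reads per device, which is the scale concern being raised.
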
control vq?
The control vq and flow filter vqs are owned by the guest driver,
not the
hypervisor.
So no, cvq cannot be used.
first, don't cut off the threads, don't delete words, that really confuses
readers.

Your comments are so long that it is hard to follow such a long thread.
Hence only the related comments are kept.
But I understand, will try to avoid.

And I think you misunderstand a lot of virtualization fundamentals,
at least have a look at how shadow control vq works.

In case you don’t know, the shadow cvq acceleration for Nvidia ConnectX6-DX
is done jointly with Dragos and me, with recent patches from Sie-Wei.

I don’t think I missed it.

Shadow vq is great when you don’t have underlying support from the device.

When you have passthrough member devices, they are not trapped or
emulated.
The future hypervisor must not be able to see things of cvq, datavq or
addresses programmed by the guest.
And hence the infrastructure is geared towards such approach.
I failed to read the full context as you cut them off. I can't even read your
original questions, they are truncated.

Anyway, lets migrate device without device-context first.

Passthrough device cannot migrate without device-context as listed.
so please define the device context.

And the parameters set to config vq are also device context as we
discussed for many times.
Or do you want to migrate non-virtio context?
Everything is virtio device context.
see above
control vq which is already done, that is basic virtualization.
There is nothing like "basic virtualization".
What is proposed here is fulfilling the requirement of passthrough mode.

Your comment is implying, "I don’t care for passthrough
requirements, do
non_passthrough".
that is your understanding, and you misunderstood it. Config space
servers passthrough for many years.
"Config space servers" ?
I do not understand it, can you please explain what does that mean?

I do not see your suggestion on how one can implement passthrough
member
device when passthrough device does the dma and migration framework
also need to do the dma.
Try passing through a virtio device to a guest and learn how the guest
takes advantage of the config space before you comment.
Right. It does not work. The guest is doing the device_reset and flr.
Hence, it is resetting everything. All the dirty page log is lost.
All the device context is lost.
Hypervisor didn’t see any of this happening, because it didn’t do the trap.

Look, if you are going to continue to argue that you must do trap +
emulation and don’t talk about passthrough, Please stop here, because
discussion won't go anywhere.

I did my best to answer the limitations in the very first email where you asked.
OK, I see the gap, and I am sure we can help you here.
Try considering a question:
how do you define pass-through? 
As defined in the cover letter and theory of operation.
Repeat here:
A device whose virtio interfaces are not intercepted by VMM.
In future, maybe even MSI-X and MSI-X_v2 or a newer interrupt method will be passthrough at device level too.
(only cpu level interrupt remapping will be hypercall at interrupt controller level).

The PCI spec defined config space is to stay emulated, as it is generic and not supposed to have any virtio specific things in it as directed by the PCI-SIG.
how does the guest access device config space in your "passthrough"?

Can a guest access the device without a host driver helper?
Yes for all the virtio interfaces which includes, virtio device common and device config space, cvq, data vq, flow filter vqs, shared memory and anything new of the future.
interesting, if so, let me ask you a question, Is a guest privileged to access any devices on the host?

That basic facility is missing dirty page tracking, P2P support,
device context,
FLR, device reset support.
Hence, it is unusable right now for passthrough member device.
And the 6th problematic thing in it is, it does not scale with member devices.
Please refer to previous discussions, it is meaningless if you keep
ignoring our answers and keep asking the same questions.
Again, please re-read, I didn’t ask the question.
I replied 6 problems that are not solved.
I believe we have answered for many times. The questions are cut off again, but
how about search for previous answers?

If you want to migrate device context, you need to specify device
context for every type of device, net may be easy, how do you see virtio-fs?
Virtio-fs will have its own device context too.
Every device has some sort of backend in varied degree.
Net being a widely used and moderately complex device.
Fs being slightly stateful but less complex than net, as it has
far less control
operations.
so, are you saying you have implemented a live migration solution which
can migrate device context, but only works for net or block?
I don’t think this question about implementation has any relevance.
Frankly feels like a court to me. :( No. I didn’t say that.
We have implemented net, fs, block devices and single framework
proposed
here can support all 3 and rest 28+.
The device context part in this series does not cover special/optional
things of all the device types.
This is something I promised to do gradually, once the framework looks
good.
If you don't define them, only talking about "migrate the device
context" but don't tell us what to migrate, does this make sense to anybody?
Then you should call it virtio net/blk migration and implement in
net/block section.
No. you misunderstood. My point was showing orthogonal complexities
of net
vs fs.
I likely failed to explain that.
see above, anyway you need to define them, how about starting from
virtio-fs?
In fact virtio-fs device already discusses the migrating the
device side state, as
listed in device context.
So virtio-fs device will have its own device-context defined.
if you want to migrate it, you need to define it
Sure.
Only device specific things to be defined in future.
Now, not future if you want to migrate device context.
It is not mandatory, and it is impractical do everything in one series.
It is planned for 1.4.
really, you want to define device context for every device type?

Yes.

Remember don't migrate device-context before you define them or how can
the HW implementations know what to do.
I disagree. The infrastructure is defined. And incrementally device context will also be defined.
See an example work from Michael, i.e. admin command and aq generic facility is defined.
And device migration is able to utilize it incrementally. The lower layer fulfill the requirements.
This is exactly what is done here.

Device context framework is defined and many device spec owners will be able to easily define their device context, making it migratable.
see above answers for device context.

Rest is already present.
We are not going to define all the device context in one patch
series that no
one can review reliably.
It will be done incrementally.
so you agree at least for now we should migrate stateless devices, right?
But the feedback, I am taking is, we need to add a command that
indicates
which TLVs are supported in the device migration.
So virtio-fs or other device migration capabilities can be discovered.
I will cover this in v2.
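
To make the idea of discovering supported TLVs concrete, here is a minimal sketch under stated assumptions. The byte layout (2-byte type, 4-byte length, little-endian), the type numbers, and every function name below are hypothetical and are not taken from this series; the sketch only shows how a destination could check a device context stream against a list of TLV types it supports.

```python
import struct

# Hypothetical header layout for illustration: u16 type, u32 length, little-endian.
HEADER = struct.Struct("<HI")

# Made-up type numbers, not defined by the series.
TLV_COMMON = 1
TLV_FS_STATE = 100


def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    """Pack one type-length-value record."""
    return HEADER.pack(tlv_type, len(value)) + value


def decode_tlvs(buf: bytes) -> list[tuple[int, bytes]]:
    """Walk a byte stream and return (type, value) pairs in order."""
    records = []
    offset = 0
    while offset < len(buf):
        tlv_type, length = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        records.append((tlv_type, buf[offset:offset + length]))
        offset += length
    return records


def unsupported_types(context: bytes, supported: set[int]) -> list[int]:
    """TLV types present in the context that the destination did not advertise."""
    present = {tlv_type for tlv_type, _ in decode_tlvs(context)}
    return sorted(present - supported)


context = encode_tlv(TLV_COMMON, b"\x01\x02") + encode_tlv(TLV_FS_STATE, b"fs")
print(unsupported_types(context, supported={TLV_COMMON}))
```

A non-empty result would tell the migration driver, before committing to the migration, that the destination cannot consume part of the context.
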
so you propose a solution as "virtio migration", but only migrate
selective types of devices?
You should rename it to be "virtio-net live migration".
Sorry, I won't. Because the infrastructure is for the majority of device types.

Which field did you observe which is net specific?
We want to cover all the device types.
Don’t need to cook their context in one series.
so, not work for all device types? limited to some specific types?
you still need to rename it what ever.
No. framework works for all device types.
without defining them?

Thanks a lot for these thoughts.

The infrastructure and basic facilities are setup in this series,
that one can
easily extend for all the current and new device types.
really? how?
And we are migrating stateless devices, or no? How do you migrate
virtio-fs?
2. sharing such large context and write addresses in parallel
for multiple devices cannot be done using single register file
see above
3. These registers cannot be residing in the VF because VF can
undergo FLR, and device reset which must clear these registers
do you mean you want to audit all PCI features? When FLR, the
device is reset, do you expect a device to remember anything after FLR?
Not at all. VF member device will not remember anything after FLR.
Do you want to trap FLR? Why?
This proposal does _not_ want to trap the FLR in the hypervisor virtio
driver.

When one does the mediation-based design, it must
trap/emulate/fake the
FLR.
It helps to address the case of nested as you mentioned.
once passthrough, the guest driver can access the config space to
reset the device, right?
Why FLR block or conflict with live migration?
It does not block or conflict.
OK, cool, so let's make this a conclusion
The whole point is, when you put live migration functionality on
the VF itself,
you just cannot FLR this device.
One must trap the FLR and do fake FLR and build the whole
infrastructure to
not FLR The device.
Above is not passthrough device.
No, the guest can reset the device, even causing a failed live migration.
Not in the proposal here.
Can you please prove how in the current v1 proposal, device reset
will fail the
migration?
I would like to fix it.
if the device is reset, it forgets everything right?
Right. This is why all dirty page tracking and device context is lost on device reset.
Hence, the controlling function and controlled function are two different
entities.
so there can be inconsistent migrations and races, right? And if the guest reset
the device, actually the hypervisor should let it be, right?
No. it should not be in because hypervisor has not composed the member device. It is in the hw controlled function itself.
interesting, do you mean when the guest reset the device, the hypervisor should refuse?

This actually conflicts with your statement of your "passthrough" by "not intercepted by VMM". So you actually understand trap and emulate and passthrough.

4. When VF does the DMA, all dma occurs in the guest address
space, not in
hypervisor space; any flr and device reset must stop such dma.
And device reset and flr are controlled by the guest (not
mediated by
hypervisor).
if the guest reset the device, it is totally reasonable
operation, and the guest own the risk, right?
Sure, but the guest still expects its dirty pages and device
context to be
migrated across device_reset.
Device_reset will lose all this information within the device if
done without
mediation and special care.
No, if the guest reset a device, that means the device should be
RESET, to forget its config, that would be really weird to migrate
a fresh device at the source side, to be a running device at the
destination
side.
Device reset not doing the role of reset is just a plain broken spec.
why? The reset behavior is well defined in the spec, and works fine for years.
So any new construct that one adds, it will be reset as well and dirty page
track is lost.
Yes and do you want to prevent that? You may surprise the guest.
Yes, want to prevent that.
Not sure what you mean by surprise the guest. Unlikely.
Why because guest did the reset, it knows what it is doing.
(Keep in mind that guest does not expect to lose its dirty pages).
Shocked...

This statement conflicts with basic virtualization.

So, to avoid that now one needs to have fake device reset too and
build that
infrastructure to not reset.
The passthrough proposal fundamental concept is:

all the native virtio functionalities are between guest driver and
the actual
device.
see above.
and still, do you want to audit every PCI features? at least you
didn't do that in your series.
Can you please list which PCI features audit you are talking about?
you audit FLR, then do you want to check everyone?
If no, how to decide which one should be audited, why others not?
I really find it hard to follow your question.

I explained in patch 5 and 8 about interactions with the FLR and its support.
Not sure what you want me to check.

You mentioned that "I didn’t audit every PCI features"? So can you
please list
which one and in relation to which admin commands?
Your job to audit everyone if you talk about FLR. Because FLR is PCI
spec, not virtio, you need to explain why other PCI features not need to be
audited.

Sure, but when you point a finger that I didn’t audit, please mention what is not
audited.
well, we are migrating virtio devices, but you keep talking PCI, so do you want to
take every PCI functionality into consideration?
For pci transport, yes.
First, that is out of virtio spec.
Second, if so, you should audit every pci feature, state.
Don't say you want me to define them, this is your statement.

We have explained why FLR is not a concern for many times, and I
don't want to repeat, please refer to previous discussions.
You seem to ignore the first paragraph of theory of operation that FLR is not
trapped.
this is the guest issue FLR, right? If so the guest owns the risks and the
hypervisor should not prevent that.
Exactly, the hypervisor does not prevent it.
The owner device still has the ownership to not lose previously logged dirty page addresses.
And the device still needs to report that a device reset occurred, so that the destination side can wipe off and start fresh.
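
The split described above can be pictured with a toy model. This is only a sketch: the class and method names are hypothetical and do not correspond to any admin command in the series. It shows that state held on the member is cleared by a guest-initiated reset, while a dirty page log kept by the owner survives and the reset event is still recorded.

```python
class MemberDevice:
    """Toy stand-in for a passthrough VF (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.local_state: dict[str, int] = {}

    def guest_write(self, key: str, value: int) -> None:
        self.local_state[key] = value

    def reset(self) -> None:
        # A guest-initiated device reset or FLR clears everything on the member.
        self.local_state.clear()


class OwnerDevice:
    """Toy stand-in for the owner PF holding migration state for its members."""

    def __init__(self) -> None:
        self.dirty_pages: dict[int, set[int]] = {}   # member id -> page addresses
        self.reset_seen: dict[int, bool] = {}        # member id -> reset observed

    def log_dirty(self, member_id: int, page_addr: int) -> None:
        self.dirty_pages.setdefault(member_id, set()).add(page_addr)

    def note_member_reset(self, member_id: int) -> None:
        # Recorded so the destination can wipe off and start fresh.
        self.reset_seen[member_id] = True

    def read_dirty(self, member_id: int) -> list[int]:
        return sorted(self.dirty_pages.get(member_id, set()))


owner = OwnerDevice()
vf = MemberDevice()

vf.guest_write("queue_enable", 1)
owner.log_dirty(member_id=1, page_addr=0x1000)
owner.log_dirty(member_id=1, page_addr=0x2000)

vf.reset()                          # the guest resets its device; nothing is trapped
owner.note_member_reset(member_id=1)

print(vf.local_state)               # member state is gone
print(owner.read_dirty(1))          # previously logged pages are still readable
```
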
OK, so you know the answer now. This answers your own question above.

Keep in mind that with all the mediation, one now must equally
audit all this
giant software stack too.
So maybe it is fine for those who are ok with it.
so you agree FLR is not a problem, at least for config space solution?
I don’t know what you mean "FLR is not a problem".

FLR on the VF must work as it works without live migration for
passthrough
device as today.
And admin commands have some interactions with it.
And this proposal covers it.
I am missing some text that Michael and Jason pointed out.
I am working on v2 to annotate or better word them.
When guest reset the device, the device should be reset for sure.
then it forgets everything, how do you expect the reset-ed device still work
for live migration?
is it a race?
I don’t expect live migration to work at all with such an approach.
This is why in my proposal live migration occurs on the owner device, while
controlled function (member device) is undergoing the device reset.
see above

For migration, you know the hypervisor takes the ownership of the
device in the stop_window.
I do not know what stop_window means.
Do you mean stop_copy of vfio or it is qemu term?
when guest freeze.
5. Any PASID to separate out admin vq on the VF does not work
for two
reasons.
R_1: device flr and device reset must stop all the dmas.
R_2: PASID by most leading vendors is still not mature enough
R_3: One also needs to do inversion to not expose PASID
capability of the member PCI device to not expose
see above and what if guest shutdown? the same answer, right?
Not sure, I follow.
If the guest shutdown, the guest specific shutdown APIs are called.

With passthrough device, R_1 just works as is.
R_3 is not needed as they are directly given to the guest.
R_2 platform dependency is not needed either.
I think we already have a conclusion for FLR.
I don’t have any conclusion.
I wrote what to be supported for the FLR above.
OK, again, our discussions have been ignored again, and all start over again.

Would you please read our previous discussions?
You asked the question about why it won't work, I answered.
I don’t see a point of debating same thing over again.
Is that cut off again?

No it is not cut off here.

if still about FLR, so please see above comments.
And I agree if the answers are ignored again, we don't need to repeat.
I didn’t ask questions. Please re-read.

For PASID, what blocks the solution?
When the device is passthrough, PASID capabilities cannot be emulated.
PASID space is owned fully by the guest.

There is not a single known CPU vendor that supports splitting PASID between
hypervisor and guest.
I can double check, but last I recall that Linux kernel removed such
weird
support.
do you know there is something called vIOMMU?
Probably yes.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  5:02                               ` Parav Pandit
  2023-10-18  6:20                                 ` Michael S. Tsirkin
@ 2023-10-18  6:35                                 ` Zhu, Lingshan
  2023-10-18  6:41                                   ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-18  6:35 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/18/2023 1:02 PM, Parav Pandit wrote:
>
>> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
>> open.org> On Behalf Of Zhu, Lingshan
>> Sent: Monday, October 16, 2023 3:18 PM
>>
>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Friday, October 13, 2023 3:14 PM
>>>>>>>> How do you transfer the ownership?
>>>>>>> An additional ownership delegation by a new admin command.
>>>>>> if you think this can work, do you want to cook a patch to
>>>>>> implement this before you submitting this live migration series?
>>>>> I answered this already above.
>>>> talk is cheap, show me your patch
>>> Huh. We presented the infrastructure that migrates, 30+ device types,
>> covering device context ideas from Oracle.
>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
>>>
>>> Please have some respect for other members who covered more ground than
>> your series.
>>> What more? Apply the same nested concept on the member device as
>> Michael suggested, it is nested virtualization maintain exact same semantics.
>>> So a VF is mapped as PF to the L1 guest.
>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
>>>
>>> This nested work can be extended in future, once first level nesting is
>> covered.
>>>> Answer all questions above, if you think a management VF can work,
>>>> please show me your patch.
>>> The idea evolves from technical debate rather than pointing fingers like your
>> comment.
>>> I think a positive discussion with Michael and a pointer to the paper from
>> Jason gave a good direction of doing _right_ nesting that follows two principles.
>>> a. efficiency property
>>> b. equivalence property
>>>
>>> (c. resource control is natural already)
>>>
>>> Both apply at VMM and at VM level enabling recursive virtualization, by
>> having VF that can act as PF inside the guest.
>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
>> Please just show me your patch resolving these opens, how about start from
>> defining virtio-fs device context and your management VF?
> As answered, device context infrastructure is done, per device specific device-context will be defined incrementally.
> I will not be including virtio-fs in this series. It will be done incrementally in future utilizing the infrastructure built in this series.
Done? How do you conclude this? You just tell me what is the full set of
virtio-fs device context now and how to migrate them.

You can't? You refuse or you don't? Do you expect the HW designers to
figure it out by themselves?




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  6:32                               ` Zhu, Lingshan
  2023-10-18  6:34                                 ` Parav Pandit
@ 2023-10-18  6:39                                 ` Zhu, Lingshan
  2023-10-18  6:42                                   ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-18  6:39 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

[-- Attachment #1: Type: text/plain, Size: 31229 bytes --]

resend as Parav requested. This mail format looks fine at my side

On 10/18/2023 2:32 PM, Zhu, Lingshan wrote:
>
>
> On 10/18/2023 1:00 PM, Parav Pandit wrote:
>>> From: Zhu, Lingshan<lingshan.zhu@intel.com>
>>> Sent: Monday, October 16, 2023 3:14 PM
>>>
>>> On 10/13/2023 7:28 PM, Parav Pandit wrote:
>>>>> From: Zhu, Lingshan<lingshan.zhu@intel.com>
>>>>> Sent: Friday, October 13, 2023 2:36 PM
>>>> [..]
>>>>>> Because it does not work for passthrough mode.
>>>>> what are you talking about?
>>>>> Config space does not work passthrough?
>>>> Once the register space of the VF that is supposed to be used by the live
>>> migration is passed to the guest, it is under guest control.
>>>> Hence, live migration driver won't be able to use it.
>>> Does guest control device status to reset itself? harmful?
>> No. it is not harmful.
>> Is owner device resetting itself, harmful? No.
>> Is member device resetting itself, harmful? No.
>> Should member device reset
> good
>>> These facilities can be trapped and emulated, even the feature bits, right?
>>> You know the guest actually don't direct access the device config space, there is
>>> a vfio/vdpa driver, right?
>> You can practically trap and emulate everything.
>> If you continue to ignore passthrough requirements and keep repeating that do trap and emulate, this discussion does not go anywhere.
> Clearly I did not ignore passthrough; I have answered your question
> many times.
> Maybe you didn't get it, so I would ask how you define your "passthrough"?
>
> You may find that the guest vCPUs(guest vRC) actually are not 
> privileged to access to
> host pci device(host CPU RC), that's why a pass-through driver like 
> vfio is a must.
>
> Therefore the device config space can be trapped. Is that clear now?
>>>>> Have you ever tried pass through a virtio device to a guest?
>>>> :)
>>>> Please explain how the question is relevant to this discussion in separate
>>> thread, so that one can keep technical focus.
>>>> (Please keep your discussion technical, instead of derogatory to other
>>> members).
>>> if you want me to answer your question, at least you SHOULD NOT cut off the
>>> context, or you are trying to confuse everyone.
>>> Or did you try to avoid or hide anything? I am not sure this is a good practice.
>>>
>>> The context in last discussion is:
>>>
>>> me: OK, I pop-ed Jason's proposal to make everything easier, and I see it is
>>> refused.
>>> you: Because it does not work for passthrough mode.
>>> me: what are you talking about?
>>>       Config space does not work passthrough?
>>>       Have you ever tried pass through a virtio device to a guest?
>>>
>>> So I ask you to try passing through a virtio-pci device to a guest, then check
>>> whether the config space work for pass-through mode.
>>>
>>> again, don't cut off threads before the discussion is closed.
>>>>> Let me repeat again, these live migration facilities are
>>>>> per-device(per-VF) facility, so it only migrates itself.
>>>>>
>>>> Since they are per device (per VF), they reside in the guest VM. Hence, VMM
>>> cannot live migrate it.
>>> you know the config space can be trapped and emulated, and the hypervisor
>>> takes the ownership of the device once the guest freeze in the stop window.
>> When you say config space, do you mean PCI config space of 4K size?
> you can take an example of virtio common config cap.
>>>>> And for pass through, you can try passthrough a virtio device to a
>>>>> guest, see how the guest initialize the device through the config space.
>>>>>
>>>>> That is really basic virtualization, not hard to test.
>>>> Repeated points, I am omitting.
>>> ok, if you get it, let's close it.
>>>>>>>>> inflight descriptor tracking will be implemented by Eugenio in V2.
>>>>>>>> When we have near complete proposal from two device vendors, you
>>>>>>>> want to push something to unknown future without reviewing the
>>>>>>>> work; does not
>>>>>>> make sense.
>>>>>>> Didn't I ever provide feedback to you? Really?
>>>>>> No. I didn’t see why you need to post a new patch for dirty page
>>>>>> tracking,
>>>>> when it is already present in this series.
>>>> This is plain ignorance and shows non_cooperative mode of working in
>>> technical committee.
>>> you have cut off the thread again, so I can't read the context.
>> Enjoy long threads. 😊
> you skip finished discussions for sure, but don't do that to on-going 
> discussions.
>>>>>> I would like to understand and review this aspects.
>>>>>> Same for the device context.
>>>>> you will see dirty page tracking in my V2, as I repeated for many times.
>>>> Since you are not co-operative, I have less sympathy to see V2.
>>>> I don’t see a reason to see when, it is fully presented here.
>>> Again, please don't take it personal and please be professional.
>>>
>>> Speaking of collaboration, please at least respect others' time and answers.
>>> Both Jason and I have responded to you multiple times on the same
>>> questions(for example, FLR, nested, passthrough).
>>> If our answers are ignored again and again, and then after a few days or hours
>>> you come back asking the same question again, what's the point?
>>>
>> I didn’t ask questions in area of FLR and passthrough, please check again.
> OK, then please don't force us answer the same questions anymore, for 
> example no FLR anymore.
>>> And please don't cut off any threads before we close the discussion.
>>>>> For device context, we have discussed this in other threads, did you
>>>>> ignored that again?
>>>> No. I didn’t. I replied that the generic infrastructure is built that enables every
>>> device type to migrate by defining their device context.
>>> don't we have a conclusion there or did you miss anything? Since you refuse to
>>> define device context for every device type, how do you migrate stateful
>>> devices?
>>>
>>> So we should implement a stateless live migration solution, right?
>> No. device context is basic facility that intent to cover most virtio devices.
> not most, instead it should be "all", if you are implementing a virtio 
> live migration.
>> I did not refuse to define context.
>> I said, device context will be incrementally defined subsequently.
> just define what we have now, for example, define virtio-fs as it is now.
>> Like Michael said, I expect every device to define device context section in coming months for 1.4 time frame.
> as MST said, do you expect the implementation to figure out the device 
> context by themselves?
> If you want to migrate device context, you should define them.
>>>>> Hint: how do you define device context for every device type, e.g, virtio-fs.
>>>>> Don't say you only migrate virtio-net or blk.
>>>> I didn’t say it. I said to migrate all 30+ device types.
>>>> And infrastructure is presented here.
>>> so please define device context for all the devices.
>>> how about starting from virtio-fs?
>> Should be done incrementally.
> show me your patch
>>>>>>>> You are still in the mode of _take_ what we did with near zero
>>> explanation.
>>>>>>>> You asked question of why passthrough proposal cannot advantage of
>>>>>>>> in_band
>>>>>>> config registers.
>>>>>>>> I explained technical reason listed here.
>>>>>>> I have answered the questions, and asked questions for many times.
>>>>>>> What do you mean by "why passthrough proposal cannot advantage of
>>>>>>> in_band config registers."?
>>>>>>> Config space work for passthrough for sure.
>>>>>> Config space registers are passthrough the guest VM.
>>>>>> Hence hypervisor messing it with, programming some address would
>>>>>> result in
>>>>> either security issue.
>>>>>> Or functionally broken, to sustain the functionality, each nested
>>>>>> layer needs
>>>>> one copy of these registers for each nest level.
>>>>>> So they must be trapped somehow.
>>>>> trap and emulated are basic virtualization.
>>>> Not for passthrough devices, sorry.
>>>> See the paper that Jason pointed out.
>>>> Control program/vmm is trap is involved only on the privileged operation of
>>> the VMM.
>>>> Virtio cvqs, virtio registers are not the privileged operation of the VMM,
>>> because they are of the native virtio device itself.
>>>> Period.
>>> since the context is cut off again, I failed to read the context.
>>>
>>> But config space can be trapped and emulated, right?
>> Answered above.
>>
>>> When guest accessing device config space, actually it access the hypervisor-
>>> presented config space.
>>>>>> Secondly I don’t see how one can read 1M flows using config registers.
>>>>> Not sure what you are talking about, beyond the spec?
>>>> The spec which is under works for few months by multiple technical
>>> members.
>>>> Please subscribe to virtio-comment mailing list.
>>>> How come you changed your point from cvq to different argument of out
>>>> of spec? :)
>>> I mean, what is your 1M flows? is it beyond spec?
>> No. it is not beyond the spec.
>> It is the spec in work for several months by multiple device, OS and cloud operators.
> then, again, what is your 1M flow? if not defined in the spec, then it 
> is beyond spec.
>>>>>>>> So please don’t jump to conclusions before finishing the
>>>>>>>> discussion on how
>>>>>>> both side can take advantage of each other.
>>>>>>>> Lets please do that.
>>>>>>> We have proposed a solution, right?
>>>>>>>
>>>>>> Which one? To do something in future?
>>>>>> I don’t see a suggestion on how one can use device context and dirty
>>>>>> page
>>>>> tracking for nested and passthrough uniformly.
>>>>>> I see a technical difficulty in making both work with uniform interface.
>>>>> Please don't ignore previous answers, don't force us repeat again and again.
>>>>>
>>>> You didn’t answer, how.
>>>> Your answer was "you will post dirty page tracking without reviewing current"
>>> and Eugenio will post v2....
>>> Yes, will do. and you can check the patch when it posted.
>>>
>> Does not make sense to me at all.
> tracking dirty pages does not make sense to you?
>>   
>>> Eugenio will cook a patch for in-flight descriptors, not dirty page, that is mine.
>>>>> It is Jason's proposal. Please refer to previous threads, also for
>>>>> device context and dirty pages.
>>>>>>> I still need to point out: admin vq LM does not work, one example is
>>> nested.
>>>>>> As Michael said, please don’t confuse between admin commands and
>>>>>> admin
>>>>> vq.
>>>>> anyway, admin vq live migration don't work for nested.
>>>> I am convinced with the paper that Jason pointed out.
>>>>
>>>> A nested solution involves a member device supporting the nesting without
>>> trap and emulation so that it follows the two properties:
>>>> The efficiency property and equivalence property.
>>>>
>>>> Hence a member device which wants to support nested case, should present
>>> itself with attributes to support nesting.
>>> failed to process the sentence, but I am glad you are convinced by the paper.
>>>>>>>>> There are no scale problem as I repeated for many time, they are
>>>>>>>>> per-device basic facilities, just migrate the VF by its own
>>>>>>>>> facility, so there are no 40000 member devices, this is not per PF.
>>>>>>>>>
>>>>>>>> I explained that device reset, flr etc flow cannot work when
>>>>>>>> controlling and
>>>>>>> controlled functions are single entity for passthrough mode.
>>>>>>>> The scale problem is, one needs to duplicate the registers on each VF.
>>>>>>>> The industry is moving away from the register interface in many
>>>>>>>> _real_ hw
>>>>>>> devices implementation.
>>>>>>>> Some of the examples are IMS, SIOV, NVMe and more.
>>>>>>> we have discussed this for many times, please refer to previous
>>>>>>> threads, even with Jason.
>>>>>> I do not agree for any registers to add to the VF which are reset on
>>>>> device_reset and FLR.
>>>>>> As it does not work for passthrough mode.
>>>>> Jason has answered your these FLR questions for many times, I don't
>>>>> want to repeat his words, even myself have answered many times. If
>>>>> you keep ignoring the answers, and ask again and again, what is the point?
>>>>>
>>>>> So please refer to the previous threads.
>>>> I don’t think I asked the question above. Please re-read.
>>> you cut if off again, what question? if about FLR, I believe Jason has answered
>>> for many times.
>> Again, please read. I didn’t ask the question for FLR.
>> You keep saying "what question".
> I failed to read the context because you have cut them off.
>>>>>>>>> The device context can be read from config space or trapped, like
>>>>>>>>> shadow
>>>>>>>> There are 1 million flows of the net device flow filters in progress.
>>>>>>>> Each flow is 64B in size.
>>>>>>>> Total size is 64MB.
>>>>>>>> I don’t see how one can read such amount of memory using config
>>>>> registers.
>>>>>>> control vq?
>>>>>> The control vq and flow filter vqs are owned by the guest driver,
>>>>>> not the
>>>>> hypervisor.
>>>>>> So no, cvq cannot be used.
>>>>> first, don't cut off the threads, don't delete words, that really confusing
>>> readers.
>>>> Your comments are so long that it is hard to follow such a long thread.
>>>> Hence only the related comments are kept.
>>>> But I understand, will try to avoid.
>>>>
>>>>> And I think you misunderstand a lot of virtualization fundamentals,
>>>>> at least have a look at how shadow control vq works.
>>>>>
>>>> In case if you don’t know, the shadow cvq acceleration for Nvidia ConnectX6-
>>> DX is done jointly with Dragos and me, with recent patches from Sie-Wei.
>>>> I don’t think so I missed.
>>>>
>>>> Shadow vq is great when you don’t have underlying support from the device.
>>>>
>>>> When you have passthrough member devices, they are not trapped or
>>> emulated.
>>>> The future hypervisor must not be able to see things of cvq, datavq or
>>> addresses programmed by the guest.
>>>> And hence the infrastructure is geared towards such approach.
>>> I failed to read the full context as you cut off them. I can't even read your
>>> original questions, they are truncated.
>>>
>>> Anyway, lets migrate device without device-context first.
>> Passthrough device cannot migrate without device-context as listed.
> so please define the device context.
>>>>> And the parameters set to config vq are also device context as we
>>>>> discussed for many times.
>>>>>>> Or do you want to migrate non-virtio context?
>>>>>> Every thing is virtio device context.
>>>>> see above
>>>>>>>>> control vq which is already done, that is basic virtualization.
>>>>>>>> There is nothing like "basic virtualization".
>>>>>>>> What is proposed here is fulfilling the requirement of passthrough mode.
>>>>>>>>
>>>>>>>> Your comment is implying, "I don’t care for passthrough
>>>>>>>> requirements, do
>>>>>>> non_passthrough".
>>>>>>> that is your understanding, and you misunderstood it. Config space
>>>>>>> servers passthrough for many years.
>>>>>> "Config space servers" ?
>>>>>> I do not understand it, can you please explain what does that mean?
>>>>>>
>>>>>> I do not see your suggestion on how one can implement passthrough
>>>>>> member
>>>>> device when passthrough device does the dma and migration framework
>>>>> also need to do the dma.
>>>>> Try pass through a virtio device to a guest and learn how the guest
>>>>> take advantage the config space before you comment.
>>>> Right. It does not work. The guest is doing the device_reset and flr.
>>>> Hence, it is resetting everything. All the dirty page log is lost.
>>>> All the device context is lost.
>>>> Hypervisor didn’t see any of this happening, because it didn’t do the trap.
>>>>
>>>> Look, if you are going to continue to argue that you must do trap +
>>>> emulation and don’t talk about passthrough, Please stop here, because
>>> discussion won't go anywhere.
>>>> I made my best to answer the limitations in very first email where you asked.
>>> OK, I see the gap, and I am sure we can help you here.
>>> Try consider a question:
>>> how do you define pass-through?
>> As defined in the cover letter and theory of operation.
>> Repeat here:
>> A device whose virtio interfaces are not intercepted by VMM.
>> In future, may be even MSI-X and MSI-X_v2 or newer interrupt method will be passthrough at device level too.
>> (only cpu level interrupt remapping will be hypercall at interrupt controller level).
>>
>> A PCI spec defined config space to stay as emulated as it is generic and not supposed to have any virtio specific things in it as directed by the PCI-SIG.
> how guest access device config space in your "passthrough"?
>>> Can a guest access the device without a host driver helper?
>> Yes for all the virtio interfaces which includes, virtio device common and device config space, cvq, data vq, flow filter vqs, shared memory and anything new of the future.
> interesting, if so, let me ask you a question, Is a guest privileged 
> to access any devices on the host?
>>>>>> That basic facility is missing dirty page tracking, P2P support,
>>>>>> device context,
>>>>> FLR, device reset support.
>>>>>> Hence, it is unusable right now for passthrough member device.
>>>>>> And 6th problematic thing in it is, it does not scale with member devices.
>>>>> Please refer to previous discussions, it is meaningless if you keep
>>>>> ignoring our answers and keep asking the same questions.
>>>> Again, please re-read, I didn’t ask the question.
>>>> I replied 6 problems that are not solved.
>>> I believe we have answered for many times. The questions are cut off again, but
>>> how about search for previous answers?
>>>>>>>>> If you want to migrate device context, you need to specify device
>>>>>>>>> context for every type of device, net maybe easy, how do you see virtio-
>>> fs?
>>>>>>>> Virtio-fs will have its own device context too.
>>>>>>>> Every device has some sort of backend in varied degree.
>>>>>>>> Net being widely used and moderate complex device.
>>>>>>>> Fs being slightly stateful but less complex than net, as it has
>>>>>>>> far less control
>>>>>>> operations.
>>>>>>> so, do you say you have implement a live migration solution which
>>>>>>> can migrate device context, but only work for net or block?
>>>>>> I don’t think this question about implementation has any relevance.
>>>>>> Frankly feels like a court to me. :( No. I didn’t say that.
>>>>>> We have implemented net, fs, block devices and single framework
>>>>>> proposed
>>>>> here can support all 3 and rest 28+.
>>>>>> The device context part in this series do not cover special/optional
>>>>>> things of
>>>>> all the device type.
>>>>>> This is something I promised to do gradually, once the framework looks
>>> good.
>>>>> If you don't define them, only talking about "migrate the device
>>>>> context" but don't tell us what do migrate, does this make sense to anybody?
>>>>>>> Then you should call it virtio net/blk migration and implement in
>>>>>>> net/block section.
>>>>>> No. you misunderstood. My point was showing orthogonal complexities
>>>>>> of net
>>>>> vs fs.
>>>>>> I likely failed to explain that.
>>>>> see above, anyway you need to define them, how about starting from virtio
>>> FS?
>>>>>>>> In fact virtio-fs device already discusses the migrating the
>>>>>>>> device side state, as
>>>>>>> listed in device context.
>>>>>>>> So virtio-fs device will have its own device-context defined.
>>>>>>> if you want to migrate it, you need to define it
>>>>>> Sure.
>>>>>> Only device specific things to be defined in future.
>>>>> Now, not future if you want to migrate device context.
>>>> It is not mandatory, and it is impractical do everything in one series.
>>>> It is planned for 1.4.
>>> really, you want to define device context for every device type?
>>>
>> Yes.
>>   
>>> Remember don't migrate device-context before you define them or how can
>>> the HW implementations know how to do.
>> I disagree. The infrastructure is defined. And incrementally device context will also be defined.
>> See an example work from Michael, i.e. admin command and aq generic facility is defined.
>> And device migration is able to utilize it incrementally. The lower layer fulfill the requirements.
>> This is exactly what is done here.
>>
>> Device context framework is defined and many device spec owners will easily be able to define their device context making it migratable.
> see above answers for device context.
>>>>>> Rest is already present.
>>>>>> We are not going to define all the device context in one patch
>>>>>> series that no
>>>>> one can review reliably.
>>>>>> It will be done incrementally.
>>>>> so you agree at least for now we should migrate stateless devices, right?
>>>>>> But the feedback, I am taking is, we need to add a command that
>>>>>> indicates
>>>>> which TLVs are supported in the device migration.
>>>>>> So virtio-fs or other device migration capabilities can be discovered.
>>>>>> I will cover this in v2.
>>>>> so you propose a solution as "virtio migration", but only migrate
>>>>> selective types of devices?
>>>>> You should rename it to be "virtio-net live migration".
>>>> Sorry, I wont. Because infrastructure is for majority device types.
>>>>
>>>> Which field did you observe which is net specific?
>>>> We want to cover all the device types.
>>>> Don’t need to cook their context in one series.
>>> so, not work for all device types? limited to some specific types?
>>> you still need to rename it what ever.
>> No. framework works for all device types.
> without defining them?
>>>>>> Thanks a lot for this thoughts.
>>>>>>
>>>>>>>> The infrastructure and basic facilities are setup in this series,
>>>>>>>> that one can
>>>>>>> easily extend for all the current and new device types.
>>>>>>> really? how?
>>>>>>>>> And we are migrating stateless devices, or no? How do you migrate
>>>>>>>>> virtio-
>>>>> fs?
>>>>>>>>>> 2. sharing such large context and write addresses in parallel
>>>>>>>>>> for multiple devices cannot be done using single register file
>>>>>>>>> see above
>>>>>>>>>> 3. These registers cannot be residing in the VF because VF can
>>>>>>>>>> undergo FLR, and device reset which must clear these registers
>>>>>>>>> do you mean you want to audit all PCI features? When FLR, the
>>>>>>>>> device is reset, do you expect a device remember anything after FLR?
>>>>>>>> Not at all. VF member device will not remember anything after FLR.
>>>>>>>>> Do you want to trap FLR? Why?
>>>>>>>> This proposal does _not_ want to trap the FLR in the hypervisor virtio
>>> driver.
>>>>>>>> When one does the mediation-based design, it must
>>>>>>>> trap/emulate/fake the
>>>>>>> FLR.
>>>>>>>> It helps to address the case of nested as you mentioned.
>>>>>>> once passthrough, the guest driver can access the config space to
>>>>>>> reset the device, right?
>>>>>>>>> Why FLR block or conflict with live migration?
>>>>>>>> It does not block or conflict.
>>>>>>> OK, cool, so let's make this a conclusion
>>>>>>>> The whole point is, when you put live migration functionality on
>>>>>>>> the VF itself,
>>>>>>> you just cannot FLR this device.
>>>>>>>> One must trap the FLR and do fake FLR and build the whole
>>>>>>>> infrastructure to
>>>>>>> not FLR The device.
>>>>>>>> Above is not passthrough device.
>>>>>>> No, the guest can reset the device, even causing a failed live migration.
>>>>>> Not in the proposal here.
>>>>>> Can you please prove how in the current v1 proposal, device reset
>>>>>> will fail the
>>>>> migration?
>>>>>> I would like to fix it.
>>>>> if the device is reset, it forgets everything right?
>>>> Right. This is why all dirty page track; device context is lost on device reset.
>>>> Hence, the controlling function and controlled function are two different
>>> entities.
>>> so there can be inconsistent migrations and races, right? And if the guest reset
>>> the device, actually the hypervisor should let it be, right?
>> No. it should not be in because hypervisor has not composed the member device. It is in the hw controlled function itself.
> interesting, do you mean when the guest reset the device, the 
> hypervisor should refuse?
>
> This actually conflicts with your definition of "passthrough" as
> "not intercepted by VMM". So you actually understand trap and emulate
> and passthrough.
>>>>>>>>>> 4. When VF does the DMA, all dma occurs in the guest address
>>>>>>>>>> space, not in
>>>>>>>>> hypervisor space; any flr and device reset must stop such dma.
>>>>>>>>>> And device reset and flr are controlled by the guest (not
>>>>>>>>>> mediated by
>>>>>>>>> hypervisor).
>>>>>>>>> if the guest reset the device, it is totally reasonable
>>>>>>>>> operation, and the guest own the risk, right?
>>>>>>>> Sure, but the guest still expects its dirty pages and device
>>>>>>>> context to be
>>>>>>> migrated across device_reset.
>>>>>>>> Device_reset will lose all this information within the device if
>>>>>>>> done without
>>>>>>> mediation and special care.
>>>>>>> No, if the guest reset a device, that means the device should be
>>>>>>> RESET, to forget its config, that would be really weird to migrate
>>>>>>> a fresh device at the source side, to be a running device at the
>>>>>>> destination
>>>>> side.
>>>>>> Device reset not doing the role of reset is just a plain broken spec.
>>>>> why? The reset behavior is well defined in the spec, and works fine for years.
>>>> So any new construct that one adds, it will be reset as well and dirty page
>>> track is lost.
>>> Yes and do you want to prevent that? You may surprise the guest.
>> Yes, want to prevent that.
>> Not sure what you mean by surprise the guest. Unlikely.
>> Why because guest did the reset, it knows what it is doing.
>> (Keep in mind that guest does not expect to lose its dirty pages).
> Shocked...
>
> This statement conflicts with basic virtualization.
>>>>>>>> So, to avoid that now one needs to have fake device reset too and
>>>>>>>> build that
>>>>>>> infrastructure to not reset.
>>>>>>>> The passthrough proposal fundamental concept is:
>>>>>>>>
>>>>>>>> all the native virtio functionalities are between guest driver and
>>>>>>>> the actual
>>>>>>> device.
>>>>>>> see above.
>>>>>>>>> and still, do you want to audit every PCI features? at least you
>>>>>>>>> didn't do that in your series.
>>>>>>>> Can you please list which PCI features audit you are talking about?
>>>>>>> you audit FLR, then do you want to check everyone?
>>>>>>> If no, how to decide which one should be audited, why others not?
>>>>>> I really find it hard to follow your question.
>>>>>>
>>>>>> I explained in patch 5 and 8 about interactions with the FLR and its support.
>>>>>> Not sure what you want me to check.
>>>>>>
>>>>>> You mentioned that "I didn’t audit every PCI features"? So can you
>>>>>> please list
>>>>> which one and in relation to which admin commands?
>>>>> Your job to audit everyone if you talk about FLR. Because FLR is PCI
>>>>> spec, not virtio, you need to explain why other PCI features not need to be
>>> audited.
>>>> Sure, but when you point a finger that I didn’t audit, please mention what is not
>>> audited.
>>> well, we are migrating virtio devices, but you keep talking PCI, so do you want to
>>> take every PCI functionalities into considerations?
>> For pci transport, yes.
> First, that is out of virtio spec.
> Second, if so, you should audit every pci feature, state.
> Don't say you want me to define them, this is your statement.
>>>>> We have explained why FLR is not a concern for many times, and I
>>>>> don't want to repeat, please refer to previous discussions.
>>>> You seem to ignore the first paragraph of theory of operation that FLR is not
>>> trapped.
>>> this is the guest issue FLR, right? If so the guest owns the risks and the
>>> hypervisor should not prevent that.
>> Exactly, hypervisor do not prevent it.
>> The owner device still has the ownership to not lose previously logged dirty pages addresses.
>> And device still need to report device reset occurred, so that destination side can wipe off and start fresh.
> OK, so you know the answer now. This answers your own question above.
>>>>>>>> Keep in mind, that will all the mediation, one now must equally
>>>>>>>> audit all this
>>>>>>> giant software stack too.
>>>>>>>> So maybe it is fine for those who are ok with it.
>>>>>>> so you agree FLR is not a problem, at least for config space solution?
>>>>>> I don’t know what you mean "FLR is not a problem".
>>>>>>
>>>>>> FLR on the VF must work as it works without live migration for
>>>>>> passthrough
>>>>> device as today.
>>>>>> And admin commands have some interactions with it.
>>>>>> And this proposal covers it.
>>>>>> I am missing some text that Michael and Jason pointed out.
>>>>>> I am working on v2 to annotate or better word them.
>>>>> When guest reset the device, the device should be reset for sure.
>>>>> then it forgets everything, how do you expect the reset-ed device still work
>>> for live migration?
>>>>> is it a race?
>>>> I don’t expect live migration to work at all with such an approach.
>>>> This is why in my proposal live migration occurs on the owner device, while
>>> controlled function (member device) is undergoing the device reset.
>>> see above
>>>>>>>>> For migration, you know the hypervisor takes the ownership of the
>>>>>>>>> device in the stop_window.
>>>>>>>> I do not know what stop_window means.
>>>>>>>> Do you mean stop_copy of vfio or it is qemu term?
>>>>>>> when guest freeze.
>>>>>>>>>> 5. Any PASID to separate out admin vq on the VF does not work
>>>>>>>>>> for two
>>>>>>>>> reasons.
>>>>>>>>>> R_1: device flr and device reset must stop all the dmas.
>>>>>>>>>> R_2: PASID by most leading vendors is still not mature enough
>>>>>>>>>> R_3: One also needs to do inversion to not expose PASID
>>>>>>>>>> capability of the member PCI device to not expose
>>>>>>>>> see above and what if guest shutdown? the same answer, right?
>>>>>>>> Not sure, I follow.
>>>>>>>> If the guest shutdown, the guest specific shutdown APIs are called.
>>>>>>>>
>>>>>>>> With passthrough device, R_1 just works as is.
>>>>>>>> R_3 is not needed as they are directly given to the guest.
>>>>>>>> R_2 platform dependency is not needed either.
>>>>>>> I think we already have a concussion for FLR.
>>>>>> I don’t have any concussion.
>>>>>> I wrote what to be supported for the FLR above.
>>>>> OK, again, our discussions have been ignored again, and all start over again.
>>>>>
>>>>> Would you please read our previous discussions?
>>>> You asked the question about why it wont work, I answered.
>>>> I don’t see a point of debating same thing over again.
>>> Is that cut off again?
>>>
>> No it is not cut off here.
>>
>>> if still about FLR, so please see above comments.
>>> And I agree if the answers are ignored again, we don't need to repeat.
>> I didn’t ask questions. Please re-read.
>>
>>>>>>> For PASID, what blocks the solution?
>>>>>> When the device is passthrough, PASID capabilities cannot be emulated.
>>>>>> PASID space is owned fully by the guest.
>>>>>>
>>>>>> There is no single known cpu vendor support splitting pasid between
>>>>> hypervisor and guest.
>>>>>> I can double check, but last I recall that Linux kernel removed such
>>>>>> weird
>>>>> support.
>>>>> do you know there is something called vIOMMU?
>>>> Probably yes.
>
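
For readers following the flow-filter sizing argument quoted above (1 million flows at 64B each), a minimal sketch of the arithmetic follows. The flow count and flow size come from the thread; the 4-byte access width is an assumption made only to get a rough count of register reads, and the sketch does not model any windowing or DMA mechanism.

```python
FLOWS = 1_000_000   # "1 million flows", as stated in the thread
FLOW_BYTES = 64     # "Each flow is 64B in size", as stated in the thread
REG_BYTES = 4       # assumption: one 32-bit register read per access


def total_bytes(flows=FLOWS, flow_bytes=FLOW_BYTES):
    """Total size of the flow state to transfer, in bytes."""
    return flows * flow_bytes


def register_reads(nbytes, reg_bytes=REG_BYTES):
    """Number of fixed-width register reads needed to move nbytes (rounded up)."""
    return -(-nbytes // reg_bytes)


size = total_bytes()
print(size // 10**6, "MB")                      # prints: 64 MB
print(register_reads(size), "register reads")  # prints: 16000000 register reads
```

Under that assumed access width, moving the quoted 64MB takes on the order of 16 million register accesses per device, which is the scale concern being raised.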


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  6:35                                 ` Zhu, Lingshan
@ 2023-10-18  6:41                                   ` Parav Pandit
  2023-10-18  6:52                                     ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  6:41 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, October 18, 2023 12:06 PM
> 
> On 10/18/2023 1:02 PM, Parav Pandit wrote:
> >
> >> From: virtio-comment@lists.oasis-open.org
> >> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu, Lingshan
> >> Sent: Monday, October 16, 2023 3:18 PM
> >>
> >> On 10/13/2023 7:54 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Friday, October 13, 2023 3:14 PM
> >>>>>>>> How do you transfer the ownership?
> >>>>>>> An additional ownership delegation by a new admin command.
> >>>>>> if you think this can work, do you want to cook a patch to
> >>>>>> implement this before you submitting this live migration series?
> >>>>> I answered this already above.
> >>>> talk is cheap, show me your patch
> >>> Huh. We presented the infrastructure that migrates, 30+ device
> >>> types,
> >> covering device context ideas from Oracle.
> >>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
> >>>
> >>> Please have some respect for other members who covered more ground
> >>> than
> >> your series.
> >>> What more? Apply the same nested concept on the member device as
> >> Michael suggested; it is nested virtualization that maintains the exact same semantics.
> >>> So a VF is mapped as PF to the L1 guest.
> >>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
> >>>
> >>> This nested work can be extended in future, once first level nesting
> >>> is
> >> covered.
> >>>> Answer all questions above, if you think a management VF can work,
> >>>> please show me your patch.
> >>> The idea evolves from technical debate rather than pointing fingers like
> >>> your
> >> comment.
> >>> I think a positive discussion with Michael and a pointer to the
> >>> paper from
> >> Jason gave a good direction of doing _right_ nesting that follows two
> principles.
> >>> a. efficiency property
> >>> b. equivalence property
> >>>
> >>> (c. resource control is natural already)
> >>>
> >>> Both apply at VMM and at VM level enabling recursive virtualization,
> >>> by
> >> having VF that can act as PF inside the guest.
> >>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
> >> Please just show me your patch resolving these opens, how about starting
> >> from defining virtio-fs device context and your management VF?
> > As answered, device context infrastructure is done, per device specific device-
> context will be defined incrementally.
> > I will not be including virtio-fs in this series. It will be done incrementally in
> future utilizing the infrastructure built in this series.
> Done? How do you conclude this? You just tell me what is the full set of virtio-fs
> device context now and how to migrate them.
> 
> You can't? You refuse or you don't? Do you expect the HW designer to figure it out
> by themselves?
I won’t be able to tell now, as I don’t think it is necessary for this series.
If one out of 30 devices cannot migrate because an unimaginable amount of complexity has been placed there, maybe one will not implement it as a member device.

From experience with migratable complex GPU devices and RDMA devices (stateful, with a hundred thousand stateful QPs), my understanding is that the complex state of virtio-fs can be defined and made migratable.
The mlx5 driver consists of 150,000 lines of code, and that device is migratable with complex state.
So I am optimistic that virtio-fs can be migratable too.
It does not have to be limited by my limited creativity of 2023.
Maybe I am wrong; in that case one will not implement a passthrough virtio-fs device.

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  6:39                                 ` Zhu, Lingshan
@ 2023-10-18  6:42                                   ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  6:42 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

[-- Attachment #1: Type: text/plain, Size: 30068 bytes --]

I kept the original format now for you to see that this email is not in the plain text format.

From: Zhu, Lingshan <lingshan.zhu@intel.com>
Sent: Wednesday, October 18, 2023 12:09 PM
To: Parav Pandit <parav@nvidia.com>; Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>; virtio-comment@lists.oasis-open.org; cohuck@redhat.com; sburla@marvell.com; Shahaf Shuler <shahafs@nvidia.com>; Maor Gottlieb <maorg@nvidia.com>; Yishai Hadas <yishaih@nvidia.com>
Subject: Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration

resend as Parav requested. This mail format looks fine at my side
On 10/18/2023 2:32 PM, Zhu, Lingshan wrote:

On 10/18/2023 1:00 PM, Parav Pandit wrote:

From: Zhu, Lingshan <lingshan.zhu@intel.com><mailto:lingshan.zhu@intel.com>

Sent: Monday, October 16, 2023 3:14 PM

On 10/13/2023 7:28 PM, Parav Pandit wrote:

From: Zhu, Lingshan <lingshan.zhu@intel.com><mailto:lingshan.zhu@intel.com>

Sent: Friday, October 13, 2023 2:36 PM

[..]

Because it does not work for passthrough mode.

what are you talking about?

Config space does not work passthrough?

Once the register space of the VF that is supposed to be used by the live

migration is passed to the guest, it is under guest control.

Hence, live migration driver won't be able to use it.

Does guest control device status to reset itself? harmful?

No. it is not harmful.

Is owner device resetting itself, harmful? No.

Is member device resetting itself, harmful? No.

Should member device reset
good

These facilities can be trapped and emulated, even the feature bits, right?

You know the guest actually doesn't directly access the device config space; there is

a vfio/vdpa driver, right?

You can practically trap and emulate everything.

If you continue to ignore passthrough requirements and keep repeating "do trap and emulate", this discussion will not go anywhere.
Clearly I did not ignore passthrough, and I have answered your question many times.
Maybe you didn't get it, so I would ask how you define your "passthrough"?

You may find that the guest vCPUs(guest vRC) actually are not privileged to access to
host pci device(host CPU RC), that's why a pass-through driver like vfio is a must.

Therefore the device config space can be trapped. Is that clear now?

Have you ever tried pass through a virtio device to a guest?

:)

Please explain how the question is relevant to this discussion in separate

thread, so that one can keep technical focus.

(Please keep your discussion technical, instead of derogatory to other

members).

if you want me to answer your question, at least you SHOULD NOT cut off the

context, or you are trying to confuse everyone.

Or did you try to avoid or hide anything? I am not sure this is a good practice.

The context in last discussion is:

me: OK, I pop-ed Jason's proposal to make everything easier, and I see it is

refused.

you: Because it does not work for passthrough mode.

me: what are you talking about?

     Config space does not work passthrough?

     Have you ever tried pass through a virtio device to a guest?

So I ask you to try passing through a virtio-pci device to a guest, then check

whether the config space works for pass-through mode.

again, don't cut off threads before the discussion is closed.

Let me repeat again, these live migration facilities are

per-device(per-VF) facility, so it only migrates itself.

Since they are per device (per VF), they reside in the guest VM. Hence, VMM

cannot live migrate it.

you know the config space can be trapped and emulated, and the hypervisor

takes the ownership of the device once the guest freeze in the stop window.

When you say config space, do you mean PCI config space of 4K size?
you can take an example of virtio common config cap.

And for pass through, you can try passing through a virtio device to a

guest, see how the guest initializes the device through the config space.

That is really basic virtualization, not hard to test.

Repeated points, I am omitting.

ok, if you get it, let's close it.

inflight descriptor tracking will be implemented by Eugenio in V2.

When we have near complete proposal from two device vendors, you

want to push something to unknown future without reviewing the

work; does not

make sense.

Didn't I ever provide feedback to you? Really?

No. I didn’t see why you need to post a new patch for dirty page

tracking,

when it is already present in this series.

This is plain ignorance and shows non_cooperative mode of working in

technical committee.

you have cut off the thread again, so I can't read the context.

Enjoy long threads. 😊
you skip finished discussions for sure, but don't do that to on-going discussions.

I would like to understand and review this aspects.

Same for the device context.

you will see dirty page tracking in my V2, as I repeated for many times.

Since you are not co-operative, I have less sympathy to see V2.

I don’t see a reason to see when, it is fully presented here.

Again, please don't take it personal and please be professional.

Speaking of collaboration, please at least respect others' time and answers.

Both Jason and I have responded to you multiple times on the same

questions(for example, FLR, nested, passthrough).

If our answers are ignored again and again, and then after a few days or hours

you come back asking the same question again, what's the point?

I didn’t ask questions in area of FLR and passthrough, please check again.
OK, then please don't force us answer the same questions anymore, for example no FLR anymore.

And please don't cut off any threads before we close the discussion.

For device context, we have discussed this in other threads, did you

ignored that again?

No. I didn’t. I replied that the generic infrastructure is built that enables every

device type to migrate by defining their device context.

don't we have a conclusion there or did you miss anything? Since you refuse to

define device context for every device type, how do you migrate stateful

devices?

So we should implement a stateless live migration solution, right?

No. device context is basic facility that intent to cover most virtio devices.
not most, instead it should be "all", if you are implementing a virtio live migration.

I did not refuse to define context.

I said, device context will be incrementally defined subsequently.
just define what we have now, for example, define virtio-fs as it is now.

Like Michael said, I expect every device to define device context section in coming months for 1.4 time frame.
as MST said, do you expect the implementation to figure out the device context by themselves?
If you want to migrate device context, you should define them.

Hint: how do you define device context for every device type, e.g, virtio-fs.

Don't say you only migrate virtio-net or blk.

I didn’t say it. I said to migrate all 30+ device types.

And infrastructure is presented here.

so please define device context for all the devices.

how about starting from virtio-fs?

Should be done incrementally.
show me your patch

You are still in the mode of _take_ what we did with near zero

explanation.

You asked question of why passthrough proposal cannot advantage of

in_band

config registers.

I explained technical reason listed here.

I have answered the questions, and asked questions for many times.

What do you mean by "why passthrough proposal cannot advantage of

in_band config registers."?

Config space work for passthrough for sure.

Config space registers are passthrough the guest VM.

Hence hypervisor messing it with, programming some address would

result in

either security issue.

Or functionally broken, to sustain the functionality, each nested

layer needs

one copy of these registers for each nest level.

So they must be trapped somehow.

trap and emulated are basic virtualization.

Not for passthrough devices, sorry.

See the paper that Jason pointed out.

The control program/VMM trap is involved only on the privileged operations of

the VMM.

Virtio cvqs, virtio registers are not the privileged operation of the VMM,

because they are of the native virtio device itself.

Period.

since the context is cut off again, I failed to read the context.

But config space can be trapped and emulated, right?

Answered above.

When guest accessing device config space, actually it access the hypervisor-

presented config space.

Secondly I don’t see how one can read 1M flows using config registers.

Not sure what you are talking about, beyond the spec?

The spec which is under works for few months by multiple technical

members.

Please subscribe to virtio-comment mailing list.

How come you changed your point from cvq to different argument of out

of spec? :)

I mean, what is your 1M flows? is it beyond spec?

No. it is not beyond the spec.

It is the spec in work for several months by multiple device, OS and cloud operators.
then, again, what is your 1M flow? if not defined in the spec, then it is beyond spec.

So please don’t jump to conclusions before finishing the

discussion on how

both side can take advantage of each other.

Lets please do that.

We have proposed a solution, right?

Which one? To do something in future?

I don’t see a suggestion on how one can use device context and dirty

page

tracking for nested and passthrough uniformly.

I see a technical difficulty in making both work with uniform interface.

Please don't ignore previous answers, don't force us repeat again and again.

You didn’t answer, how.

Your answer was "you will post dirty page tracking without reviewing current"

and Eugenio will post v2....

Yes, will do. and you can check the patch when it posted.

Does not make sense to me at all.
tracking dirty pages does not make sense to you?

Eugenio will cook a patch for in-flight descriptors, not dirty page, that is mine.

It is Jason's proposal. Please refer to previous threads, also for

device context and dirty pages.

I still need to point out: admin vq LM does not work, one example is

nested.

As Michael said, please don’t confuse between admin commands and

admin

vq.

anyway, admin vq live migration don't work for nested.

I am convinced by the paper that Jason pointed out.

A nested solution involves a member device supporting the nesting without

trap and emulation so that it follows the two properties:

The efficiency property and equivalence property.

Hence a member device which wants to support nested case, should present

itself with attributes to support nesting.

failed to process the sentence, but I am glad you are convinced by the paper.

There are no scale problem as I repeated for many time, they are

per-device basic facilities, just migrate the VF by its own

facility, so there are no 40000 member devices, this is not per PF.

I explained that device reset, flr etc flow cannot work when

controlling and

controlled functions are single entity for passthrough mode.

The scale problem is, one needs to duplicate the registers on each VF.

The industry is moving away from the register interface in many

_real_ hw

devices implementation.

Some of the examples are IMS, SIOV, NVMe and more.

we have discussed this for many times, please refer to previous

threads, even with Jason.

I do not agree for any registers to add to the VF which are reset on

device_reset and FLR.

As it does not work for passthrough mode.

Jason has answered your these FLR questions for many times, I don't

want to repeat his words, even myself have answered many times. If

you keep ignoring the answers, and ask again and again, what is the point?

So please refer to the previous threads.

I don’t think I asked the question above. Please re-read.

you cut it off again, what question? if about FLR, I believe Jason has answered

for many times.

Again, please read. I didn’t ask the question for FLR.

You keep saying "what question".
I failed to read the context because you have cut them off.

The device context can be read from config space or trapped, like

shadow

There are 1 million flows of the net device flow filters in progress.

Each flow is 64B in size.

Total size is 64MB.

I don’t see how one can read such amount of memory using config

registers.
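
The sizing argument above can be checked with a quick back-of-the-envelope calculation. This is only an illustrative sketch: the 1 million flows and 64 B per flow come from the message above, while the 4-byte register access width and all names in the code are assumptions made here for illustration, not something the thread or the spec states.

```python
# Back-of-the-envelope sizing for the flow filter context discussed above.
FLOW_COUNT = 1_000_000      # from the thread: 1 million flows
FLOW_SIZE_BYTES = 64        # from the thread: 64 B per flow
REGISTER_WIDTH_BYTES = 4    # assumption: one 32-bit register read per access


def context_size_bytes(flows: int, flow_size: int) -> int:
    """Total bytes of flow state that would have to be transferred."""
    return flows * flow_size


def register_reads_needed(total_bytes: int, width: int) -> int:
    """Number of fixed-width register reads needed (ceiling division)."""
    return -(-total_bytes // width)


total = context_size_bytes(FLOW_COUNT, FLOW_SIZE_BYTES)
reads = register_reads_needed(total, REGISTER_WIDTH_BYTES)

print(f"context size: {total} bytes ({total / 10**6:.0f} MB)")
print(f"register reads at {REGISTER_WIDTH_BYTES} bytes each: {reads}")
```

Under these assumptions the flow state alone needs millions of individual register accesses, which is the scale concern raised in the message.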

control vq?

The control vq and flow filter vqs are owned by the guest driver,

not the

hypervisor.

So no, cvq cannot be used.

first, don't cut off the threads, don't delete words, that really confusing

readers.

Your comments are so long that it is hard to follow such a long thread.

Hence only the related comments are kept.

But I understand, will try to avoid.

And I think you misunderstand a lot of virtualization fundamentals,

at least have a look at how shadow control vq works.

In case if you don’t know, the shadow cvq acceleration for Nvidia ConnectX6-

DX is done jointly with Dragos and me, with recent patches from Sie-Wei.

I don’t think so I missed.

Shadow vq is great when you don’t have underlying support from the device.

When you have passthrough member devices, they are not trapped or

emulated.

The future hypervisor must not be able to see things of cvq, datavq or

addressed programmed by the guest.

And hence the infrastructure is geared towards such approach.

I failed to read the full context as you cut off them. I can't even read your

original questions, they are truncated.

Anyway, lets migrate device without device-context first.

Passthrough device cannot migrate without device-context as listed.
so please define the device context.

And the parameters set to config vq are also device context as we

discussed for many times.

Or do you want to migrate non-virtio context?

Every thing is virtio device context.

see above

control vq which is already done, that is basic virtualization.

There is nothing like "basic virtualization".

What is proposed here is fulfilling the requirement of passthrough mode.

Your comment is implying, "I don’t care for passthrough

requirements, do

non_passthrough".

that is your understanding, and you misunderstood it. Config space

servers passthrough for many years.

"Config space servers" ?

I do not understand it, can you please explain what does that mean?

I do not see your suggestion on how one can implement passthrough

member

device when passthrough device does the dma and migration framework

also need to do the dma.

Try pass through a virtio device to a guest and learn how the guest

take advantage the config space before you comment.

Right. It does not work. The guest is doing the device_reset and flr.

Hence, it is resetting everything. All the dirty page log is lost.

All the device context is lost.

Hypervisor didn’t see any of this happening, because it didn’t do the trap.

Look, if you are going to continue to argue that you must do trap +

emulation and don’t talk about passthrough, Please stop here, because

discussion won't go anywhere.

I made my best to answer the limitations in very first email where you asked.

OK, I see the gap, and I am sure we can help you here.

Try consider a question:

how do you define pass-through?

As defined in the cover letter and theory of operation.

Repeat here:

A device whose virtio interfaces are not intercepted by VMM.

In future, may be even MSI-X and MSI-X_v2 or newer interrupt method will be passthrough at device level too.

(only cpu level interrupt remapping will be hypercall at interrupt controller level).

A PCI spec defined config space to stay as emulated as it is generic and not supposed to have any virtio specific things in it as directed by the PCI-SIG.
how guest access device config space in your "passthrough"?

Can a guest access the device without a host driver helper?

Yes for all the virtio interfaces which includes, virtio device common and device config space, cvq, data vq, flow filter vqs, shared memory and anything new of the future.
interesting, if so, let me ask you a question, Is a guest privileged to access any devices on the host?

That basic facility is missing dirty page tracking, P2P support,

device context,

FLR, device reset support.

Hence, it is unusable right now for passthrough member device.

And 6th problematic thing in it is, it does not scale with member devices.

Please refer to previous discussions, it is meaningless if you keep

ignoring our answers and keep asking the same questions.

Again, please re-read, I didn’t ask the question.

I replied 6 problems that are not solved.

I believe we have answered for many times. The questions are cut off again, but

how about search for previous answers?

If you want to migrate device context, you need to specify device

context for every type of device, net maybe easy, how do you see virtio-

fs?

Virtio-fs will have its own device context too.

Every device has some sort of backend in varied degree.

Net being widely used and moderate complex device.

Fs being slightly stateful but less complex than net, as it has

far less control

operations.

so, do you say you have implement a live migration solution which

can migrate device context, but only work for net or block?

I don’t think this question about implementation has any relevance.

Frankly feels like a court to me. :( No. I didn’t say that.

We have implemented net, fs, block devices and single framework

proposed

here can support all 3 and rest 28+.

The device context part in this series do not cover special/optional

things of

all the device type.

This is something I promised to do gradually, once the framework looks

good.

If you don't define them, only talking about "migrate the device

context" but don't tell us what do migrate, does this make sense to anybody?

Then you should call it virtio net/blk migration and implement in

net/block section.

No. you misunderstood. My point was showing orthogonal complexities

of net

vs fs.

I likely failed to explain that.

see above, anyway you need to define them, how about starting from virtio

FS?

In fact virtio-fs device already discusses the migrating the

device side state, as

listed in device context.

So virtio-fs device will have its own device-context defined.

if you want to migrate it, you need to define it

Sure.

Only device specific things to be defined in future.

Now, not future if you want to migrate device context.

It is not mandatory, and it is impractical do everything in one series.

It is planned for 1.4.

really, you want to define device context for every device type?

Yes.

Remember don't migrate device-context before you define them or how can

the HW implementations know how to do.

I disagree. The infrastructure is defined. And incrementally device context will also be defined.

See an example work from Michael, i.e. admin command and aq generic facility is defined.

And device migration is able to utilize it incrementally. The lower layer fulfill the requirements.

This is exactly what is done here.

Device context framework is defined and many device spec owners will be able to easily define their device context, making it migratable.
see above answers for device context.

Rest is already present.

We are not going to define all the device context in one patch

series that no

one can review reliably.

It will be done incrementally.

so you agree at least for now we should migrate stateless devices, right?

But the feedback, I am taking is, we need to add a command that

indicates

which TLVs are supported in the device migration.

So virtio-fs or other device migration capabilities can be discovered.

I will cover this in v2.

so you propose a solution as "virtio migration", but only migrate

selective types of devices?

You should rename it to be "virtio-net live migration".

Sorry, I wont. Because infrastructure is for majority device types.

Which field did you observe which is net specific?

We want to cover all the device types.

Don’t need to cook their context in one series.

so, not work for all device types? limited to some specific types?

you still need to rename it what ever.

No. framework works for all device types.
without defining them?

Thanks a lot for these thoughts.

The infrastructure and basic facilities are setup in this series,

that one can

easily extend for all the current and new device types.

really? how?

And we are migrating stateless devices, or no? How do you migrate

virtio-

fs?

2. sharing such large context and write addresses in parallel

for multiple devices cannot be done using single register file

see above

3. These registers cannot be residing in the VF because VF can

undergo FLR, and device reset which must clear these registers

do you mean you want to audit all PCI features? When FLR, the

device is reset, do you expect a device remember anything after FLR?

Not at all. VF member device will not remember anything after FLR.

Do you want to trap FLR? Why?

This proposal does _not_ want to trap the FLR in the hypervisor virtio

driver.

When one does the mediation-based design, it must

trap/emulate/fake the

FLR.

It helps to address the case of nested as you mentioned.

once passthrough, the guest driver can access the config space to

reset the device, right?

Why FLR block or conflict with live migration?

It does not block or conflict.

OK, cool, so let's make this a conclusion

The whole point is, when you put live migration functionality on

the VF itself,

you just cannot FLR this device.

One must trap the FLR and do fake FLR and build the whole

infrastructure to

not FLR The device.

Above is not passthrough device.

No, the guest can reset the device, even causing a failed live migration.

Not in the proposal here.

Can you please prove how in the current v1 proposal, device reset

will fail the

migration?

I would like to fix it.

if the device is reset, it forgets everything right?

Right. This is why all dirty page tracking and device context are lost on device reset.

Hence, the controlling function and controlled function are two different

entities.

so there can be inconsistent migrations and races, right? And if the guest reset

the device, actually the hypervisor should let it be, right?

No. it should not be in because hypervisor has not composed the member device. It is in the hw controlled function itself.
interesting, do you mean when the guest reset the device, the hypervisor should refuse?

This actually conflict with your statement of your "passthrough" by " not intercepted by VMM". So you actually understand trap and emulate and passthrough.

4. When VF does the DMA, all dma occurs in the guest address

space, not in

hypervisor space; any flr and device reset must stop such dma.

And device reset and flr are controlled by the guest (not

mediated by

hypervisor).

if the guest reset the device, it is totally reasonable

operation, and the guest own the risk, right?

Sure, but the guest still expects its dirty pages and device

context to be

migrated across device_reset.

Device_reset will lose all this information within the device if

done without

mediation and special care.

No, if the guest reset a device, that means the device should be

RESET, to forget its config, that would be really weird to migrate

a fresh device at the source side, to be a running device at the

destination

side.

Device reset not doing the role of reset is just a plain broken spec.

why? The reset behavior is well defined in the spec, and works fine for years.

So any new construct that one adds, it will be reset as well and dirty page

track is lost.

Yes and do you want to prevent that? You may surprise the guest.

Yes, want to prevent that.

Not sure what you mean by surprise the guest. Unlikely.

Why because guest did the reset, it knows what it is doing.

(Keep in mind that guest does not expect to lose its dirty pages).
Shocked...

This statement conflict with basic virtualization.

So, to avoid that now one needs to have fake device reset too and

build that

infrastructure to not reset.

The passthrough proposal fundamental concept is:

all the native virtio functionalities are between guest driver and

the actual

device.

see above.

and still, do you want to audit every PCI features? at least you

didn't do that in your series.

Can you please list which PCI features audit you are talking about?

you audit FLR, then do you want to check everyone?

If no, how to decide which one should be audited, why others not?

I really find it hard to follow your question.

I explained in patch 5 and 8 about interactions with the FLR and its support.

Not sure what you want me to check.

You mentioned that "I didn’t audit every PCI features"? So can you

please list

which one and in relation to which admin commands?

Your job to audit everyone if you talk about FLR. Because FLR is PCI

spec, not virtio, you need to explain why other PCI features not need to be

audited.

Sure, but when you point a finger that I didn’t audit, please mention what is not

audited.

well, we are migrating virtio devices, but you keep talking PCI, so do you want to

take every PCI functionality into consideration?

For pci transport, yes.
First, that is out of virtio spec.
Second, if so, you should audit every pci feature, state.
Don't say you want me to define them, this is your statement.

We have explained why FLR is not a concern for many times, and I

don't want to repeat, please refer to previous discussions.

You seem to ignore the first paragraph of theory of operation that FLR is not

trapped.

this is the guest issue FLR, right? If so the guest owns the risks and the

hypervisor should not prevent that.

Exactly, hypervisor do not prevent it.

The owner device still has the ownership to not lose previously logged dirty pages addresses.

And device still need to report device reset occurred, so that destination side can wipe off and start fresh.
OK, so you know the answer now. This answers your own question above.
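
The owner-device behaviour described above (logged dirty page addresses survive a member reset, and the reset is reported so the destination can start fresh) can be illustrated with a toy model. This is a hypothetical sketch, not the admin command interface from the series: the class names, methods, and the `reset_seen` flag are all invented here for illustration, and real devices track far more state.

```python
class OwnerDevice:
    """Toy owner (PF): holds migration bookkeeping outside the member."""

    def __init__(self):
        self.dirty_log = {}    # member_id -> set of dirtied page addresses
        self.reset_seen = {}   # member_id -> True once a reset was reported

    def log_dirty_page(self, member_id, page_addr):
        self.dirty_log.setdefault(member_id, set()).add(page_addr)

    def note_member_reset(self, member_id):
        # The reset is recorded so the destination side can start fresh.
        self.reset_seen[member_id] = True

    def read_dirty_pages(self, member_id):
        # Return and clear the pages logged so far for this member.
        return sorted(self.dirty_log.pop(member_id, set()))


class MemberDevice:
    """Toy passthrough member (VF): a guest reset wipes its own state."""

    def __init__(self, owner, member_id):
        self.owner = owner
        self.member_id = member_id
        self.device_state = {}

    def dma_write(self, page_addr):
        # The owner, not the member, records the dirtied page.
        self.owner.log_dirty_page(self.member_id, page_addr)

    def guest_reset(self):
        # Guest-initiated reset or FLR, not trapped by the hypervisor.
        self.device_state.clear()
        self.owner.note_member_reset(self.member_id)


owner = OwnerDevice()
vf = MemberDevice(owner, member_id=1)
vf.device_state["queue_enabled"] = True
vf.dma_write(0x1000)
vf.dma_write(0x2000)
vf.guest_reset()
pages_after_reset = owner.read_dirty_pages(1)
```

In this model the member forgets everything on reset, while the pages it dirtied beforehand remain readable from the owner.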

Keep in mind, that with all the mediation, one now must equally

audit all this

giant software stack too.

So maybe it is fine for those who are ok with it.

so you agree FLR is not a problem, at least for config space solution?

I don’t know what you mean "FLR is not a problem".

FLR on the VF must work as it works without live migration for

passthrough

device as today.

And admin commands have some interactions with it.

And this proposal covers it.

I am missing some text that Michael and Jason pointed out.

I am working on v2 to annotate or better word them.

When guest reset the device, the device should be reset for sure.

then it forgets everything, how do you expect the reset-ed device still work

for live migration?

is it a race?

I don’t expect live migration to work at all with such an approach.

This is why in my proposal live migration occurs on the owner device, while

controlled function (member device) is undergoing the device reset.

see above

For migration, you know the hypervisor takes the ownership of the

device in the stop_window.

I do not know what stop_window means.

Do you mean stop_copy of vfio or it is qemu term?

when guest freeze.

5. Any PASID to separate out admin vq on the VF does not work

for two

reasons.

R_1: device flr and device reset must stop all the dmas.

R_2: PASID by most leading vendors is still not mature enough

R_3: One also needs to do inversion to not expose PASID

capability of the member PCI device to not expose

see above and what if guest shutdown? the same answer, right?

Not sure, I follow.

If the guest shutdown, the guest specific shutdown APIs are called.

With passthrough device, R_1 just works as is.

R_3 is not needed as they are directly given to the guest.

R_2 platform dependency is not needed either.

I think we already have a concussion for FLR.

I don’t have any concussion.

I wrote what to be supported for the FLR above.

OK, again, our discussions have been ignored, and everything starts over again.

Would you please read our previous discussions?

You asked the question about why it wont work, I answered.

I don’t see a point of debating same thing over again.

Is that cut off again?

No it is not cut off here.

if still about FLR, so please see above comments.

And I agree if the answers are ignored again, we don't need to repeat.

I didn’t ask questions. Please re-read.

For PASID, what blocks the solution?

When the device is passthrough, PASID capabilities cannot be emulated.

PASID space is owned fully by the guest.

There is no single known CPU vendor that supports splitting PASID between hypervisor and guest.

I can double check, but last I recall the Linux kernel removed such weird support.

do you know there is something called vIOMMU?

Probably yes.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* [virtio-comment] Re: [PATCH v1 4/8] admin: Add device migration admin commands
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 4/8] admin: Add device migration admin commands Parav Pandit
@ 2023-10-18  6:46   ` Michael S. Tsirkin
  2023-10-18  8:24     ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18  6:46 UTC (permalink / raw)
  To: Parav Pandit; +Cc: virtio-comment, cohuck, sburla, shahafs, maorg, yishaih

On Sun, Oct 08, 2023 at 02:25:51PM +0300, Parav Pandit wrote:

...

> +\paragraph{Device Context Size Get Command}
> +\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Size Get Command}
> +
> +This command returns the remaining estimated device context size. The 
> +driver can query the remaining estimated device context size
> +for the current mode or for the \field{Freeze} mode. While
> +reading the device context using VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the
> +actual device context size may differ from what is returned by
> +this command. After reading the device context using command
> +VIRTIO_ADMIN_CMD_DEV_CTX_READ, the remaining estimated context size
> +usually reduces by amount of device context read by the driver using
> +VIRTIO_ADMIN_CMD_DEV_CTX_READ command. If the device context is updated
> +rapidly the remaining estimated context size may also increase even after
> +reading the device context using VIRTIO_ADMIN_CMD_DEV_CTX_READ command.
> +
> +For the command VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET, \field{opcode} is set to 0x9.
> +The \field{group_member_id} refers to the member device to be accessed.
> +
> +\begin{lstlisting}
> +struct virtio_admin_cmd_dev_ctx_size_get_data {
> +        u8 freeze_mode;
> +};
> +\end{lstlisting}
> +
> +The \field{command_specific_data} is in the format
> +\field{struct virtio_admin_cmd_dev_ctx_size_get_data}.
> +When \field{freeze_mode} is set to 1, the device returns the estimated
> +device context size when the device will be in \field{Freeze} mode.
> +As the device context is read from the device, the remaining estimated
> +context size may decrease. For example, when the member device mode is
> +\field{Stop} and the device has an estimated total device context size
> +of 12KB, the device returns 12KB for the first
> +VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command. Once the driver has
> +read 8KB of device context data using the
> +VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the remaining data is
> +4KB, hence the device returns 4KB in the subsequent
> +VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command.
> +
> +\begin{lstlisting}
> +struct virtio_admin_cmd_dev_ctx_size_get_result {
> +        le64 size;
> +};
> +\end{lstlisting}

So we have a 64 bit size? How are we going to return so much?
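
The 12KB example in the quoted text can be sketched as a small model. This is only an illustration under assumptions: `ctx_model`, `ctx_size_get` and `ctx_read` are hypothetical names, not part of the patch, and the model ignores the case the patch also allows, where the remaining estimate grows because the device context keeps changing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a member device's context stream (not from the patch). */
struct ctx_model {
    uint64_t total;    /* estimated total device context size, in bytes */
    uint64_t consumed; /* bytes already returned by DEV_CTX_READ */
};

/* Models VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET: remaining estimated size. */
static uint64_t ctx_size_get(const struct ctx_model *m)
{
    return m->total - m->consumed;
}

/* Models VIRTIO_ADMIN_CMD_DEV_CTX_READ: returns the bytes actually handed out. */
static uint64_t ctx_read(struct ctx_model *m, uint64_t want)
{
    uint64_t left = m->total - m->consumed;
    uint64_t got = want < left ? want : left;

    m->consumed += got;
    return got;
}

/* Replays the example: 12KB total, 8KB read, 4KB remaining. */
static void ctx_example(void)
{
    struct ctx_model m = { .total = 12 * 1024, .consumed = 0 };

    assert(ctx_size_get(&m) == 12 * 1024);
    assert(ctx_read(&m, 8 * 1024) == 8 * 1024);
    assert(ctx_size_get(&m) == 4 * 1024);
}
```

In this model the size query drops to zero once everything has been read, which matches the statement in the patch that the command returns zero for `size` until new device context is generated.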


> +
> +When the command completes successfully, \field{command_specific_result} is in
> +the format \field{struct virtio_admin_cmd_dev_ctx_size_get_result}.
> +
> +Once the device context is fully read, this command returns zero for
> +\field{size} until the new device context is generated.
> +
> +\paragraph{Device Context Read Command}
> +\label{par:Basic Facilities of a Virtio Device / Device groups / Group administration commands / Device Migration / Device Context Read Command}
> +
> +This command reads the current device context.
> +For the command VIRTIO_ADMIN_CMD_DEV_CTX_READ, \field{opcode} is set to 0xa.
> +The \field{group_member_id} refers to the member device to be accessed.
> +
> +This command has no command specific data.

So I am not sure this is wise. Multi-year experience with QEMU taught us
that we are likely to make mistakes when defining migration format -
forget some fields, validate them incorrectly, and so on.
Making a somewhat safe assumption that we'll make mistakes
in the spec, too, I'd like to see some kind of idea of how
we'll support compatibility and/or graceful failure if/when we do.


> +\begin{lstlisting}
> +struct virtio_admin_cmd_dev_ctx_rd_len {
> +        le32 context_len;
> +};
> +
> +struct virtio_admin_cmd_dev_ctx_rd_result {
> +        u8 data[];
> +};
> +\end{lstlisting}

so callers need to pin whatever the device tells them to?

admin commands support truncation intentionally.

it is not clear, to me, that it's ok to have device just
save as much state as it wants to.

> +
> +When the command completes successfully, \field{command_specific_result}
> +is in the format \field{struct virtio_admin_cmd_dev_ctx_rd_result}
> +returned by the device containing the device context data and
> +\field{command_specific_output} is in format of
> +\field{struct virtio_admin_cmd_dev_ctx_rd_len} containing length of
> +context data returned by the device in the command response. When the length
> +returned is zero or when the returned context data is less than the data requested by
> +the driver, the device does not have any device context data left that the device
> +can report; at this point the device context stream ends.
> +
> +The driver can read the whole device context data using one or multiple
> +commands. When the device context does not fit in the
> +\field{command_specific_result}, driver reads the subsequent remaining
> +bytes using one or more subsequent commands.

how?
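
One possible reading of the multi-command read described in the quoted text, sketched as a driver loop. This is an assumption-laden illustration, not the patch's definition: the patch does not say how the device tracks the read position, so the hypothetical `mock_dev` below simply keeps an implicit offset, and `mock_ctx_read` and `read_whole_ctx` are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a member device holding a context stream. */
struct mock_dev {
    const uint8_t *ctx; /* full device context */
    size_t len;         /* length of ctx in bytes */
    size_t off;         /* implicit read position (an assumption) */
};

/* Models one VIRTIO_ADMIN_CMD_DEV_CTX_READ: fills buf, returns context_len. */
static uint32_t mock_ctx_read(struct mock_dev *d, uint8_t *buf, uint32_t buf_len)
{
    size_t left = d->len - d->off;
    uint32_t n = left < buf_len ? (uint32_t)left : buf_len;

    memcpy(buf, d->ctx + d->off, n);
    d->off += n;
    return n;
}

/*
 * Driver side: issue reads of `chunk` bytes until the device returns zero
 * or fewer bytes than requested, which the patch treats as end of stream.
 * Returns the total bytes read, or SIZE_MAX if `out` is too small.
 */
static size_t read_whole_ctx(struct mock_dev *d, uint8_t *out, size_t out_cap,
                             uint32_t chunk)
{
    size_t total = 0;

    assert(chunk > 0);
    for (;;) {
        if (out_cap - total < chunk)
            return SIZE_MAX;
        uint32_t got = mock_ctx_read(d, out + total, chunk);
        total += got;
        if (got < chunk) /* also covers got == 0 */
            break;
    }
    return total;
}
```

Note that when the context length is an exact multiple of the chunk size, the loop needs one extra command that returns zero bytes before it can tell the stream has ended.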

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  6:41                                   ` Parav Pandit
@ 2023-10-18  6:52                                     ` Zhu, Lingshan
  2023-10-18  7:20                                       ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-18  6:52 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/18/2023 2:41 PM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Wednesday, October 18, 2023 12:06 PM
>>
>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
>>>> From: virtio-comment@lists.oasis-open.org
>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
>>>> Sent: Monday, October 16, 2023 3:18 PM
>>>>
>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Friday, October 13, 2023 3:14 PM
>>>>>>>>>> How do you transfer the ownership?
>>>>>>>>> An additional ownership deletgation by a new admin command.
>>>>>>>> if you think this can work, do you want to cook a patch to
>>>>>>>> implement this before you submitting this live migration series?
>>>>>>> I answered this already above.
>>>>>> talk is cheap, show me your patch
>>>>> Huh. We presented the infrastructure that migrates, 30+ device
>>>>> types,
>>>> covering device context ideas from Oracle.
>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
>>>>>
>>>>> Please have some respect for other members who covered more ground
>>>>> than
>>>> your series.
>>>>> What more? Apply the same nested concept on the member device as
>>>> Michael suggested, it is nested virtualization maintain exact same semantics.
>>>>> So a VF is mapped as PF to the L1 guest.
>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
>>>>>
>>>>> This nested work can be extended in future, once first level nesting
>>>>> is
>>>> covered.
>>>>>> Answer all questions above, if you think a management VF can work,
>>>>>> please show me your patch.
>>>>> The idea evolves from technical debate then pointing fingers like
>>>>> your
>>>> comment.
>>>>> I think a positive discussion with Michael and a pointer to the
>>>>> paper from
>>>> Jason gave a good direction of doing _right_ nesting that follows two
>> principles.
>>>>> a. efficiency property
>>>>> b. equivalence property
>>>>>
>>>>> (c. resource control is natural already)
>>>>>
>>>>> Both apply at VMM and at VM level enabling recursive virtualization,
>>>>> by
>>>> having VF that can act as PF inside the guest.
>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
>>>> Please just show me your patch resolving these opens, how about start
>>>> from defining virito-fs device context and your management VF?
>>> As answered, device context infrastructure is done, per device specific device-
>> context will be defined incrementally.
>>> I will not be including virtio-fs in this series. It will be done incrementally in
>> future utilizing the infrastructure build in this series.
>> Done? How do you conclude this? You just tell me what is the full set of virito-fs
>> device context now and how to migrate them.
>>
>> You cant? you refuse or you don't? Do you expect the HW designer to figure out
>> by themself?
> I wont be able to tell now as I don’t think it is necessary for this series.
> If one out of 30 devices cannot migrate because of unimaginable amount of complexity has been placed there, may be one will not implement it as member device.
>
>  From experience of migratable complex gpu devices, rdma devices (stateful having hundred thousand of stateful QPs), my understanding is complex state of virtio-fs can be defined and migratable.
> Mlx5 driver consist of 150,000 lines of code and that device is migratable with complex state.
> So I am optimistic that virtio-fs can be migratable too.
> It does not have to limited by my limited creativity of 2023.
> May be I am wrong, in that case one will not implement passthrough virtio-fs device.
your series wants to migrate device context, but doesn't define device
context, does this sound reasonable?


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  6:52                                     ` Zhu, Lingshan
@ 2023-10-18  7:20                                       ` Parav Pandit
  2023-10-18  8:42                                         ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  7:20 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, October 18, 2023 12:22 PM
> 
> On 10/18/2023 2:41 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Wednesday, October 18, 2023 12:06 PM
> >>
> >> On 10/18/2023 1:02 PM, Parav Pandit wrote:
> >>>> From: virtio-comment@lists.oasis-open.org
> >>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
> >>>> Sent: Monday, October 16, 2023 3:18 PM
> >>>>
> >>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Friday, October 13, 2023 3:14 PM
> >>>>>>>>>> How do you transfer the ownership?
> >>>>>>>>> An additional ownership deletgation by a new admin command.
> >>>>>>>> if you think this can work, do you want to cook a patch to
> >>>>>>>> implement this before you submitting this live migration series?
> >>>>>>> I answered this already above.
> >>>>>> talk is cheap, show me your patch
> >>>>> Huh. We presented the infrastructure that migrates, 30+ device
> >>>>> types,
> >>>> covering device context ideas from Oracle.
> >>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
> >>>>>
> >>>>> Please have some respect for other members who covered more ground
> >>>>> than
> >>>> your series.
> >>>>> What more? Apply the same nested concept on the member device as
> >>>> Michael suggested, it is nested virtualization maintain exact same
> semantics.
> >>>>> So a VF is mapped as PF to the L1 guest.
> >>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
> >>>>>
> >>>>> This nested work can be extended in future, once first level
> >>>>> nesting is
> >>>> covered.
> >>>>>> Answer all questions above, if you think a management VF can
> >>>>>> work, please show me your patch.
> >>>>> The idea evolves from technical debate then pointing fingers like
> >>>>> your
> >>>> comment.
> >>>>> I think a positive discussion with Michael and a pointer to the
> >>>>> paper from
> >>>> Jason gave a good direction of doing _right_ nesting that follows
> >>>> two
> >> principles.
> >>>>> a. efficiency property
> >>>>> b. equivalence property
> >>>>>
> >>>>> (c. resource control is natural already)
> >>>>>
> >>>>> Both apply at VMM and at VM level enabling recursive
> >>>>> virtualization, by
> >>>> having VF that can act as PF inside the guest.
> >>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
> >>>> Please just show me your patch resolving these opens, how about
> >>>> start from defining virito-fs device context and your management VF?
> >>> As answered, device context infrastructure is done, per device
> >>> specific device-
> >> context will be defined incrementally.
> >>> I will not be including virtio-fs in this series. It will be done
> >>> incrementally in
> >> future utilizing the infrastructure build in this series.
> >> Done? How do you conclude this? You just tell me what is the full set
> >> of virito-fs device context now and how to migrate them.
> >>
> >> You cant? you refuse or you don't? Do you expect the HW designer to
> >> figure out by themself?
> > I wont be able to tell now as I don’t think it is necessary for this series.
> > If one out of 30 devices cannot migrate because of unimaginable amount of
> complexity has been placed there, may be one will not implement it as member
> device.
> >
> >  From experience of migratable complex gpu devices, rdma devices (stateful
> having hundred thousand of stateful QPs), my understanding is complex state of
> virtio-fs can be defined and migratable.
> > Mlx5 driver consist of 150,000 lines of code and that device is migratable
> with complex state.
> > So I am optimistic that virtio-fs can be migratable too.
> > It does not have to limited by my limited creativity of 2023.
> > May be I am wrong, in that case one will not implement passthrough virtio-fs
> device.
> your series wants to migrate device context, but doesn't define device context,
> does this sound reasonable?

The generic device context is defined in [1]. With that infrastructure in place, multiple people can define per-device context in parallel once the work in [1] is done.

The context for each device type will be defined incrementally after this work.

[1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-17  1:41                   ` Jason Wang
@ 2023-10-18  8:16                     ` Parav Pandit
  2023-10-18 10:19                       ` Michael S. Tsirkin
  2023-10-19  2:41                       ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  8:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 17, 2023 7:12 AM
> 
> On Fri, Oct 13, 2023 at 2:36 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Friday, October 13, 2023 6:46 AM
> >
> > [..]
> > > > > > > It's still not clear to me how this is done.
> > > > > > >
> > > > > > > 1) guest starts FLR
> > > > > > > 2) adminq freeze the VF
> > > > > > > 3) FLR is done
> > > > > > >
> > > > > > > If the freezing doesn't wait for the FLR, does it mean we
> > > > > > > need to migrate to a state like FLR is pending? If yes, do
> > > > > > > we need to migrate the other sub states like this? If not, why?
> > > > > > >
> > > > > > In most practical cases #2 followed by #1 should not happen as
> > > > > > on the source
> > > > > side the expected is mode change to stop from active.
> > > > >
> > > > > How does the hypervisor know if a guest is doing what without trapping?
> > > > >
> > > > Hypervisor does not know. The device knows being the recipient of #1 and
> #2.
> > >
> > > We are discussing the possibility in software/driver side isn't it?
> > >
> > > 1) is initiated from the guest
> > > 2) is initiated from the hypervisor
> > >
> > > Both are softwares, and you're saying 2) should not happen after 1)
> > > since the device knows what is being done by guests? How can devices
> > > control software behaviour?
> > >
> > Device do not control software behavior.
> > i.e. either hypervisor can initiate device mode change to stop (not freeze) or
> guest can initiate FLR.
> > Device knows which is initiated first as single recipient of both.
> > Therefore, device responds accordingly.
> > For example, in the sequence you described, A device will delay mode
> > change command response, until the FLR is completed.
> 
> Finally but ok.
> 
> >
> >
> > > This only possible thing is to make sure 3) is done before 2) That
> > > is what I'm asking but you are saying freeze doesn't need to wait for FLR...
> > >
> > I think I responded in previous email further down on synchronization point
> being fw.
> > I meant to say software do not need to wait for initiation of the freeze mode
> command.
> 
> For software, did you mean the hypervisor?
> 
Yes.

> > Just the command will complete at right time.
> >
> > This is anyway very corner case.
> > On source hypervisor as written in the theory of operation, the sequence is
> active->stop->freeze.
> > When mode change is done to stop, the vcpus are already suspended.
> 
> The problem here is not the vcpu but the when FLR is being done since it may
> change the device context.
> 
Once the freeze mode transition is completed, the hypervisor sw reads the final device context to migrate.

> >
> > I agree FLR may have been initiated and driver is waiting now for 100msec.
> 
> For driver, did you mean the driver in the guest?
> 
Yes for 100msec wait time, it is the guest driver.

> >
> > So yes, device single entity synchronized it.
> >
> > > >
> > > > > > But ok, since we active to freeze mode change is allowed, lets
> > > > > > discuss
> > > above.
> > > > > >
> > > > > > A device is the single synchronization point for any device
> > > > > > reset, FLR or admin
> > > > > command operation.
> > > > >
> > > > > So you agree we need synchronization? And I'm not sure I get the
> > > > > meaning of synchronization point, do you mean the
> > > > > synchronization between freeze/stop and virtio facilities?
> > > > >
> > > > Synchronization means, handling two events in parallel such as FLR and
> other.
> > >
> > > Great. So we have a perfect race:
> > >
> > > 1) guest initiates FLR
> > > 2) device start FLR
> > > 3) hypervisor stop and freeze the device
> > > 4) device is freeze
> > > 5) hypervisor read device context A
> > > 6) migrate device contextA
> > > 8) migration is done
> > > 9) FLR is done
> > > 10) hypervisor read device context B
> > >
> > > So we end up with inconsistent device context, no? Dest want B or
> > > A+B, but you give A.
> > >
> > Since #1 and #2 is done before #3, the device knows to finish the FLR, hence
> #9 is completed before #4.
> 
> Ok, that's my understanding and that's why I'm asking, but you said freeze/stop
> doesn't need to wait for FLR before.
> 
The hypervisor side does not need to wait before issuing the freeze/stop command.
The device simply completes the command later if an FLR was ongoing in this corner case.
I covered this in v2 now.
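
The ordering described here, where the device is the single synchronization point and delays the mode change completion until an ongoing FLR finishes, could be modeled roughly as below. This is a sketch under assumptions only: `member_dev`, `flr_start`, `mode_set` and `flr_done` are hypothetical names, and the model allows a single outstanding mode change command, which the thread does not specify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum dev_mode { MODE_ACTIVE, MODE_STOP, MODE_FREEZE };

/* Hypothetical device-side state for one member device (not from the patch). */
struct member_dev {
    enum dev_mode mode;
    bool flr_in_progress;       /* guest-initiated FLR is ongoing */
    bool mode_cmd_pending;      /* owner's mode change is deferred */
    enum dev_mode pending_mode; /* mode requested by the deferred command */
};

/* The guest initiates FLR on the member device. */
static void flr_start(struct member_dev *d)
{
    assert(d != NULL);
    d->flr_in_progress = true;
}

/*
 * The owner driver issues a mode change admin command without waiting.
 * Returns true if the command completes immediately, false if the
 * device defers completion because an FLR is still in progress.
 */
static bool mode_set(struct member_dev *d, enum dev_mode m)
{
    assert(d != NULL && !d->mode_cmd_pending);
    if (d->flr_in_progress) {
        d->mode_cmd_pending = true;
        d->pending_mode = m;
        return false;
    }
    d->mode = m;
    return true;
}

/* FLR finishes; returns true if a deferred mode command completes now. */
static bool flr_done(struct member_dev *d)
{
    assert(d != NULL);
    d->flr_in_progress = false;
    if (!d->mode_cmd_pending)
        return false;
    d->mode = d->pending_mode;
    d->mode_cmd_pending = false;
    return true;
}
```

With this ordering the hypervisor reads the device context only after the deferred command completes, so it sees the post-FLR context rather than a stale one.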

> >
> > Alternatively, in above sequence when destination sees #10, it can
> immediately finish the FLR as dest device is not under FLR, treating it as no-op.
> >
> > Both ways to handle are fine. (and rare in practice, but yes, its possible).
> >
> > I will write both the options in the device requirements.
> >
> > > >
> > > > > > So, the migration driver do not need to wait for FLR to complete.
> > > > >
> > > > > I'm confused, you said below that device context could be changed by
> FLR.
> > > > >
> > > > Yes.
> > > > > If FLR needs to clear device context, we can have a race where
> > > > > device context is cleared when we are trying to read it?
> > > > >
> > > > I didn’t say clear the context.
> > > > FLR updates the device context.
> > >
> > > In what sense?
> > >
> > > Indicating a new device context and discarding
> the old one.
> 
> For example, what will queue_address have after an FLR?
> 
All zeros.
It probably does not matter because the device context is meant to capture this FLR notion.
device_status shows that the device is under reset after the FLR.

> > I am glad you asked this. I wanted to get the basic part captured before
> adding this optimization.
> 
> Ok.
> 
> > Probably it is good to add it now in the v2 as we crossed this stage now.
> >
> > > > Device is serving the device context read write commands, serving
> > > > FLR, answering mode change command, So device knows the best how
> > > > to avoid
> > > any race.
> > >
> > > You want to leave those details for the vendor to figure out? If
> > > devices know everything, why do we need device normative?
> > >
> > Device knows its implementation.
> > Implementation guidelines to be in the normative.
> > I will add it to the normative.
> >
> > > I see issues at least for FLR, I'm pretty sure they are others. If a
> > > design requires us to audit all the possible conflicts between
> > > virtio facilities and transport. It's a strong hint of layer
> > > violation and when it happens it for sure may hit a lot of problems that are
> very hard to find or debug thus we should drop such a design.
> > > I suggest using the RFC tag since the next version (if there is one)
> > > as I see it is immature in many ways.
> > >
> > Technical committee audits the required touch points like rest of the industry
> committees that I participated.
> > I disagree to your above point.
> > If you do not want to review, that is fine.
> 
> I don't want to hold my breath if I see something that is obviously wrong. Using
> RFC may help people to know that it is a draft that has something to be
> improved before it can be merged.
> 
> > We are reviewing with other members and also contributed by them.
> >
> > > What's more, solving races is much easier if the device
> > > functionality is self contained. For example, for a self contained
> > > device with the transport as the single interface, we can leverage
> > > from transport
> > > (PCI) for dealing with races, arbitration, ordering, QOS etc which
> > > is probably required in the internal channel between the owner and
> > > the member. But all of these were missed in your series and even if
> > > you can I'm not sure it's worthwhile to reinvent all of them.
> > >
> > At the end there is one physical device serving owner and member devices.
> 
> This doesn't happen yet. For example, a VF with adminq that can be isolated
> with PASID makes some sense.
PASID is for user space processes.
Kernel space does not consume a PASID.
Depending on PASID can work for the mediation approach only.
Last I heard, the CPU or kernel dropped that support because some special instruction did not work on that CPU.

> 
> > So a claim like things are on the VF hence you magically get 200% QoS
> guarantee is myth.
> 
> That's not my point, I'm saying VF could benefit from the e.g QOS support in
> PCIE. I'm not saying it's perfect.
> 
It is not about the QoS support in PCIe.
It is about the restriction to respond within N usec; the burden on the device to keep something in always-available memory for a rarely used element is wasteful.

> >
> > Quoting "all of these" is also incorrect.
> >
> > Things added gradually, first functionally with reasonable performance,
> followed by notion and extension for QoS.
> > By definition of PCI transport for SR-IOV there is internal channel.
> >
> > It is reasonably well proposal in current form.
> > There are few race condition that you highlight are extremely rare in nature.
> 
> It's not rare since there's no way to know what the guest is doing.
I will be practical. It is rare because a guest in a production environment is interested in running traffic, not in constantly resetting the device.
And if it does, taking longer to migrate is also fine, because the guest is driving the device for production apps.
> It's actually the critical part for live migration to be correct. You are proposing
> migration so it must cover all those cases to make sure there is no case to make
> your proposal a dead end.
Yes, it should be correct.
> 
> > Suggestions are welcome to improve.
> 
> I have given some and I will give more.
> 
Sure, that is very helpful as usual.

> > There were couple of them by Michael too, I am addressing them in the v2.
> >
> > > For example, for the architecture like owner/member, if the virtio
> > > or transport facility could be controlled via device internal
> > > channels besides the transport, such a channel may complicate the
> synchronization a lot.
> > Two vendors who actually make the hw sriov devices are authoring these and
> others are also reviewing.
> > So I am more confident that it is solid enough.
> > Also, a similar design has been seen with other device for more than a year
> as GPL integrated with QEMU for a year now and with upstream kernel.
> >
> > > The device needs to
> > > be able to handle or synchronize requests from both PCI and owner in
> parallel.
> > > They are just too many possible races and most of my questions so
> > > far come from this viewpoint. I wouldn't go further for other stuff
> > > since I believe I've spotted sufficient issues and that's why I must
> > > stop at this patch before looking at the rest.
> > It is your call to stop or progress.
> > I find your reviews useful to improve this proposal, so I will fix them.
> 
> My point is to make the theory correct before looking at the others as I had a
> lot of questions (as demonstrated in this thread). I think it's not hard to
> understand as the rest of the series are based on the theory.
> 
> >
> > >
> > > Admin commands are fine if it does real administrative jobs such as
> > > provisioning since such work is beyond the core virtio functionality.
> > >
> > > Again, the goal of virtio spec is to have a device with sufficient
> > > guidelines that is easy to implement but not leave the vendors to
> > > waste their engineering resources in figuring or fuzzing the corner cases.
> > I have not seen an industry standard spec or a software that does not have
> corner cases.
> 
> Corner cases are probably not accurate. I meant, for you, it's probably a corner
> case, but for me it's kind of obvious.
> 
> > The spec proposal is from > 1 device vendors.
> 
> That's good but it doesn't mean it doesn't have any (major) issues.
> E.g vendors may choose to just implement part of the PCIE capabilities so they
> don't do audits for the rest.
> 
> >
> > I will focus on more practical aspects to progress and improve this spec.
> > >
> > > >
> > > > > > When admin cmd freeze the VF it can expect FLR_completed VF.
> > > > >
> > > > > We need to explain why and how about the resume? For example, is
> > > > > resuming required to wait for the completion of FLR, if not, why?
> > >
> > > This question is ignored.
> > >
> > I probably missed. Sorry about it.
> > No, the driver does not need to wait for FLR to finish to issue resume
> > command,
> 
> Good but I want to know if stop/freeze->active requires to wait for the
> completion of FLR. I guess the answer is yes.
> 
Yes.
> > as this typically done on the destination member device which should not be
> under FLR.
> > I will write up the requirements further.
> >
> > > > > In another thread you are saying that the PCI composition is
> > > > > done by hypervisor, so passthrough is really confusing at least for me.
> > > > >
> > > > I explained there what vPCI composition is done there.
> > > > PCI config space and msix side of composition is done.
> > > > The whole virtio interface is not composed.
> > >
> > > You need to describe this somewhere, no? That's what I'm saying.
> > >
> > Mostly not. What is not done is not written.
> >
> > > And passthrough is misleading here.
> > >
> > Passthrough is mentioned in theory of operation.
> > It is not present in requirements section.
> > So, it is fine.
> 
> I suggest documenting or defining the "passthrough" methodology somewhere.
> Michael tries to define it in another thread, if it's ok, let's use that. We can't
> require people to read VFIO code in order to know what happens in the virtio
> spec.
> 
Yes. I captured it in v2 in the assumptions section.

> >
> > > >
> > > > > > Ok. I assume "reset flow" is clear to you now that it points to section
> 2.4.
> > > > > > This section is not normative section, so using an extra word
> > > > > > like "flow" does
> > > > > not confuse anyone.
> > > > > > I will link to the section anyway.
> > > > >
> > > > > Probably, but you mention FLR flow as well.
> > > > As I said, not repeating the PCIe spec here. The reader knows what
> > > > FLR of the
> > > PCIe transport.
> > >
> > > Ok, I'm not a native speaker, but I really don't know the difference
> > > between "FLR" and "FLR flow".
> > >
> > Lets keep it simple. I will write it as FLR, as pci transport has it as FLR.
> 
> Ok.
> 
Done in v2.

> >
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > and may also undergo PCI function level
> > > > > > > > > > +reset(FLR) flow.
> > > > > > > > >
> > > > > > > > > Why is only FLR special here? I've asked FRS but you
> > > > > > > > > ignore the
> > > question.
> > > > > > > > >
> > > > > > > > FLR is special to bring clarity that guest owns the VF
> > > > > > > > doing FLR, hence
> > > > > > > hypervisor cannot mediate any registers of the VF.
> > > > > > >
> > > > > > > It's not about mediation at all, it's about how the device
> > > > > > > can implement what you want here correctly.
> > > > > > >
> > > > > > > See my above question.
> > > > > > >
> > > > > > Ok. it is clear that live migration commands cannot stay on
> > > > > > the member device
> > > > > because the member device can undergo device reset and FLR flows
> > > > > owned by the guest.
> > > > >
> > > > > I disagree, hypervisors can emulate FLR and never send FLR to real
> devices.
> > > > >
> > > > That would be some other trap alternative that needs to dissect
> > > > the device
> > > and build infrastructure for such dissection is not desired in the listed use
> case.
> > >
> > > Do you need to trap FLR or not? You're saying the hypervisor is in
> > > charge of vPCI, how is this differ to what you proposed? If not, how
> > > can vPCI be composed?
> > >
> > Live migration driver do not need to trap FLR.
> 
> Maybe I misunderstood your vPCI composition, but it's really helpful to
> document how it is expected to be done.
> 
We can probably have it in the cover letter as hypervisors may evolve and do more passthrough work than done today.

> >
> > > I believe you need to document how vpci is supposed to be done,
> > > since I believe your proposal can only work with such specific types
> > > of PCI composition. This is one of the important things that is missed in this
> series.
> > >
> > I don’t see a need to describe vpci composition as there may be more than
> one way to do it.
> 
> More than one way for sure, but this contradicts what you say: you said you
> don't trap FLR ...
> 
Don’t trap the FLR in the live migration driver.

> People like me may wonder for example why FLR is mentioned, as FLR can be
> trapped and emulated.
It can be, but that emulation requires knowledge of device-specific details.
This may be fine in a stack that composes the virtio device from non-virtio devices.
> 
> Another example, when a device can be saved and restored, the hypervisor may
> schedule the device among multiple VMs, in that case, trapping FLR is a must.
> 
I think you mean sharing when you say scheduling.
That can work when the data path is also trapped.
Maybe one can do it using PASID and queue assignment.
But then it is not passthrough.
So these are two different use cases; in the sharing case the whole PCI config space is fully replicated and hence FLR never reaches the real device.


> > What I think it is worth to describe is the whole pci device is not stored in
> device context.
> > I will try to add a short description around it.
> >
> > > >
> > > > So your disagreement is fine for non-passthrough devices.
> > > >
> > > > > > (and hypervisor is not involved in these two flows, hence the
> > > > > > admin command
> > > > > > interface is designed such that it can fulfil above requirements).
> > > > > >
> > > > > > Theory of operation brings out this clarity. Please notice
> > > > > > that it is in
> > > > > introductory section with an example.
> > > > > > Not normative line.
> > > > > >
> > > > > > > >
> > > > > > > > > > Such flows must comply to the PCI standard and also
> > > > > > > > > > +virtio specification;
> > > > > > > > >
> > > > > > > > > This seems unnecessary and obvious as it applies to all
> > > > > > > > > other PCI and virtio functionality.
> > > > > > > > >
> > > > > > > > Great. But your comment is contradictory.
> > > > > > > >
> > > > > > > > > What's more, for the things that need to be
> > > > > > > > > synchronized, I don't see any descriptions in this patch. And if it
> doesn't need, why?
> > > > > > > > With which operation should it be synchronized and why?
> > > > > > > > Can you please be specific?
> > > > > > >
> > > > > > > See my above question regarding FLR. And it may have others
> > > > > > > which I haven't had time to audit.
> > > > > > >
> > > > > > Ok. when you get chance to audit, lets discuss that time.
> > > > >
> > > > > Well, I'm not the author of this series, it should be your job
> > > > > otherwise it would be too late.
> > > > >
> > > > As author, what we think, I will cover. If you have specific
> > > > points to add value,
> > > please share, I will look into it.
> > >
> > > I've pointed out sufficient issues. I have a lot of others but I
> > > don't want to have a giant thread once again.
> > >
> > I see following things to improve in the requirements which I will do in v2.
> >
> > 1. Document race around FLR and admin commands for really rare corner case.
> > 2. Some text around not migrating the pci device registers
> > 3. Interaction with PM commands
> >
> > > >
> > > > > For example, how is the power management interaction with the
> > > freeze/stop?
> > > > >
> > > > Power management is owned by the guest, like any other virtio interface.
> > > > So freeze/stop do not interfere with it.
> > >
> > > I don't think this is a good answer. I'm asking how the PM interacts
> > > with freeze/stop, you answer it works well.
> >
> > >
> > > I'm not obliged to design hardware for you but figuring out the bad
> > > design for virtio. I'm not convinced with a proposal that misses a
> > > lot of obvious critical cases and for sure it's not my job to solve them.
> > >
> > I am not asking you to solve.
> 
> My point is that, it's better for you to have some investigation on the PM
> instead of me.
> 
Yes, I updated the device requirement in v2.

> >
> > > I've demonstrated the possible races with FLR. So did the PM. For
> > > example, if VF is in D3cold state, can we still read its device context?
> > I think yes, but I will double check.
> >  If yes, is it a violation of the PCIE spec? If not, why?
> 
> So you are emulating the state instead of a real suspension?
> 
No, it is not emulating the state. The state is present in the device at the PCI level, under guest and member device control.
When the controlling function (owner PF) administers the member device to freeze, it puts the whole device in freeze, so D3->D0 cannot happen.

> > No, because device context is owned by the owner device and not the VF. SR-
> PCIM interface has defined it be outside of scope of PCIe spec.
> >
> > > How about other states? Can the device be freezed in the middle of
> > > PM state transitions? If yes, how can it work without migrating PCI
> > > states?
> > I will double check, but unlikely, it should be similar to FLR case to keep the
> device to avoid treating it differently.
> 
> The reason why I see it is different from FLR is that
> 
> 1) D3cold requires the VF to be off the power
> 2) State transition might takes more than what FLR did, PCI seems only cover
> the minimum delay but not maximum which may have implications for
> downtime
> 
D3cold is not controlled by the guest driver.
PM register can change D0 to D3hot.

> >
> > > Well, I meant we need a more precise definition of each state
> > > otherwise it could be ambiguous (as I pointed above).
> > Ok. so, few things about read and other messages, I will add.
> >
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > In "stop" mode, the device wont process descriptors.
> > > > > > >
> > > > > > > If the device won't process descriptors, why still allow it
> > > > > > > to receive
> > > > > notifications?
> > > > > > Because notification may still arrive and if the device may
> > > > > > update any counters as part of
> > > > >
> > > > > Which counters did you mean here?
> > > > >
> > > > The counter that Xuan is adding and any other state that device
> > > > may have to
> > > update as result of driver notification.
> > > > For example caching the posted avail index in the notification.
> > >
> > > A link to those proposals?
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00048.html
> >
> 
> I don't see how this is related to "posted avail index" etc.
Driver notification counters are updated.

> 
> > > If the device must depend on those cached features to work it's
> > > really fragile. If not, we don't need to care about them.
> > It is not dependent.
> > It is the infrastructure to enable it.
> > Same for other shared memory region accesses.
> >
> > >
> > > >
> > > > > > it which needs to be migrated or store the received notification.
> > > > > >
> > > > > > > Or does it really matter if the device can receive or not here?
> > > > > > >
> > > > > > From device point of view, the device is given the chance to
> > > > > > update its device
> > > > > context as part of notifications or access to it.
> > > > >
> > > > > This is in conflict with what you said above " Device cannot
> > > > > process the queue ..."
> > > > >
> > > > No, it does not.
> > > > Device context is updated within the device without accessing the
> > > > queue
> > > memory of the guest.
> > >
> > > This is not documented or explained anywhere?
> > >
> > Why should it be explained?
> > device is not accessing the guest memory -> this is mentioned in stop mode.
> 
> Isn't it hard to see the difference between the following two?
> 
I am not following your question.
> 1) In stop mode, device is not accessing guest memory
> 2) device context is updated without accessing the queue memory of the guest
> 
Device context is read/written by the owner PF, so it does not touch the guest memory.

> 1) is to define the stop mode, 2) is to define the behaviour of device context
> 
> Or are you saying device context can only be fetched after the device is
> stopped?
> 
No. The device context can be fetched in all 3 modes.
It would be unusual for the hypervisor to fetch the device context while a mode transition is in progress.

> > Hence, there is no need to write above.
> >
> > > >
> > > > > Maybe you can give a concrete example.
> > > > >
> > > > The above one.
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > + the member device context
> > > > > > > > >
> > > > > > > > > I don't think we define "device context" anywhere.
> > > > > > > > >
> > > > > > > > It is defined further in the description.
> > > > > > >
> > > > > > > Like this?
> > > > > > >
> > > > > > > """
> > > > > > >  +The member device has a device context which the owner
> > > > > > > driver can
> > > > > > > +either read or write. The member device context consist of
> > > > > > > +any
> > > > > > > device  +specific data which is needed by the device to
> > > > > > > resume its operation  +when the device mode """
> > > > > > >
> > > > > > Yes.
> > > > > > Further patch-3 adds the device context and also add the link
> > > > > > to it in the
> > > > > theory of operation section so reader can read more detail about it.
> > > > > >
> > > > > > > "Any" is probably too hard for vendors to implement. And in
> > > > > > > patch 3 I only see virtio device context. Does this mean we
> > > > > > > don't need transport
> > > > > > > (PCI) context at all? If yes, how can it work?
> > > > > > >
> > > > > > Right. PCI member device is present at source and destination
> > > > > > with its layout,
> > > > > only the virtio device context is transferred.
> > > > > > Which part cannot work?
> > > > >
> > > > > It is explained in another thread where you are saying the PCI
> > > > > requires mediation. I think any author should not ignore such
> > > > > important assumptions in both the change log and the patch.
> > > > >
> > > > > And again, the more I review the more I see how narrow this
> > > > > series can be
> > > used:
> > > > >
> > > > I explained this before and also covered in the cover letter.
> > > >
> > > > > 1) Only works for SR-IOV member device like VF
> > > > It can be extended to SIOV member device in future.
> > > > Today these are the only type of member device virtio has.
> > >
> > > That is exactly what I want to say, it can only work for the
> > > owner/member model. It can't work when the virtio device is not
> > > structured like that. And you missed that most of the existing
> > > virtio devices are not implemented in this model. It means they
> > > can't be migrated with a pure virtio specific extension. For you,
> > > SR-IOV is all but this is not true for virtio. PCI is not the only transport and
> SR-IOV is not the only architecture in PCI.
> > >
> > Each transport will have its own way to handle it.
> > When there is MMIO owner-member relationship arise, one will be able to do
> so as well.
> > In fact other transports will likely miss out as they have not established such
> pace.
> >
> > > And I'm pretty sure the owner/member is not the only requirement,
> > > there are a lot of other assumptions which are missed in this series.
> > >
> > One proposal does not do everything.
> > It is just impractical.
> 
> For other assumptions, I meant:
> 
> 1) how vpci is composed, if it can be composed as vhost, why do we need to
> mention "passthrough"
I am lost. Above you said you want to capture how vPCI is composed.
Here you ask why passthrough is mentioned.

> 2) the cap/bar layout, for example if a cap shares BARs with others, it can't be
> "passthrough", no?
Why not? The cap exposes everything to the guest.

> 
> >
> > > >
> > > > > 2) Mediate PCI but not virtio which is tricky
> > > > > 3) Can only work for a specific BAR/capability register layout
> > > > >
> > > > > Only 1) is described in the change log.
> > > > >
> > > > > The other important assumptions like 2) and 3) are not
> > > > > documented
> > > anywhere.
> > > > > And this patch never explains why 2) and 3) is needed or why it
> > > > > can be used for subsystems other than VFIO/Linux.
> > > > >
> > > > Since I am not mentioning vfio now, I will refrain from mentioning
> > > > others as well. :)
> > >
> > > It's not about VFIO at all. It's about to let people know under
> > > which case this proposal could work. Otherwise if a vendor develops
> > > a BAR/cap which is not at page boundary. How could you make it work with
> your proposal here?
> > >
> > Vendor is a cloud operator which is building the device, so it will always work
> it has the matching capabilities on source and destination.
> 
> I meant, for example, if common_cfg shares a BAR with others but doesn't own
> a page exclusively, you need to trap, no?
> 
For a passthrough device, why would there be such a restriction?

> >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >and device configuration space may change. \\
> > > > > > > > > > +\hline
> > > > > > > > >
> > > > > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > > > > >
> > > > > > > > All pci devices which belong to a single guest VM are not
> > > > > > > > stopped
> > > > > atomically.
> > > > > > > > Hence, one device which is in freeze mode, may still
> > > > > > > > receive driver notifications from other pci device,
> > > > > > >
> > > > > > > Device may choose to ignore those notifications, no?
> > > > > > >
> > > > > > > > or it may experience a read from the shared memory and get
> > > > > > > > garbage
> > > > > data.
> > > > > > >
> > > > > > > Could you give me an example for this?
> > > > > > >
> > > > > > Section 2.10 Shared Memory Regions.
> > > > >
> > > > > How can it experience a read in this case?
> > > > >
> > > > MMIO read/write can be initiated by the peer device while the
> > > > device is in
> > > stopped state.
> > >
> > > Ok, but what I want to say is how it can get the garbage data here?
> > >
> > If the device mode is changed to freeze while it is being read by the peer
> device, it can get garbage data or last data.
> > Which may not be the one that is expected.
> > So first all the initiator devices are stopped, ensure that they do not make any
> requests.
> >
> > And there are requests, which gets proper answer.
> 
> Ok.
> 
> >
> > > >
> > > > > Btw, shared regions are tricky for hardware.
> > > > >
> > > > > >
> > > > > > > > And things can break.
> > > > > > > > Hence the stop mode, ensures that all the devices get
> > > > > > > > enough chance to stop
> > > > > > > themselves, and later when freezed, to not change anything
> internally.
> > > > > > > >
> > > > > > > > > > +0x2   & Freeze &
> > > > > > > > > > + In this mode, the member device does not accept any
> > > > > > > > > > +driver notifications,
> > > > > > > > >
> > > > > > > > > This is too vague. Is the device allowed to be freezed
> > > > > > > > > in the middle of any virtio or PCI operations?
> > > > > > > > >
> > > > > > > > > For example, in the middle of feature negotiation etc.
> > > > > > > > > It may cause implementation specific sub-states which
> > > > > > > > > can't be
> > > migrated easily.
> > > > > > > > >
> > > > > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > > > > It is passthrough device, hence hypervisor layer do not
> > > > > > > > get to see sub-
> > > > > state.
> > > > > > > >
> > > > > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > > > > The device context already covers this sub-state.
> > > > > > >
> > > > > > > 1) driver writes driver_features
> > > > > > > 2) driver sets FEATURES_OK
> > > > > > >
> > > > > > > 3) device receive driver_features
> > > > > > > 4) device validating driver_features
> > > > > > > 5) device clears FEATURES_OK
> > > > > > >
> > > > > > > 6) driver read stats and realize FEATURES_OK is being
> > > > > > > cleared
> > > > > > >
> > > > > > > Is it valid to be frozen of the above?
> > > > > > No. device mode is frozen when hypervisor is sure that no more
> > > > > > access by the
> > > > > guest will be done.
> > > > >
> > > > > How, you don't trap so 1) and 2) are posted, how can hypervisor
> > > > > know if there's inflight transactions to any registers?
> > > > >
> > > > Because hypervisor has stopped the vcpus which are issuing them.
> > >
> > > MMIO are posted. vCPU is stopped but the transactions are inflight.
> > > How could the hypervisor/device know if there's any inflight PCIE
> > > transactions here? So I can imagine what happens in fact is the TLP
> > > for freezing is ordered with the TLP for posted MMIO. This is
> > > probably guaranteed for typical PCIE setup but how about the relaxed
> ordering?
> >
> > Vcpus do not generate relaxed ordering MMIOs.
> > In pci spec: " If this bit is Set, the Function is permitted to set
> > the Relaxed Ordering bit in the Attributes field of transactions it initiates".
> >
> > Function initiates RO requests, not the vcpu.
> > Hence, it is fine.
> >
> 
> Ok.
> 
> > > >
> > > > > > What can happen between #2 and #3, is device mode may change to
> stop.
> > > > >
> > > > > Why can't be freezed in this case? It's really hard to deduce
> > > > > why it can't just from your above descriptions.
> > > > >
> > > > On the source hypervisor, the mode changes are active->stop->freeze.
> > > > Hence when freeze is done, the hypervisor knows that all inflight
> > > > has been
> > > stopped by now.
> > >
> > > Ok, but how about freezing between 3) and 4). If we allow it, do we
> > > need to migrate to this state? If yes, how can it work with your
> > > device context? If not, shouldn't we document this?
> > >
> > May be, some of these are implementation details. I am not sure it belongs to
> spec.
> 
> The point is to make sure that your device context covers this case.
> If it can't be covered, it's a design defect.
> 
> > Like RSS update while packets are received.. such implementation details are
> not part of the spec.
> 
> This is definitely different, the driver can choose to synchronize or the end user
> can tolerate the possible out of order packets in this case.
> 
Right but it is not defined in the spec.

> This is not the case here, if freezing between 3) and 4) is allowed, your current
> device context can't cover this case and guests can't tolerate such kinds of
> errors after migration for sure.
> 
Ok. I will add the text around this in v3.

> >
> > > >
> > > > > Even if it had, is it even possible to list all the places where
> > > > > freezing is prohibited? We don't want to end up with a spec that
> > > > > is hard to implement or leave the vendor to figure out those tricky parts.
> > > > >
> > > > The general idea is not prohibiting the freeze/stop mode.
> > > > If the device needs more time, let device take time to do it.
> > >
> > > Ok, it means:
> > >
> > > 1) there're conditions from stop to freeze, then what are they?
> > No, there isn’t condition.
> > May be I didn’t follow the question.
> 
> E.g under which condition could the device change the status from active to
> stop etc. That's something I keep asking with a concrete example (e.g FLR).
> 
The device mode is changed by the driver from active to stop; this is the admin mode.
FLR does not change the mode to stop/freeze, because FLR belongs to the guest-driver-controlled operational state of the device.
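
To illustrate the separation described above, here is a minimal sketch in Python. It is a toy model, not spec text: the class and method names are made up for this example, and only the value 0x2 for Freeze comes from the table in the patch; the other mode values and the device status value are assumptions.

```python
from enum import Enum


class Mode(Enum):
    # Only FREEZE = 0x2 is taken from the patch table quoted in this thread;
    # the other two values are assumed for illustration.
    ACTIVE = 0x0
    STOP = 0x1
    FREEZE = 0x2


class MemberDevice:
    """Toy model: the admin mode and the guest-visible state are independent."""

    def __init__(self):
        self.mode = Mode.ACTIVE     # changed only by the owner's admin command
        self.device_status = 0x0F   # guest-controlled status, illustrative value

    def admin_set_mode(self, new_mode):
        # Hypothetical stand-in for the mode-set admin command issued via the owner.
        self.mode = new_mode

    def guest_flr(self):
        # A guest-initiated FLR resets guest-visible state and leaves the mode alone.
        self.device_status = 0


dev = MemberDevice()
dev.admin_set_mode(Mode.STOP)
dev.guest_flr()
print(dev.mode.name, dev.device_status)
```

In this model an FLR issued while the device is in stop mode leaves it in stop mode; only the guest-visible status is reset.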

> > > 2) how much time at most? E.g FLR takes at most 100ms.
> > From the driver side, it is 100msec for device side it can be less too.
> > As soon as FLR is done or enough to record it, is done, stop can continue.
> >
> > > 3) If it needs more time, can this time satisfy the downtime requirement?
> > >
> > Guest VM for all practical purposes is not busy in doing FLR, it is a corner
> case, yet we have to cover it.
> 
> Corner case in what sense? A loop in a simple shell script can trigger this easily.
> 
Sure, but that is not a practical application of the guest VM.
Hence, it is a corner case.

> > And yes, it satisfy the downtime requirements, because VM is already not
> interested in the packets, it is busy doing the FLR.
> 
> Well, it has subtle differences. VM may have more than one interface, just one
> of the interfaces is doing FLR.
> 
Sure, it can. But in that case the time taken to stop that VF is not critical.
The VM is busy with non-critical work.

> >
> > > >
> > > >
> > > > > > And in stop mode, device context would capture #5 or #4,
> > > > > > depending where is
> > > > > device at that point.
> > > > > >
> > > > > > > >
> > > > > > > > > And what's more, the above state machine seems to be
> > > > > > > > > virtio specific, but you don't explain the interaction
> > > > > > > > > with the device status state
> > > > > > > machine.
> > > > > > > > First, above is not a state machine.
> > > > > > >
> > > > > > > So how do readers know if a state can go to another state and when?
> > > > > > >
> > > > > > Not sure what you mean by reader. Can you please explain.
> > > > >
> > > > > The people who read virtio spec.
> > > > >
> > > > So question is "how reader knows if a state can go to another
> > > > state and
> > > when"?
> > > > It is described and listed in the table, when a mode can change.
> > >
> > > It's not only "if" but also "when". Your table partially answers the
> > > "if '' but not "when". I think you should know now the state
> > > transition is conditional. So let's try our best to ease the life of the vendor.
> > What do you mean when?
> > I do not understand that "mode change is conditional"? it is not based on the
> condition.
> > [..]
> 
> See above.
> 
> >
> > > > > Let's define the synchronization point first. And it
> > > > > demonstrates at least devices need to synchronize between the
> > > > > free/stop and virtio device status machine which is not as easy as what
> is done in this patch.
> > > > >
> > > > Synchronization point = device.
> > >
> > > This is obvious as we can't rule stuff outside virtio, and we are
> > > talking about devices not drivers here. But the spec needs
> > > sufficient guidance/normative for the vendor to implement. It's more
> > > than just saying "device is synchronization point".
> > >
> > The requirements are already covering what device needs to do.
> > Some interaction points are missing, as I acked above, I will add them.
> >
> > [..]
> > > > > Until virtio reset, this is how virtio works now. I've pointed
> > > > > out that it may cause extra troubles when trying to resume, but
> > > > > you don't tell me what's wrong to keep that?
> > > > >
> > > > If kept, hypervisor may not be able to decide when to change the
> > > > mode from
> > > active->stop.
> > >
> > > Why? It is simply done when mgmt requires a migration?
> > >
> > Mgmt is bit higher level entity. Underneath the software layers may wait until
> the time is right to migrate.
> 
> I don't understand, anyhow the migration request could not be sent to the
> device directly without the assistance in hypervisor.
> 
> > The fundamental point is, the device context is expected to return the
> incremental value, that is changed content from last time.
> > So once all changed content is read, its empty.
> 
> You can't easily define an incremental value for all types of states or structures:
> 
> 1) device with complicated states like RAM or other
> 2) the device state has complicated data structures
> 
What I understand is that the device context is a complicated structure.
So it will be defined incrementally as it matures.

> >
> > > What's more important, PCI allows multiple common_cfgs. So the
> > > hypervisor can choose to reserve one common_cfg for live migration.
> > > In this case we don't have to read to clear semantics.
> > Common_cfg does not serve large device context, nor it serves DMA.
> 
> Well, I'd think e.g the address of the descriptor table is part of the device
> context, and it can be read some common_cfg.
It can be read, but the point is that we do not want to save 64 VQ tables, RSS, flow filters and statistics all in some common config registers.

> 
> >
> > >
> > > Or, are you saying the value read from common_cfg is not device context?
> > The value of common config is part of the device context that represents
> current common config.
> >
> > > Isn't this conflict with your vague definition of device context?
> > >
> > You mentioned you stop at this patch,
> 
> Stop means stopping comment.
> 
> > so likely you didn’t read device context patch, hence you quote it vague.
> > So I don’t know what you mean by vague.
> 
> So in this patch you define device context as:
> 
> "The member device context consist of any device specific data which is needed
> by the device to resume its operation"
> 
> So the address of the descriptor table satisfy this definition? If not, why?
> 
Address of the descriptor table is part of device context.

> > Please let me know what you additional thing you want to see in device
> context after you reach that patch.
> >
> >
> > > > We can opt for a mode where full device context is read in each
> > > > mode
> > > without clearing it.
> > > > But than it can be very specific to a version of qemu, which we
> > > > are avoiding it
> > > here.
> > > >
> > > > > > 2. device context returns incremental value from the previous
> > > > > > read. So, it
> > > > > needs to clear it.
> > > > >
> > > > > I don't understand here. This is not the case for most of the devices.
> > > > >
> > > > Not sure which devices you mean here with "most of the devices".
> > > > Device context functions like a write record pages (aka dirty pages).
> > >
> > > It's definitely different. We want to migrate dirty pages lively
> > > which can consume a lot of bandwidth. So reporting delta makes a lot
> > > of sense here since it would have a lot of rounds of syncing and it
> > > doesn't result in blockers resuming.
> > >
> > Write records are reported as delta from the previous read.
> >
> > > For device context, how many rounds of syncing did you expect, and
> > > if we have N rounds, we need to restore N rounds in order to resume?
> > > Do you want to live migrating device states? If it's only 1 or 2 rounds, why
> bother?
> > >
> > Live migrate the device context. Typically in current software using it, it is 2
> rounds.
> 
> If it's just 2 rounds, why bother for delta? It is only helpful if we want to live
> migrate some device with giant states with several rounds, and in that case we
> should leave it as a device specific state.
> 
The number of rounds matters less. A device has a lot of things to set up.
And a VM may have many devices, so just like the pre-copy of dirty pages, this extends pre-copy to the device state (context).
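
A minimal sketch of the incremental read in Python follows. It is an illustration only: the class, the field names and the dictionary representation are assumptions made for this example, not the device context format of the patch.

```python
class DeviceContext:
    """Toy source-side context that reports only what changed since the last read."""

    def __init__(self):
        self._fields = {}
        self._changed = set()

    def update(self, name, value):
        # Called when the device state changes (field names here are made up).
        self._fields[name] = value
        self._changed.add(name)

    def read_delta(self):
        # Return the fields changed since the previous read, then clear the record.
        delta = {name: self._fields[name] for name in sorted(self._changed)}
        self._changed.clear()
        return delta


src = DeviceContext()
src.update("vq0_desc_addr", 0x1000)
src.update("rss_key", b"\x01\x02")
round1 = src.read_delta()           # read while the device is active or stopped

src.update("vq0_avail_idx", 7)
round2 = src.read_delta()           # read once the device is frozen

dest = {}
for delta in (round1, round2):      # destination writes the rounds in order
    dest.update(delta)
print(dest)
```

The destination can pre-set up the bulk of the state from the first round while the source is still running, and the second round carries only what changed afterwards. A read with nothing new returns an empty delta.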

> > The interface is generic that if needed more rounds are possible.
> >
> > Even device for most practical purpose will implement 2 rounds.
> >
> > > And for the delta, how do you know you can easily define deltas for
> > > every type of device, especially the ones with complicated internal
> > > states? Defining states has already been demonstrated as a
> > > complicated task for some devices like virtio-FS and you want to complicate
> it furtherly?
> > >
> > What is your question? If you say virtio-fs is complicated state, may be it
> should not have existed itself in the virtio spec as first place.
> 
> We have just more than FS that can't work for live migration. Crypto and GPU
> are two other examples, and I'm pretty sure we have more.
> 
The industry has migrated gpu, rdma, nvme, virtio-net and virtio-blk devices, and has suspended/resumed gpu devices too.
So I can imagine it will happen as more device types adopt the device context.

> Until we figure out how they can, we can't say a device context work for all
> types. No?
> 
No. 
> 1) Trying to define a format that works for all types of devices
> 2) Leave the states to be defined by individual device types
> 
> Which method is easy?
> 
Best of both, i.e. generic fields for generic virtio items like vq config, common config and device config area,
and device-specific context fields like RSS, flow filters and counters.
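
One way to picture such a mix is a stream of typed fields, sketched below in Python. This is an assumption-laden illustration: the type numbers, the field names and the 2-byte type plus 4-byte length header are all hypothetical and are not taken from the device context patch.

```python
import struct

# Hypothetical type numbers: a low range for generic virtio items,
# a high range for device-specific ones.
GENERIC_TYPES = {"common_cfg": 0x1, "vq_cfg": 0x2, "dev_cfg": 0x3}
DEVICE_TYPES = {"net_rss": 0x100, "net_flow_filter": 0x101}

HEADER = "<HI"                      # little-endian: u16 type, u32 payload length
HEADER_SIZE = struct.calcsize(HEADER)


def encode_field(ftype, payload):
    """Prefix a payload with its type and length."""
    return struct.pack(HEADER, ftype, len(payload)) + payload


def decode_fields(blob):
    """Split a byte stream back into (type, payload) pairs."""
    fields = []
    offset = 0
    while offset < len(blob):
        ftype, length = struct.unpack_from(HEADER, blob, offset)
        offset += HEADER_SIZE
        fields.append((ftype, blob[offset:offset + length]))
        offset += length
    return fields


blob = (encode_field(GENERIC_TYPES["common_cfg"], b"\x00" * 4)
        + encode_field(DEVICE_TYPES["net_rss"], b"\xaa\xbb"))
for ftype, payload in decode_fields(blob):
    kind = "generic" if ftype < 0x100 else "device-specific"
    print(hex(ftype), kind, len(payload))
```

With a layout of this kind, new device-specific field types can be added later without touching the generic ones.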

> > But I differ to think that.
> > Virtio-fs guest side state wont be changed as part of it.
> > Virtio-fs is the first device which has considered and listed to migrate the
> device state.
> > So it should be possible.
> 
> I wouldn't repeat the discussion of virtio-FS migration here, you can search the
> archives for more details.
> 
> But the point is obvious, it's really hard to say a simple device context can work
> for all types of devices. We should allow a device specific states definition. This
> seems to be agreed by Michael and LingShan.
> 
Sure. I covered this in v2 at [2].
Device specific state definitions will be able to grow.

[2] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html

> >
> > > What is proposed in this series is an ad-hoc optimization for a
> > > specific device type within a specific subsystem (e.g VFIO) in a
> > > specific operating system which is not general.
> > >
> > Oh now you mention vfio. Not me. :)
> >
> > I am not going to comment on this. It is not ad-hoc.
> 
> You need to justify how it is not. Based on the current discussion, you have
> demonstrated a lot of assumptions in order to make your proposal work.
> 
Listed in v2 now.

> > It uses similar dirty page tracking like technique present in cpu hw and other
> devices.
> >
> > > As demonstrated many times, starting from something simple and stupid
> > > is the easiest way.
> > >
> >
> > > > Whatever is already returned is/should not be repeated in
> > > > subsequent reads,
> > > though device can choose to do so.
> > > >
> > > > > >
> > > > > > > > And which software stack may find this useful?
> > > > > > > > Is there any existing software that can utilize it?
> > > > > > >
> > > > > > > Libvirt.
> > > > > > >
> > > > > > Does libvirt restore on migration failure?
> > > > >
> > > > > Yes.
> > > > >
> > > > Ok. the device will be able to resume when it is marked active.
> > > > The device context returned  is the incremental delta as explained above.
> > >
> > > I disagree, see my above reply.
> > I replied above.
> >
> > >
> > > >
> > > > > >
> > > > > > > > Why that device context present with the software
> > > > > > > > vanished, in your
> > > > > > > assumption, if it is?
> > > > > > > >
> > > > > > > > > > Typically, on
> > > > > > > > > > +the source hypervisor, the owner driver reads the
> > > > > > > > > > +device context once when the device is in
> > > > > > > > > > +\field{Active} or \field{Stop} mode and later once
> > > > > > > > > > > +the member device is in \field{Freeze} mode.
> > > > > > > > >
> > > > > > > > > > Why is the read needed while the device context could still change?
> > > > > > > > > Or is the dirty page part of the device context?
> > > > > > > > >
> > > > > > > > > It is not part of the dirty page.
> > > > > > > > > It needs to be read in the active/stop mode, so that it can
> > > > > > > > > be shared with the destination hypervisor, which will pre-setup
> > > > > > > > > the complex context of the device while it is still running on
> > > > > > > > > the source side.
> > > > > > >
> > > > > > > Is such a method used by any hypervisor?
> > > > > > > Yes. QEMU, which uses the vfio interface, uses it.
> > > > >
> > > > > > Ok, such software technology could be used for all types of
> > > > > > devices, I don't see any advantage in mentioning it here unless it's
> > > > > > unique to virtio.
> > > > >
> > > > It is theory of operation that brings the clarity and rationale.
> > >
> > > I think it's not. Since it's not something that is unique to virtio.
> > >
> > > > So I will keep it.
> > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > > +
> > > > > > > > > > +Typically, the device context is read and written one
> > > > > > > > > > +time on the source and the destination hypervisor
> > > > > > > > > > +respectively once the device is in \field{Freeze} mode.
> > > > > > > > > > +On the destination hypervisor, after writing the
> > > > > > > > > > > +device context, when the device mode is set to
> > > > > > > > > > > +\field{Active}, the device uses the most recently set
> > > > > > > > > > > +device context and resumes the device operation.
> > > > > > > > >
> > > > > > > > > There's no context sequence, so this is obvious. It's
> > > > > > > > > the semantic of all other existing interfaces.
> > > > > > > > >
> > > > > > > > > Can you please clarify which existing interfaces you mean here?
> > > > > > >
> > > > > > > For any common cfg member. E.g queue_addr.
> > > > > > >
> > > > > > > The driver wrote 100 different values to queue_addr and the
> > > > > > > device used the value written last time.
> > > > > > >
> > > > > > o.k. I don’t see any problem in stating what is done, which is
> > > > > > less vague. 😊
> > > > > >
> > > > > > > >
> > > > > > > > > > +
> > > > > > > > > > +In an alternative flow, on the source hypervisor the
> > > > > > > > > > +owner driver may choose to read the device context
> > > > > > > > > > +first time while the device is in \field{Active} mode
> > > > > > > > > > > +and second time once the device is in \field{Freeze} mode.
> > > > > > > > >
> > > > > > > > > Who is going to synchronize the device context with
> > > > > > > > > possible configuration from the driver?
> > > > > > > > >
> > > > > > > > Not sure I understand the question.
> > > > > > > > > If I understand you right, do you mean: when a
> > > > > > > > > configuration change is done by the guest driver, how does
> > > > > > > > > the device context change?
> > > > > > > >
> > > > > > >
> > > > > > > Yes.
> > > > > > >
> > > > > > > > If so, device context reading will reflect the new configuration.
> > > > > > >
> > > > > > > How do you do that? For example:
> > > > > > >
> > > > > > > > static inline void vp_iowrite64_twopart(u64 val,
> > > > > > > >                                         __le32 __iomem *lo,
> > > > > > > >                                         __le32 __iomem *hi)
> > > > > > > > {
> > > > > > > >         vp_iowrite32((u32)val, lo);
> > > > > > > >         vp_iowrite32(val >> 32, hi);
> > > > > > > > }
> > > > > > >
> > > > > > > > Is it ok to be frozen in the middle of the two vp_iowrite32() calls?
> > > > > > >
> > > > > > > Yes. The device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> > > > > > > section captures the partial value.
> > > > >
> > > > > > There's no way for the device to know whether or not it's a
> > > > > > partial value. No?
> > > > >
> > > > > The device does not need to know, because when the guest VM and the
> > > > > device are resumed on the destination, the guest VM will continue
> > > > > with writing the 2nd part.
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > > Similarly, on the
> > > > > > > > > > > +destination hypervisor the owner driver writes the device
> > > > > > > > > > > +context first time while the device is still running in
> > > > > > > > > > > +\field{Active} mode on the source hypervisor and
> > > > > > > > > > > +writes the device context second time while the
> > > > > > > > > > > +device is in \field{Freeze} mode.
> > > > > > > > > > > +This flow may result in very short setup time as the
> > > > > > > > > > > +device context likely has minimal changes from the
> > > > > > > > > > > +previously written device context.
> > > > > > > > >
> > > > > > > > > > Is it the hypervisor that is in charge of doing the
> > > > > > > > > > comparison and writing only the delta?
> > > > > > > > >
> > > > > > > > > The spec commands allow doing so. So the possibility exists
> > > > > > > > > spec-wise.
> > > > > > >
> > > > > > > There are various optimizations for migration for sure, I
> > > > > > > don't think mentioning any specific one is good.
> > > > > > >
> > > > > > The text is informative text similar to,
> > > > > >
> > > > > > > " However, some devices benefit from the ability to find out
> > > > > > > the amount of available data in the queue without accessing
> > > > > > > the virtqueue in memory"
> > > > > > >
> > > > > > > " To help with these optimizations, when
> > > > > > > VIRTIO_F_NOTIFICATION_DATA has been negotiated".
> > > > > > >
> > > > > > > Is this the only optimization in virtio? No, but we still
> > > > > > > mention the rationale of why it exists.
> > > > >
> > > > > > The above is a good example as it explains that
> > > > > > VIRTIO_F_NOTIFICATION_DATA is the only way without accessing the
> > > > > > virtqueue. But this is not the case for migration.
> > > > > > You said it's just a possibility but not a must, which is not the
> > > > > > case for VIRTIO_F_NOTIFICATION_DATA.
> > > > >
> > > > > It is one of the optimizations. The comparison is about whether it
> > > > > is one_of_example or not.
> > >
> > > I don't get this.
> > Theory of operation is describing a flow of how things are done and how the
> > constructs are helpful to achieve it.
> 
> Immature optimization doesn't belong to theory for sure. I see your delta
> reporting as immature in many ways. That's the point.
> 
> Thanks
> 
> > And it is not the end of the list.
> > That does not mean one should not write those.
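
The partial two-part write case discussed above can be illustrated with a small model. This is only a sketch under stated assumptions: struct fake_common_cfg, struct fake_ctx_section and all helper names are hypothetical and come from neither the spec nor the Linux driver, and the real layout of the VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG section is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 64-bit common cfg field that the guest
 * driver writes as two 32-bit halves, as vp_iowrite64_twopart() does. */
struct fake_common_cfg {
	uint32_t lo;
	uint32_t hi;
};

/* Hypothetical saved copy of that field inside the device context. */
struct fake_ctx_section {
	uint32_t lo;
	uint32_t hi;
};

/* First half of the guest's two-part write. */
static void guest_write_lo(struct fake_common_cfg *cfg, uint64_t val)
{
	cfg->lo = (uint32_t)val;
}

/* Second half of the guest's two-part write. */
static void guest_write_hi(struct fake_common_cfg *cfg, uint64_t val)
{
	cfg->hi = (uint32_t)(val >> 32);
}

/* Source side: capture whatever the registers hold, partial or not. */
static struct fake_ctx_section ctx_save(const struct fake_common_cfg *cfg)
{
	struct fake_ctx_section s;

	assert(cfg);
	s.lo = cfg->lo;
	s.hi = cfg->hi;
	return s;
}

/* Destination side: load the saved halves back unchanged. */
static void ctx_restore(struct fake_common_cfg *cfg,
			const struct fake_ctx_section *s)
{
	assert(cfg && s);
	cfg->lo = s->lo;
	cfg->hi = s->hi;
}

/* Value the device would use if it combined both halves right now. */
static uint64_t cfg_value(const struct fake_common_cfg *cfg)
{
	return ((uint64_t)cfg->hi << 32) | cfg->lo;
}
```

In this model, freezing after guest_write_lo() and restoring on the destination leaves the new low half and the old high half in place until the resumed guest issues guest_write_hi(), which is the behaviour argued for above: the device does not need to know the value was partial.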



* [virtio-comment] RE: [PATCH v1 4/8] admin: Add device migration admin commands
  2023-10-18  6:46   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-10-18  8:24     ` Parav Pandit
  2023-10-18 10:26       ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  8:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 18, 2023 12:17 PM
> 
> On Sun, Oct 08, 2023 at 02:25:51PM +0300, Parav Pandit wrote:
> 
> ...
> 
> > +\paragraph{Device Context Size Get Command} \label{par:Basic
> > +Facilities of a Virtio Device / Device groups / Group administration
> > +commands / Device Migration / Device Context Size Get Command}
> > +
> > +This command returns the remaining estimated device context size. The
> > +driver can query the remaining estimated device context size for the
> > +current mode or for the \field{Freeze} mode. While reading the device
> > +context using VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the actual
> > +device context size may differ from what is being returned by this
> > +command. After reading the device context using command
> > +VIRTIO_ADMIN_CMD_DEV_CTX_READ, the remaining estimated context size
> > +usually reduces by the amount of device context read by the driver using
> > +VIRTIO_ADMIN_CMD_DEV_CTX_READ command. If the device context is
> > +updated rapidly the remaining estimated context size may also
> > +increase even after reading the device context using
> > +VIRTIO_ADMIN_CMD_DEV_CTX_READ command.
> > +
> > +For the command VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET, \field{opcode}
> > +is set to 0x9.
> > +The \field{group_member_id} refers to the member device to be accessed.
> > +
> > +\begin{lstlisting}
> > +struct virtio_admin_cmd_dev_ctx_size_get_data {
> > +        u8 freeze_mode;
> > +};
> > +\end{lstlisting}
> > +
> > +The \field{command_specific_data} is in the format \field{struct
> > +virtio_admin_cmd_dev_ctx_size_get_data}.
> > +When \field{freeze_mode} is set to 1, the device returns the
> > +estimated device context size when the device will be in \field{Freeze} mode.
> > +As the device context is read from the device, the remaining
> > +estimated context size may decrease. For example, when the member device
> > +mode is \field{Stop} and the device has estimated the total device context
> > +size as 12KB, the device returns 12KB for the first
> > +VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command. Once the driver has
> > +read 8KB of device context data using the
> > +VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the remaining data is 4KB,
> > +hence the device returns 4KB in the subsequent
> > +VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command.
> > +
> > +\begin{lstlisting}
> > +struct virtio_admin_cmd_dev_ctx_size_get_result {
> > +        le64 size;
> > +};
> > +\end{lstlisting}
> 
> So we have a 64 bit size? How are we going to return so much?
> 
I agree it is a lot.
But this is the case because one has defined struct virtio_pci_cap64.
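
The 12KB / 8KB / 4KB example in the quoted text can be sketched as simple bookkeeping. This is a minimal illustration under assumptions, not the device's actual algorithm: struct ctx_estimate and the three helpers are hypothetical names, and a real device is free to estimate differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device-side bookkeeping; not defined by the spec. */
struct ctx_estimate {
	uint64_t total;    /* estimated total device context size, bytes */
	uint64_t consumed; /* bytes already returned via DEV_CTX_READ */
};

/* Models the size reported by VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET. */
static uint64_t ctx_size_get(const struct ctx_estimate *e)
{
	assert(e);
	return e->total > e->consumed ? e->total - e->consumed : 0;
}

/* Models the effect of a DEV_CTX_READ that returned n bytes. */
static void ctx_note_read(struct ctx_estimate *e, uint64_t n)
{
	assert(e);
	e->consumed += n;
}

/* Models the device generating n more bytes of context, so the
 * remaining estimate can grow again even after reads. */
static void ctx_note_growth(struct ctx_estimate *e, uint64_t n)
{
	assert(e);
	e->total += n;
}
```

The clamp to zero in ctx_size_get() reflects that the value is only an estimate: the driver may end up reading more than was reported earlier.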

> 
> > +
> > +When the command completes successfully,
> > +\field{command_specific_result} is in the format
> > +\field{struct virtio_admin_cmd_dev_ctx_size_get_result}.
> > +
> > +Once the device context is fully read, this command returns zero for
> > +\field{size} until the new device context is generated.
> > +
> > +\paragraph{Device Context Read Command} \label{par:Basic Facilities
> > +of a Virtio Device / Device groups / Group administration commands /
> > +Device Migration / Device Context Read Command}
> > +
> > +This command reads the current device context.
> > +For the command VIRTIO_ADMIN_CMD_DEV_CTX_READ, \field{opcode} is
> > +set to 0xa.
> > +The \field{group_member_id} refers to the member device to be accessed.
> > +
> > +This command has no command specific data.
> 
> So I am not sure this is wise. Multi-year experience with QEMU taught us that
> we are likely to make mistakes when defining migration format - forget some
> fields, validate them incorrectly, and so on.
> Making a somewhat safe assumption that we'll make mistakes in the spec, too,
> I'd like to see some kind of idea of how we'll support compatibility and/or
> graceful failure if/when we do.
> 
Very valid point. In v2 I added the compatibility command at [1] as "Device Context Supported Fields Query Command"

[1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00195.html
> 
> > +\begin{lstlisting}
> > +struct virtio_admin_cmd_dev_ctx_rd_len {
> > +        le32 context_len;
> > +};
> > +
> > +struct virtio_admin_cmd_dev_ctx_rd_result {
> > +        u8 data[];
> > +};
> > +\end{lstlisting}
> 
> so the caller needs to pin whatever the device tells it to?
> 
How much memory to pin is the driver's choice. It can pin 4K and read 1MB of context with 1MB/4K read calls,
or it can pin 1MB and read it all in one call.

> admin commands support truncation intentionally.
> 
> it is not clear, to me, that it's ok to have device just save as much state as it
> wants to.
Why would a device store an unreasonable amount of state?

> 
> > +
> > +When the command completes successfully,
> > +\field{command_specific_result} is in the format \field{struct
> > +virtio_admin_cmd_dev_ctx_rd_result}
> > +returned by the device containing the device context data and
> > +\field{command_specific_output} is in format of \field{struct
> > +virtio_admin_cmd_dev_ctx_rd_len} containing length of context data
> > +returned by the device in the command response. When the length
> > +returned is zero or when the returned context data is less than the data
> > +requested by the driver, the device does not have any device context
> > +data left that the device can report; at this point the device context
> > +stream ends.
> > +
> > +The driver can read the whole device context data using one or
> > +multiple commands. When the device context does not fit in the
> > +\field{command_specific_result}, driver reads the subsequent
> > +remaining bytes using one or more subsequent commands.
> 
> how?
For example, the driver requested 100 bytes.
a. The device returned 40 bytes. So the whole context fit; nothing more to be done.
Or
b. The device returned 100 bytes, so the driver does not know whether there is more or not.
Hence, the driver requests an additional 100 bytes in a 2nd call.
If there is nothing left, the device returns success with zero bytes.
If the device has 30 more bytes, it returns 30 bytes. The returned bytes < requested bytes, hence the read stream ends.
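
The read flow in the example above can be sketched as a loop that stops on the first short or zero-length read. This is a sketch under assumptions: struct fake_dev stands in for the member device, and dev_ctx_read() is a hypothetical wrapper for issuing VIRTIO_ADMIN_CMD_DEV_CTX_READ; neither is a real driver API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a member device holding `total` bytes. */
struct fake_dev {
	const uint8_t *ctx;
	size_t total;
	size_t off;
};

/* Hypothetical wrapper for VIRTIO_ADMIN_CMD_DEV_CTX_READ: copies up to
 * `len` bytes into `buf` and returns how many bytes the device returned. */
static size_t dev_ctx_read(struct fake_dev *dev, uint8_t *buf, size_t len)
{
	size_t left = dev->total - dev->off;
	size_t n = left < len ? left : len;

	memcpy(buf, dev->ctx + dev->off, n);
	dev->off += n;
	return n;
}

/* Read the whole stream using one pinned buffer of `chunk` bytes.
 * Returns the total bytes read; *calls gets the number of commands. */
static size_t read_dev_ctx(struct fake_dev *dev, uint8_t *chunk_buf,
			   size_t chunk, unsigned *calls)
{
	size_t total = 0;
	size_t n;

	assert(chunk > 0); /* a zero-sized request could never terminate */
	*calls = 0;
	do {
		n = dev_ctx_read(dev, chunk_buf, chunk);
		(*calls)++;
		/* a real driver would forward chunk_buf[0..n) here */
		total += n;
	} while (n == chunk); /* short or empty read ends the stream */

	return total;
}
```

With a 100-byte buffer, a 40-byte context takes one call, while a 130-byte context takes two (100, then 30). When the context is an exact multiple of the buffer size, one extra zero-length read is needed to learn that the stream has ended, so 1MB read through a 4K buffer takes 256 full reads plus one empty read in this model.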


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  7:20                                       ` Parav Pandit
@ 2023-10-18  8:42                                         ` Zhu, Lingshan
  2023-10-18  8:53                                           ` Michael S. Tsirkin
  2023-10-18  9:48                                           ` Parav Pandit
  0 siblings, 2 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-18  8:42 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/18/2023 3:20 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Wednesday, October 18, 2023 12:22 PM
>>
>> On 10/18/2023 2:41 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Wednesday, October 18, 2023 12:06 PM
>>>>
>>>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu, Lingshan
>>>>>> Sent: Monday, October 16, 2023 3:18 PM
>>>>>>
>>>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Friday, October 13, 2023 3:14 PM
>>>>>>>>>>>> How do you transfer the ownership?
>>>>>>>>>>> An additional ownership delegation by a new admin command.
>>>>>>>>>> if you think this can work, do you want to cook a patch to
>>>>>>>>>> implement this before you submitting this live migration series?
>>>>>>>>> I answered this already above.
>>>>>>>> talk is cheap, show me your patch
>>>>>>> Huh. We presented the infrastructure that migrates 30+ device
>>>>>>> types, covering device context ideas from Oracle.
>>>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
>>>>>>>
>>>>>>> Please have some respect for other members who covered more ground
>>>>>>> than your series.
>>>>>>> What more? Apply the same nested concept on the member device as
>>>>>>> Michael suggested; it is nested virtualization that maintains the
>>>>>>> exact same semantics.
>>>>>>> So a VF is mapped as PF to the L1 guest.
>>>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
>>>>>>>
>>>>>>> This nested work can be extended in future, once first level
>>>>>>> nesting is covered.
>>>>>>>> Answer all questions above, if you think a management VF can
>>>>>>>> work, please show me your patch.
>>>>>>> The idea evolves from technical debate rather than pointing fingers
>>>>>>> like your comment.
>>>>>>> I think a positive discussion with Michael and a pointer to the
>>>>>>> paper from Jason gave a good direction of doing _right_ nesting that
>>>>>>> follows two principles.
>>>>>>> a. efficiency property
>>>>>>> b. equivalence property
>>>>>>>
>>>>>>> (c. resource control is natural already)
>>>>>>>
>>>>>>> Both apply at VMM and at VM level enabling recursive
>>>>>>> virtualization, by having VF that can act as PF inside the guest.
>>>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
>>>>>> Please just show me your patch resolving these opens, how about
>>>>>> start from defining virtio-fs device context and your management VF?
>>>>> As answered, device context infrastructure is done; per-device
>>>>> specific device context will be defined incrementally.
>>>>> I will not be including virtio-fs in this series. It will be done
>>>>> incrementally in future utilizing the infrastructure built in this series.
>>>> Done? How do you conclude this? You just tell me what is the full set
>>>> of virtio-fs device context now and how to migrate them.
>>>>
>>>> You can't? You refuse or you don't? Do you expect the HW designer to
>>>> figure it out by themselves?
>>> I won't be able to tell now as I don’t think it is necessary for this series.
>>> If one out of 30 devices cannot migrate because an unimaginable amount of
>>> complexity has been placed there, maybe one will not implement it as a
>>> member device.
>>> From experience of migratable complex GPU devices and RDMA devices (stateful,
>>> having hundreds of thousands of stateful QPs), my understanding is that the
>>> complex state of virtio-fs can be defined and migrated.
>>> The mlx5 driver consists of 150,000 lines of code and that device is
>>> migratable with complex state.
>>> So I am optimistic that virtio-fs can be migratable too.
>>> It does not have to be limited by my limited creativity of 2023.
>>> Maybe I am wrong; in that case one will not implement a passthrough
>>> virtio-fs device.
>> your series wants to migrate device context, but doesn't define device context,
>> does this sound reasonable?
> The generic device context is defined at [1], along with the infrastructure
> for defining the device context; that definition work can be done in
> parallel by multiple people after [1].
>
> The context for each device type will be defined incrementally after this work.
>
> [1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html
This should not come after this work; you should define them before you use
them in this series.

And you need to prove why admin vq are better than registers solution if 
you want a merge.





* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  8:42                                         ` Zhu, Lingshan
@ 2023-10-18  8:53                                           ` Michael S. Tsirkin
  2023-10-18  9:48                                           ` Parav Pandit
  1 sibling, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18  8:53 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 04:42:53PM +0800, Zhu, Lingshan wrote:
> And you need to prove why admin vq are better than registers solution if you
> want a merge.

First, no one seems to want a register based solution for tracking
memory changes.  So I feel the point should rather be to prove why
sticking some migration features in registers and using dma for others
makes sense.  Second, I guess we could have a register based interface
for admin commands, no?

I really wish both of you guys started looking for solutions that
satisfy all use-cases instead of just going head to head. Lingshan, that
is why I asked you to try and list advantages of the architecture in
Parav's patches. Do you think you could try and address more use-cases?

-- 
MST



* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  8:42                                         ` Zhu, Lingshan
  2023-10-18  8:53                                           ` Michael S. Tsirkin
@ 2023-10-18  9:48                                           ` Parav Pandit
  2023-10-18  9:56                                             ` Michael S. Tsirkin
  2023-10-19  8:15                                             ` Zhu, Lingshan
  1 sibling, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18  9:48 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, October 18, 2023 2:13 PM
> 
> On 10/18/2023 3:20 PM, Parav Pandit wrote:
> >> your series wants to migrate device context, but doesn't define
> >> device context, does this sound reasonable?
> > The generic device context is defined at [1], along with the infrastructure
> > for defining the device context; that definition work can be done in
> > parallel by multiple people after [1].
> >
> > The context for each device type will be defined incrementally after this work.
> >
> > [1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html
> This should not come after this work; you should define them before you use
> them in this series.
> 
I don’t agree to boil the ocean in this patch series.
No practical spec development community does that.
As long as we are comfortable that the device context framework is extensible, it is fine.
If virtio-fs turns out to be very hard, maybe someone will come up with a new lightweight FS device. I really don’t know.

> And you need to prove why admin vq are better than registers solution if you
> want a merge.
Michael already responded on the practical aspects.
Since you may claim that I didn’t answer, the technical details are below.

In my view, admin commands and the admin vq (aq) are better for the following reasons:

Functionally better:
1. When the live migration registers are located on the VF itself, the VMM does not have control of them.
These registers are reset on FLR and on device reset, because they are virtio registers of the device.
Hence, the VMM loses the state for the job that the VMM was supposed to do.
Therefore, passthrough mode cannot depend on these registers.

2. Any bulk data transfer of device context and dirty page tracking requires DMA.
Hence, that DMA must be done by a device that is different from the VF itself.
If it is done on the VF itself, there are two problems.
2.a. VF device reset and FLR will clear them, and the device context is lost.

2.b. The DMA occurs at the PCI RID level.
The IOMMU cannot split the DMA of one RID into two different address spaces, one for the guest and one for the hypervisor.
This requires PASID support.
Using PASID has the following problems.
2.b.1 PASID is typically not used by kernel software. It is only meant for user processes.
Hence, reserving a PASID for kernel work won't be acceptable to the upstream kernel.
2.b.2 If this is somehow done, then when the VF itself supports PASID, it now requires vPASID support.
This is again not where the industry is going in the other forums that I am part of. Hence, it will be a failure for virtio, and I do not recommend the vPASID route.
2.b.3 One of the widely used CPUs seems to have dropped the support due to a limitation of an instruction around PASID.
So it cannot be used there, which further limits virtio passthrough users.

Even if 2.b.2 and 2.b.3 are somehow overcome in theory, #1 and 2.a are functional problems.

Scale wise better:
3. Admin commands and the admin vq are used _only_ when one does device migration.
One does not migrate VMs every few msec.
Hence, such functionality is better done in a way that is efficient for performance but does not consume on-chip memory.
Admin commands and the admin vq satisfy that.

4. Once the software matures further, admin commands would prefer a completion interrupt instead of polling.
How to get the notification/interrupt? Well, virtqueue defines this already.
Should we replicate that in some PF registers?
It can be done. But once you put all the functionality of admin commands and the aq in registers, the whole thing becomes yet another register_q.

5. Can these registers be placed in the PF to overcome #1 and #2 for passthrough?
In theory, yes.
In practice, no, as there are many commands that flow, which need to scale to a reasonable number of VFs.
Admin commands over the admin vq provide this generic facility.

6. Most modern devices that attempt to scale cut down their register footprint; registers are used only for main bootstrap and init-time config work.
Even in the virtio spec, one can read:
"Device configuration space is generally used for rarely changing or initialization-time parameters."

Adding some additional registers to the PF device config space for non-init-time parameters does not make sense.

7. Additionally, nested virtualization should be done by truly nesting the device at the right abstraction point of the owner-member relationship.
This follows the two principles of (a) efficiency and (b) equivalence from the paper that Jason pointed to.
And if we ask for a nested VF extension, we will get guidance from PCI-SIG on why it should be done and whether it matches the rest of the ecosystem components that support or don’t support nesting.


* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  9:48                                           ` Parav Pandit
@ 2023-10-18  9:56                                             ` Michael S. Tsirkin
  2023-10-18 10:22                                               ` Parav Pandit
  2023-10-19  8:15                                             ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18  9:56 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 09:48:55AM +0000, Parav Pandit wrote:
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Wednesday, October 18, 2023 2:13 PM
> > 
> > On 10/18/2023 3:20 PM, Parav Pandit wrote:
> > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >> Sent: Wednesday, October 18, 2023 12:22 PM
> > >>
> > >> On 10/18/2023 2:41 PM, Parav Pandit wrote:
> > >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>> Sent: Wednesday, October 18, 2023 12:06 PM
> > >>>>
> > >>>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
> > >>>>>> From: virtio-comment@lists.oasis-open.org
> > >>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu, Lingshan
> > >>>>>> Sent: Monday, October 16, 2023 3:18 PM
> > >>>>>>
> > >>>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
> > >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>> Sent: Friday, October 13, 2023 3:14 PM
> > >>>>>>>>>>>> How do you transfer the ownership?
> > >>>>>>>>>>> An additional ownership delegation by a new admin command.
> > >>>>>>>>>> if you think this can work, do you want to cook a patch to
> > >>>>>>>>>> implement this before you submitting this live migration series?
> > >>>>>>>>> I answered this already above.
> > >>>>>>>> talk is cheap, show me your patch
> > >>>>>>> Huh. We presented the infrastructure that migrates, 30+ device
> > >>>>>>> types,
> > >>>>>> covering device context ideas from Oracle.
> > >>>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
> > >>>>>>>
> > >>>>>>> Please have some respect for other members who covered more
> > >>>>>>> ground than
> > >>>>>> your series.
> > >>>>>>> What more? Apply the same nested concept on the member device as
> > >>>>>> Michael suggested, it is nested virtualization maintain exact
> > >>>>>> same
> > >> semantics.
> > >>>>>>> So a VF is mapped as PF to the L1 guest.
> > >>>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
> > >>>>>>>
> > >>>>>>> This nested work can be extended in future, once first level
> > >>>>>>> nesting is
> > >>>>>> covered.
> > >>>>>>>> Answer all questions above, if you think a management VF can
> > >>>>>>>> work, please show me your patch.
> > >>>>>>> The idea evolves from technical debate then pointing fingers
> > >>>>>>> like your
> > >>>>>> comment.
> > >>>>>>> I think a positive discussion with Michael and a pointer to the
> > >>>>>>> paper from
> > >>>>>> Jason gave a good direction of doing _right_ nesting that follows
> > >>>>>> two
> > >>>> principles.
> > >>>>>>> a. efficiency property
> > >>>>>>> b. equivalence property
> > >>>>>>>
> > >>>>>>> (c. resource control is natural already)
> > >>>>>>>
> > >>>>>>> Both apply at VMM and at VM level enabling recursive
> > >>>>>>> virtualization, by
> > >>>>>> having VF that can act as PF inside the guest.
> > >>>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
> > >>>>>> Please just show me your patch resolving these opens, how about
> > >>>>>> start from defining virtio-fs device context and your management VF?
> > >>>>> As answered, device context infrastructure is done, per device
> > >>>>> specific device-
> > >>>> context will be defined incrementally.
> > >>>>> I will not be including virtio-fs in this series. It will be done
> > >>>>> incrementally in
> > >>>> future utilizing the infrastructure build in this series.
> > >>>> Done? How do you conclude this? You just tell me what is the full
> > >>>> set of virtio-fs device context now and how to migrate them.
> > >>>>
> > >>>> You cant? you refuse or you don't? Do you expect the HW designer to
> > >>>> figure out by themself?
> > >>> I wont be able to tell now as I don’t think it is necessary for this series.
> > >>> If one out of 30 devices cannot migrate because of unimaginable
> > >>> amount of
> > >> complexity has been placed there, may be one will not implement it as
> > >> member device.
> > >>>   From experience of migratable complex gpu devices, rdma devices
> > >>> (stateful
> > >> having hundred thousand of stateful QPs), my understanding is complex
> > >> state of virtio-fs can be defined and migratable.
> > >>> Mlx5 driver consist of 150,000 lines of code and that device is
> > >>> migratable
> > >> with complex state.
> > >>> So I am optimistic that virtio-fs can be migratable too.
> > >>> It does not have to limited by my limited creativity of 2023.
> > >>> May be I am wrong, in that case one will not implement passthrough
> > >>> virtio-fs
> > >> device.
> > >> your series wants to migrate device context, but doesn't define
> > >> device context, does this sounds reasonable?
> > > Device generic context is defined at [1] and also the infrastructure for defining
> > the device context in parallel by multiple people can be done post the work of
> > [1].
> > >
> > > Per each device type context will be defined incrementally post this work.
> > >
> > > [1]
> > > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.h
> > > tml
> > This is not post of the work, you should define them before you use them in this
> > series.
> > 
> I don’t agree to cook ocean in this patch series.
> No practical spec devel community does it.
> As long as we feel comfortable that device context framework is extendible, it is fine.
> If virtio-fs seems very hard, may be one will come with a new light weight FS device. I really don’t know.
> 
> > And you need to prove why admin vq are better than registers solution if you
> > want a merge.
> Michael already responded the practical aspects.
> Since you may claim, I didn’t answer, below is the technical details.
> 
> Why admin commands and aq is better is because of below reasons in my view:
> 
> Functionally better:
> 1. When the live migration registers are located on the VF itself, VMM does not have control of it.
> These registers reset, on FLR and device reset because these are virtio registers of the device.
> Hence, VMM lost the state for the job that VMM was supposed to do.
> Therefore, passthrough mode cannot depend on these registers.
> 
> 2. Any bulk data transfer of device context and dirty page tracking requires DMA.
> Hence those DMA must happen to the device which is different than VF itself.
> If it is on the VF itself, it has two problems.
> 2.a. VF device reset and FLR will clear them, and device context is lost.
> 
> 2.b. the DMA occurs at the PCI RID level.
> IOMMU cannot bifurcate the DMA of one RID to two different address space of guest and hypervisor.
> This requires PASID support.
> Using PASID has following problems.
> 2.b.1 PASID typically not used by the kernel software. It is only meant for the user processes.
> Hence for kernel work a reserving PASID won't be acceptable upstream kernel.
> 2.b.2 Somehow if this is done, When the VF itself supports PASID, it required now vPASID support.
> This is again not where industry is going in other forums where I am part of. Hence, it will be failure for virtio. Hence, I do not recommend vPASID route.
> 2.b.3 One of the widely used cpu seems to have dropped the support due to limitation of an instruction around PASID.
> So it cannot be used there, this further limits virtio passthrough users.
> 
> Even if somehow 2.b.2 and 2.b.3 is overcome in theory, #1 and 2.a is functional problems.
> 
> Scale wise better:
> 3. Admin command and admin vq are used _only_ when one does device migration command.
> One does not migrate VMs every few msec.
> Hence such functionality to be better be done which is efficient for performance, but without consuming on-chip memory.
> Admin command and admin vq satisfy those.
> 
> 4. Once the software matures further, admin command would prefer completion interrupt, instead of poll.
> How to get notification/interrupt? Well, virtqueue defines this already.
> Should we replicate that in some PF registers?
> It can be. But once you put all the functionalities of admin command and aq in registers the whole thing becomes yet another register_q.
> 
> 5. Can these registers be placed in the PF to overcome #1 and #2 for passthrough?
> In theory yes.
> In practice, no, as there are many commands that flow, which needs to scale to reasonable number of VFs.
> Admin commands over admin vq provides this generic facility.
> 
> 6. Most modern devices who attempts to scale, cut down their register footprint, registers are used only for main bootstrap, init time config work.
> Even in virtio spec, one can read:
> "Device configuration space is generally used for rarely changing or initialization-time parameters."
> 
> Adding some additional registers to a PF device config space for non init time parameters does not make sense.
> 
> 7. Additionally, a nested virtualization should be done by truly nesting the device at right abstraction point of owner-member relationship.
> This follows two principles of (a) efficiency and (b) equivalency of what Jason paper pointed.
> And we ask for nested VF extension we will get our guidance from PCI-SIG, of why it should be done if it is matching with rest of the ecosystem components that support/don’t support the nesting.


For completeness, and to shorten the thread, can you please list known
issues/use cases that are addressed by the status bit interface and how
you plan for them to be addressed?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  8:16                     ` Parav Pandit
@ 2023-10-18 10:19                       ` Michael S. Tsirkin
  2023-10-18 10:33                         ` Parav Pandit
  2023-10-19  2:41                       ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18 10:19 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Wed, Oct 18, 2023 at 08:16:01AM +0000, Parav Pandit wrote:
> > This doesn't happen yet. For example, a VF with adminq that can be isolated
> > with PASID makes some sense.
> PASID is for the process of user space.
> Kernel space do not consume a PASID.

It could if we need it to.

> Depending on the PASID can work for mediation approach only.
> Last I heard is the cpu or kernel took the support out of it as some special instruction did not work for that cpu.

Are you talking about ENQCMD things? That's nice but I don't think it's
necessary for isolation here.

...

> > The reason why I see it is different from FLR is that
> > 
> > 1) D3cold requires the VF to be off the power
> > 2) State transition might takes more than what FLR did, PCI seems only cover
> > the minimum delay but not maximum which may have implications for
> > downtime
> > 
> D3cold is not controlled by the guest driver.
> PM register can change D0 to D3hot.

It likely makes sense to document how all this interacts with things
like PM and other PCI config. Maybe you did in v2; I didn't look at it yet.





^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  9:56                                             ` Michael S. Tsirkin
@ 2023-10-18 10:22                                               ` Parav Pandit
  2023-10-18 10:47                                                 ` Michael S. Tsirkin
  2023-10-23  3:44                                                 ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18 10:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 18, 2023 3:26 PM

> For completeness, and to shorten the thread, can you please list known
> issues/use cases that are addressed by the status bit interface and how you plan
> for them to be addressed?

For the moment, I will avoid listing the known issues with the status bit in this email.

The status bit interface helps in the following ways.
1. The guest can fully suspend/resume the device by negotiating the new feature.
This can be useful in the guest-controlled PM flows of suspend/resume.
I still think only a feature bit is necessary for this, and device_status modification is not needed.
The PCI D0->D3 and D3->D0 transitions can suspend and resume the device, which can preserve the last device_status value from before entering D3.
(Just as it preserves all the other fields of the common and other device config.)
This is orthogonal and needed regardless of device migration.

2. If one does not want to pass through a member device, but instead builds a mediation-based device on top of an existing virtio device,
it can be useful to the mediating software.
Here the mediating software duplicates much of the knowledge that the member device already has.
This can fulfill the nested requirement differently, provided the platform supports it.
(The PASID limitation will be a practical blocker here.)

How do I plan to address the above two?
a. #1 is to be addressed by having the _F_PM bit; when the bit is negotiated, PCI PM drives the state.
This works orthogonally to VMM-side migration and co-exists with VMM-based device migration.

b. Nested use case:
The L0 VMM maps a VF to the L1 guest as a PF with an emulated SR-IOV capability.
The L1 guest enables SR-IOV and maps a VF to the L2 guest.
This requires consulting the industry ecosystem to support nesting outside of virtio.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* [virtio-comment] Re: [PATCH v1 4/8] admin: Add device migration admin commands
  2023-10-18  8:24     ` [virtio-comment] " Parav Pandit
@ 2023-10-18 10:26       ` Michael S. Tsirkin
  2023-10-18 10:41         ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18 10:26 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 08:24:56AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Wednesday, October 18, 2023 12:17 PM
> > 
> > On Sun, Oct 08, 2023 at 02:25:51PM +0300, Parav Pandit wrote:
> > 
> > ...
> > 
> > > +\paragraph{Device Context Size Get Command} \label{par:Basic
> > > +Facilities of a Virtio Device / Device groups / Group administration
> > > +commands / Device Migration / Device Context Size Get Command}
> > > +
> > > +This command returns the remaining estimated device context size. The
> > > +driver can query the remaining estimated device context size for the
> > > +current mode or for the \field{Freeze} mode. While reading the device
> > > +context using VIRTIO_ADMIN_CMD_DEV_CTX_READ command, the actual
> > > +device context size may differ than what is being returned by this
> > > +command. After reading the device context using command
> > > +VIRTIO_ADMIN_CMD_DEV_CTX_READ, the remaining estimated context
> > size
> > > +usually reduces by amount of device context read by the driver using
> > > +VIRTIO_ADMIN_CMD_DEV_CTX_READ command. If the device context is
> > > +updated rapidly the remaining estimated context size may also
> > > +increase even after reading the device context using
> > VIRTIO_ADMIN_CMD_DEV_CTX_READ command.
> > > +
> > > +For the command VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET, \field{opcode}
> > is set to 0x9.
> > > +The \field{group_member_id} refers to the member device to be accessed.
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_admin_cmd_dev_ctx_size_get_data {
> > > +        u8 freeze_mode;
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +The \field{command_specific_data} is in the format \field{struct
> > > +virtio_admin_cmd_dev_ctx_size_get_data}.
> > > +When \field{freeze_mode} is set to 1, the device returns the
> > > +estimated device context size when the device will be in \field{Freeze} mode.
> > > +As the device context is read from the device, the remaining
> > > +estimated context size may decrease. For example, member device mode
> > > +is \field{Stop}, the device has estimated total device context size
> > > +as 12KB; the device would return 12KB for the first
> > > +VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command, once the driver has
> > > +already read 8KB of device context data using
> > > +VIRTIO_ADMIN_CMD_DEV_CTX_READ command, and the remaining data is
> > 4KB,
> > > +hence the device returns 4KB in the subsequent
> > > +VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command.
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_admin_cmd_dev_ctx_size_get_result {
> > > +        le64 size;
> > > +};
> > > +\end{lstlisting}
> > 
> > So we have a 64 bit size? How are we going to return so much?
> > 
> I agree it is a lot.
> But this is the case because one has defined struct virtio_pci_cap64.

Good point.

That's there so we can define 64 bit resources.

But if you have them and you want to migrate then you need change
tracking and incremental state save, you can not do it in
one shot with device context.
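
As an illustration of the remaining-size semantics in the quoted patch text (12KB reported first, then 4KB after 8KB has been read, then zero), here is a minimal Python sketch. The class `ContextSizeModel` and the helper `decode_size` are hypothetical names invented for this example; only the single le64 `size` field comes from the patch. The model ignores the case where the estimate grows because the context is updated while it is being read.

```python
import struct


class ContextSizeModel:
    """Toy model of the remaining-size estimate (hypothetical, not a real device)."""

    def __init__(self, total_bytes):
        self.remaining = total_bytes  # estimated bytes still to be read

    def size_get(self):
        # Result layout from the patch: a single little-endian 64-bit 'size'.
        return struct.pack("<Q", self.remaining)

    def read(self, nbytes):
        # Consume up to nbytes of context and return how many were consumed.
        got = min(nbytes, self.remaining)
        self.remaining -= got
        return got


def decode_size(result):
    """Decode the le64 size field from a SIZE_GET result buffer."""
    (size,) = struct.unpack("<Q", result)
    return size
```

With a 12KB context, `decode_size(dev.size_get())` yields 12288 before any read and 4096 after `dev.read(8192)`, matching the example in the patch text.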


> > 
> > > +
> > > +When the command completes successfully,
> > > +\field{command_specific_result} is in the format \field{struct
> > virtio_admin_cmd_dev_ctx_size_get_result}.
> > > +
> > > +Once the device context is fully read, this command returns zero for
> > > +\field{size} until the new device context is generated.
> > > +
> > > +\paragraph{Device Context Read Command} \label{par:Basic Facilities
> > > +of a Virtio Device / Device groups / Group administration commands /
> > > +Device Migration / Device Context Read Command}
> > > +
> > > +This command reads the current device context.
> > > +For the command VIRTIO_ADMIN_CMD_DEV_CTX_READ, \field{opcode} is
> > set to 0xa.
> > > +The \field{group_member_id} refers to the member device to be accessed.
> > > +
> > > +This command has no command specific data.
> > 
> > So I am not sure this is wise. Multi-year experience with QEMU taught us that
> > we are likely to make mistakes when defining migration format - forget some
> > fields, validate them incorrectly, and so on.
> > Making a somewhat safe assumption that we'll make mistakes in the spec, too,
> > I'd like to see some kind of idea of how we'll support compatibility and/or
> > graceful failure if/when we do.
> > 
> Very valid point. In v2 I added the compatibility command at [1] as "Device Context Supported Fields Query Command"
> 
> [1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00195.html
> > 
> > > +\begin{lstlisting}
> > > +struct virtio_admin_cmd_dev_ctx_rd_len {
> > > +        le32 context_len;
> > > +};
> > > +
> > > +struct virtio_admin_cmd_dev_ctx_rd_result {
> > > +        u8 data[];
> > > +};
> > > +\end{lstlisting}
> > 
> > so callers needs to pin whatever device tells it to?
> > 
> How much memory to pin is driver choice, it can pin 4K and keep reading 1MB memory by 1MB/4K size calls.
> Or it can pin 1MB and read full.
> 
> > admin commands support truncation intentionally.
> > 
> > it is not clear, to me, that it's ok to have device just save as much state as it
> > wants to.
> Why would device store some unreasonable amount of state?

Because I know how hardware vendor minds work: oh we can request
a bunch of memory, let's do that just in case ;)

> > 
> > > +
> > > +When the command completes successfully,
> > > +\field{command_specific_result} is in the format \field{struct
> > > +virtio_admin_cmd_dev_ctx_rd_result}
> > > +returned by the device containing the device context data and
> > > +\field{command_specific_output} is in format of \field{struct
> > > +virtio_admin_cmd_dev_ctx_rd_len} containing length of context data
> > > +returned by the device in the command response. When the length
> > > +returned is zero or when the returned context data is less the data
> > > +requested by the driver, the device do not have any device context
> > > +data left that the device can report, at this point the device context stream
> > ends.
> > > +
> > > +The driver can read the whole device context data using one or
> > > +multiple commands. When the device context does not fit in the
> > > +\field{command_specific_result}, driver reads the subsequent
> > > +remaining bytes using one or more subsequent commands.
> > 
> > how?
> For example, driver requested 100 bytes.
> a. Device returned 40 bytes. So the whole context fit, nothing more to be done.
> Or
> b. Device returned 100 bytes, so driver does not know if there is more or not.
> Hence, driver request again for additional 100 bytes in 2nd call.
> If there is nothing, device returns success with zero bytes.
> If the device has 30 bytes more it returns 30 bytes. The returned bytes < requested bytes hence the read stream ends.

do you then expect device to maintain all data in some internal buffer,
then keep an offset internally and each following command returns a
different chunk?  all this sounds rather fragile, not to mention
wasteful wrt using memory for that internal buffer.
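
The read flow described in the quoted reply (request N bytes; a short or zero-length result ends the stream) can be sketched roughly as below. This is only a model under stated assumptions: `FakeDevice`, `ctx_read` and `read_whole_context` are hypothetical names, and the internal `_offset` is exactly the per-device state being questioned here, not something the patch specifies.

```python
class FakeDevice:
    """Stand-in for a member device; keeps an internal read offset (assumption)."""

    def __init__(self, context):
        self._context = context
        self._offset = 0

    def ctx_read(self, max_len):
        # Return the next chunk, at most max_len bytes (may be empty).
        chunk = self._context[self._offset:self._offset + max_len]
        self._offset += len(chunk)
        return chunk


def read_whole_context(dev, chunk_len):
    """Read until a short or empty chunk signals the end of the stream.

    Returns (data, number_of_commands_issued).
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    data = bytearray()
    commands = 0
    while True:
        chunk = dev.ctx_read(chunk_len)
        commands += 1
        data += chunk
        if len(chunk) < chunk_len:  # short or zero-length read: stream ended
            return bytes(data), commands
```

Note that when the context length is an exact multiple of the request size, the driver needs one extra command that returns zero bytes before it can tell the stream has ended.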

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18 10:19                       ` Michael S. Tsirkin
@ 2023-10-18 10:33                         ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18 10:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 18, 2023 3:50 PM
> 
> On Wed, Oct 18, 2023 at 08:16:01AM +0000, Parav Pandit wrote:
> > > This doesn't happen yet. For example, a VF with adminq that can be
> > > isolated with PASID makes some sense.
> > PASID is for the process of user space.
> > Kernel space do not consume a PASID.
> 
> It could if we need it to.
> 
> > Depending on the PASID can work for mediation approach only.
> > Last I heard is the cpu or kernel took the support out of it as some special
> instruction did not work for that cpu.
> 
> Are you talking about ENQCMD things? That's nice but I don't think it's
> necessary for isolation here.
>
Yes. I believe SVA is disabled.
 
> ...
> 
> > > The reason why I see it is different from FLR is that
> > >
> > > 1) D3cold requires the VF to be off the power
> > > 2) State transition might takes more than what FLR did, PCI seems
> > > only cover the minimum delay but not maximum which may have
> > > implications for downtime
> > >
> > D3cold is not controlled by the guest driver.
> > PM register can change D0 to D3hot.
> 
> It likely makes sense to document how all this interacts with things like PM and
> other pci config. maybe you did in v2 didn't look at it yet.
> 
Yeah, v2 covers it; please take a look.
I didn't repeat what PCI already covers for D3hot and D3cold.



^ permalink raw reply	[flat|nested] 341+ messages in thread

* [virtio-comment] RE: [PATCH v1 4/8] admin: Add device migration admin commands
  2023-10-18 10:26       ` [virtio-comment] " Michael S. Tsirkin
@ 2023-10-18 10:41         ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18 10:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 18, 2023 3:56 PM

> On Wed, Oct 18, 2023 at 08:24:56AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Wednesday, October 18, 2023 12:17 PM
> > >
> > > On Sun, Oct 08, 2023 at 02:25:51PM +0300, Parav Pandit wrote:
> > >
> > > ...
> > >
> > > > +\paragraph{Device Context Size Get Command} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device groups / Group
> > > > +administration commands / Device Migration / Device Context Size
> > > > +Get Command}
> > > > +
> > > > +This command returns the remaining estimated device context size.
> > > > +The driver can query the remaining estimated device context size
> > > > +for the current mode or for the \field{Freeze} mode. While
> > > > +reading the device context using VIRTIO_ADMIN_CMD_DEV_CTX_READ
> > > > +command, the actual device context size may differ than what is
> > > > +being returned by this command. After reading the device context
> > > > +using command VIRTIO_ADMIN_CMD_DEV_CTX_READ, the remaining
> > > > +estimated context
> > > size
> > > > +usually reduces by amount of device context read by the driver
> > > > +using VIRTIO_ADMIN_CMD_DEV_CTX_READ command. If the device
> > > > +context is updated rapidly the remaining estimated context size
> > > > +may also increase even after reading the device context using
> > > VIRTIO_ADMIN_CMD_DEV_CTX_READ command.
> > > > +
> > > > +For the command VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET,
> \field{opcode}
> > > is set to 0x9.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_admin_cmd_dev_ctx_size_get_data {
> > > > +        u8 freeze_mode;
> > > > +};
> > > > +\end{lstlisting}
> > > > +
> > > > +The \field{command_specific_data} is in the format \field{struct
> > > > +virtio_admin_cmd_dev_ctx_size_get_data}.
> > > > +When \field{freeze_mode} is set to 1, the device returns the
> > > > +estimated device context size when the device will be in \field{Freeze}
> mode.
> > > > +As the device context is read from the device, the remaining
> > > > +estimated context size may decrease. For example, member device
> > > > +mode is \field{Stop}, the device has estimated total device
> > > > +context size as 12KB; the device would return 12KB for the first
> > > > +VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command, once the driver
> has
> > > > +already read 8KB of device context data using
> > > > +VIRTIO_ADMIN_CMD_DEV_CTX_READ command, and the remaining
> data is
> > > 4KB,
> > > > +hence the device returns 4KB in the subsequent
> > > > +VIRTIO_ADMIN_CMD_DEV_CTX_SIZE_GET command.
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_admin_cmd_dev_ctx_size_get_result {
> > > > +        le64 size;
> > > > +};
> > > > +\end{lstlisting}
> > >
> > > So we have a 64 bit size? How are we going to return so much?
> > >
> > I agree it is a lot.
> > But this is the case because one has defined struct virtio_pci_cap64.
> 
> Good point.
> 
> That's there so we can define 64 bit resources.
> 
> But if you have them and you want to migrate then you need change tracking
> and incremental state save, you can not do it in one shot with device context.
>
Each read of the device context would return a different piece of incremental state.
It would probably be good to have an explicit signal in the device context read to indicate end of stream, instead of relying on the size.
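
A rough Python sketch of what such an explicit end-of-stream signal could look like from the driver side. Everything here is hypothetical: the `end_of_stream` flag, `FlaggedDevice` and `read_with_flag` are not part of the posted patch, they only illustrate the idea.

```python
class FlaggedDevice:
    """Hypothetical device whose read result carries an end-of-stream flag."""

    def __init__(self, context):
        self._context = context
        self._offset = 0

    def ctx_read(self, max_len):
        # Return (chunk, end_of_stream); the flag is an assumed extension.
        chunk = self._context[self._offset:self._offset + max_len]
        self._offset += len(chunk)
        end_of_stream = self._offset >= len(self._context)
        return chunk, end_of_stream


def read_with_flag(dev, chunk_len):
    """Read until the device reports end of stream.

    Returns (data, number_of_commands_issued).
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    data = bytearray()
    commands = 0
    while True:
        chunk, end_of_stream = dev.ctx_read(chunk_len)
        commands += 1
        data += chunk
        if end_of_stream:
            return bytes(data), commands
```

With the flag, a 200-byte context read in 100-byte chunks completes in two commands; inferring the end from a short read would need a third, zero-length one.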
 
> 
> > >
> > > > +
> > > > +When the command completes successfully,
> > > > +\field{command_specific_result} is in the format \field{struct
> > > virtio_admin_cmd_dev_ctx_size_get_result}.
> > > > +
> > > > +Once the device context is fully read, this command returns zero
> > > > +for \field{size} until the new device context is generated.
> > > > +
> > > > +\paragraph{Device Context Read Command} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device groups / Group
> > > > +administration commands / Device Migration / Device Context Read
> > > > +Command}
> > > > +
> > > > +This command reads the current device context.
> > > > +For the command VIRTIO_ADMIN_CMD_DEV_CTX_READ, \field{opcode}
> is
> > > set to 0xa.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +This command has no command specific data.
> > >
> > > So I am not sure this is wise. Multi-year experience with QEMU
> > > taught us that we are likely to make mistakes when defining
> > > migration format - forget some fields, validate them incorrectly, and so on.
> > > Making a somewhat safe assumption that we'll make mistakes in the
> > > spec, too, I'd like to see some kind of idea of how we'll support
> > > compatibility and/or graceful failure if/when we do.
> > >
> > Very valid point. In v2 I added the compatibility command at [1] as "Device Context Supported Fields Query Command".
> >
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00195.html
> > >
> > > > +\begin{lstlisting}
> > > > +struct virtio_admin_cmd_dev_ctx_rd_len {
> > > > +        le32 context_len;
> > > > +};
> > > > +
> > > > +struct virtio_admin_cmd_dev_ctx_rd_result {
> > > > +        u8 data[];
> > > > +};
> > > > +\end{lstlisting}
> > >
> > > so callers needs to pin whatever device tells it to?
> > >
> > How much memory to pin is the driver's choice: it can pin 4K and read 1MB of context by issuing 1MB/4K (256) calls,
> > or it can pin 1MB and read everything in one call.
> >
> > > admin commands support truncation intentionally.
> > >
> > > it is not clear, to me, that it's ok to have device just save as
> > > much state as it wants to.
> > Why would device store some unreasonable amount of state?
> 
> Because I know how hardware vendor minds work: oh we can request a bunch
> of memory, let's do that just in case ;)

For a device that has a large memory, it is unlikely that it can keep a second copy of it.

> 
> > >
> > > > +
> > > > +When the command completes successfully,
> > > > +\field{command_specific_result} is in the format \field{struct
> > > > +virtio_admin_cmd_dev_ctx_rd_result}
> > > > +returned by the device containing the device context data and
> > > > +\field{command_specific_output} is in format of \field{struct
> > > > +virtio_admin_cmd_dev_ctx_rd_len} containing length of context
> > > > +data returned by the device in the command response. When the
> > > > +length returned is zero or when the returned context data is less
> > > > +than the data requested by the driver, the device does not have any
> > > > +device context data left that the device can report; at this
> > > > +point the device context stream ends.
> > > > +
> > > > +The driver can read the whole device context data using one or
> > > > +multiple commands. When the device context does not fit in the
> > > > +\field{command_specific_result}, driver reads the subsequent
> > > > +remaining bytes using one or more subsequent commands.
> > >
> > > how?
> > For example, the driver requested 100 bytes.
> > a. The device returned 40 bytes. So the whole context fit, nothing more to be done.
> > Or
> > b. The device returned 100 bytes, so the driver does not know if there is more or not.
> > Hence, the driver requests an additional 100 bytes in a 2nd call.
> > If there is nothing, the device returns success with zero bytes.
> > If the device has 30 bytes more, it returns 30 bytes. The returned bytes < requested bytes, hence the read stream ends.
> 
> do you then expect device to maintain all data in some internal buffer, then
> keep an offset internally and each following command returns a different
> chunk?  all this sounds rather fragile, not to mention wasteful wrt using
> memory for that internal buffer.

For a small device whose context is in the range of a few hundred KB and changes often, this should be fine, as it is only transient memory.
For a device with large memory, an internal buffer may not work (depending on the device implementation); it can fetch from the running hw context and return the data without another copy.
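
As an illustration only, the chunked read discussed above (request a fixed
number of bytes, stop when the device returns zero or fewer bytes than
requested) can be sketched in C. The mock device, its internal read offset,
and the helper names mock_dev_ctx_read() and read_dev_ctx() are assumptions
made for this sketch; they are not part of the patch or of the virtio
specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a member device: it exposes a context
 * stream and remembers how much of it has been returned so far. */
struct mock_dev {
	const uint8_t *ctx;	/* device context bytes */
	size_t ctx_len;		/* total length of the context */
	size_t off;		/* bytes already returned to the driver */
};

/* Hypothetical model of one device context read command: copies up to
 * req_len bytes into buf and returns the length actually returned
 * (what the patch calls context_len). */
static uint32_t mock_dev_ctx_read(struct mock_dev *dev, uint8_t *buf,
				  uint32_t req_len)
{
	size_t left = dev->ctx_len - dev->off;
	uint32_t n = left < req_len ? (uint32_t)left : req_len;

	memcpy(buf, dev->ctx + dev->off, n);
	dev->off += n;
	return n;
}

/* Driver side: read the stream in chunk_len sized requests into out.
 * Returns the total bytes read; *cmds is the number of commands issued. */
static size_t read_dev_ctx(struct mock_dev *dev, uint8_t *out,
			   size_t out_cap, uint32_t chunk_len,
			   unsigned *cmds)
{
	size_t total = 0;

	assert(chunk_len > 0);
	*cmds = 0;
	for (;;) {
		uint32_t got;

		/* Stop if the output buffer cannot hold a full request. */
		if (out_cap - total < chunk_len)
			break;
		got = mock_dev_ctx_read(dev, out + total, chunk_len);
		(*cmds)++;
		total += got;
		/* A zero-length or short response ends the stream. */
		if (got < chunk_len)
			break;
	}
	return total;
}
```

Note that a context whose length is an exact multiple of the request size
costs one extra command that returns zero bytes, which is one reason an
explicit end-of-stream indication may be preferable.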

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18 10:22                                               ` Parav Pandit
@ 2023-10-18 10:47                                                 ` Michael S. Tsirkin
  2023-10-18 10:57                                                   ` Parav Pandit
  2023-10-19  8:18                                                   ` Zhu, Lingshan
  2023-10-23  3:44                                                 ` Jason Wang
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-18 10:47 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 10:22:57AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Wednesday, October 18, 2023 3:26 PM
> 
> > For completeness, and to shorten the thread, can you please list known
> > issues/use cases that are addressed by the status bit interface and how you plan
> > for them to be addressed?
> 
> I will avoid listing known issues for a moment for status bit in this email.
> 
> Status bit interface helps in following good ways.
> 1. suspend/resume the device fully by the guest by negotiating the new feature.
> This can be useful in the guest-controlled PM flows of suspend/resume.
> I still think for this, only feature bit is necessary, and device_status modification is not needed.
> D0->D3 and D3->D0 transition of the pci can suspend and resume the device which can preserve the last device_status value before entering D3.
> (Like preserving all rest of the fields of common and other device config).
> This is orthogonal and needed regardless of device migration.
> 
> 2. If one does not want to passthrough a member device, but build a mediation-based device on top of existing virtio device,
> It can be useful with mediating software.
> Here the mediating software has ample duplicated knowledge of what the member device already has.
> This can fulfil the nested requirement differently provided a platform support it.
> (PASID limitation will be practical blocker here).
> 
> How to I plan to address above two?
> a. #1 to be addressed by having the _F_PM bit, when the bit is negotiated PCI PM drives the state.
> This will work orthogonal to VMM side migration and will co-exist with VMM based device migration.

OK that sounds kind of reasonable. Lingshan, Jason are you interested in
suspend/resume? Want to start a thread on best way to support that?

> b. nested use case:
> L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> Consulting industry ecosystem to support nested outside of virtio.

Can't say I like this much, *a lot* of things to implement,
and burning up a VF for control path is not nice.
As an alternative, I suggest a new admin command pci capability
with basically a PA and a valid bit. Easy to emulate and add to
a VF. And maybe some way to suggest a safe place for it that
won't conflict with anything? Still trying to figure out if
we should add PASID in there, or what. Maybe optionally?
If actual hardware does it we'd be burning up 20 bits,
but for a software implementation it's free.

-- 
MST



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18 10:47                                                 ` Michael S. Tsirkin
@ 2023-10-18 10:57                                                   ` Parav Pandit
  2023-10-19  8:18                                                   ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-18 10:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Wednesday, October 18, 2023 4:17 PM
> 
> On Wed, Oct 18, 2023 at 10:22:57AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Wednesday, October 18, 2023 3:26 PM
> >
> > > For completeness, and to shorten the thread, can you please list
> > > known issues/use cases that are addressed by the status bit
> > > interface and how you plan for them to be addressed?
> >
> > I will avoid listing known issues for a moment for status bit in this email.
> >
> > Status bit interface helps in following good ways.
> > 1. suspend/resume the device fully by the guest by negotiating the new
> feature.
> > This can be useful in the guest-controlled PM flows of suspend/resume.
> > I still think for this, only feature bit is necessary, and device_status
> modification is not needed.
> > D0->D3 and D3->D0 transition of the pci can suspend and resume the device
> which can preserve the last device_status value before entering D3.
> > (Like preserving all rest of the fields of common and other device config).
> > This is orthogonal and needed regardless of device migration.
> >
> > 2. If one does not want to passthrough a member device, but build a
> > mediation-based device on top of existing virtio device, It can be useful with
> mediating software.
> > Here the mediating software has ample duplicated knowledge of what the
> member device already has.
> > This can fulfil the nested requirement differently provided a platform support
> it.
> > (PASID limitation will be practical blocker here).
> >
> > How to I plan to address above two?
> > a. #1 to be addressed by having the _F_PM bit, when the bit is negotiated PCI
> PM drives the state.
> > This will work orthogonal to VMM side migration and will co-exist with VMM
> based device migration.
> 
> OK that sounds kind of reasonable. Lingshan, Jason are you interested in
> suspend/resume? Want to start a thread on best way to support that?
>
There is already a thread on it from AMD, who were insisting on changing the behavior of reset under suspend.
Just a feature bit would be sufficient, without any weird breakage of reset.
But I will let Jason/Lingshan comment in case I missed something.
 
> > b. nested use case:
> > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> > Consulting industry ecosystem to support nested outside of virtio.

> As an alternative, I suggest a new admin command pci capability with basically a
> PA and a valid bit. Easy to emulate and add to a VF. And maybe some way to
> suggest a safe place for it that won't conflict with anything? Still trying to figure
> out if we should add PASID in there, or what. Maybe optionally?
> If actual hardware does it we'd be burning up 20 bits, but for a software
> implementation it's free.
The above scheme is a second solution, for non-passthrough, that utilizes the admin commands of this series.
Did I understand you right?


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  8:16                     ` Parav Pandit
  2023-10-18 10:19                       ` Michael S. Tsirkin
@ 2023-10-19  2:41                       ` Jason Wang
  1 sibling, 0 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-19  2:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, mst@redhat.com,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas, Zhu Lingshan

On Wed, Oct 18, 2023 at 4:16 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 17, 2023 7:12 AM
> >
> > On Fri, Oct 13, 2023 at 2:36 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Friday, October 13, 2023 6:46 AM
> > >
> > > [..]
> > > > > > > > It's still not clear to me how this is done.
> > > > > > > >
> > > > > > > > 1) guest starts FLR
> > > > > > > > 2) adminq freeze the VF
> > > > > > > > 3) FLR is done
> > > > > > > >
> > > > > > > > If the freezing doesn't wait for the FLR, does it mean we
> > > > > > > > need to migrate to a state like FLR is pending? If yes, do
> > > > > > > > we need to migrate the other sub states like this? If not, why?
> > > > > > > >
> > > > > > > In most practical cases #2 followed by #1 should not happen as
> > > > > > > on the source
> > > > > > side the expected is mode change to stop from active.
> > > > > >
> > > > > > How does the hypervisor know if a guest is doing what without trapping?
> > > > > >
> > > > > Hypervisor does not know. The device knows being the recipient of #1 and
> > #2.
> > > >
> > > > We are discussing the possibility in software/driver side isn't it?
> > > >
> > > > 1) is initiated from the guest
> > > > 2) is initiated from the hypervisor
> > > >
> > > > Both are softwares, and you're saying 2) should not happen after 1)
> > > > since the device knows what is being done by guests? How can devices
> > > > control software behaviour?
> > > >
> > > Device do not control software behavior.
> > > i.e. either hypervisor can initiate device mode change to stop (not freeze) or
> > guest can initiate FLR.
> > > Device knows which is initiated first as single recipient of both.
> > > Therefore, device responds accordingly.
> > > For example, in the sequence you described, A device will delay mode
> > > change command response, until the FLR is completed.
> >
> > Finally but ok.
> >
> > >
> > >
> > > > This only possible thing is to make sure 3) is done before 2) That
> > > > is what I'm asking but you are saying freeze doesn't need to wait for FLR...
> > > >
> > > I think I responded in previous email further down on synchronization point
> > being fw.
> > > I meant to say software do not need to wait for initiation of the freeze mode
> > command.
> >
> > For software, did you mean the hypervisor?
> >
> Yes.
>
> > > Just the command will complete at right time.
> > >
> > > This is anyway very corner case.
> > > On source hypervisor as written in the theory of operation, the sequence is
> > active->stop->freeze.
> > > When mode change is done to stop, the vcpus are already suspended.
> >
> > The problem here is not the vcpu but the when FLR is being done since it may
> > change the device context.
> >
> Once the freeze mode transition is completed, the hypervisor sw reads the final device context to migrate.
>
> > >
> > > I agree FLR may have been initiated and driver is waiting now for 100msec.
> >
> > For driver, did you mean the driver in the guest?
> >
> Yes for 100msec wait time, it is the guest driver.
>
> > >
> > > So yes, device single entity synchronized it.
> > >
> > > > >
> > > > > > > But ok, since we active to freeze mode change is allowed, lets
> > > > > > > discuss
> > > > above.
> > > > > > >
> > > > > > > A device is the single synchronization point for any device
> > > > > > > reset, FLR or admin
> > > > > > command operation.
> > > > > >
> > > > > > So you agree we need synchronization? And I'm not sure I get the
> > > > > > meaning of synchronization point, do you mean the
> > > > > > synchronization between freeze/stop and virtio facilities?
> > > > > >
> > > > > Synchronization means, handling two events in parallel such as FLR and
> > other.
> > > >
> > > > Great. So we have a perfect race:
> > > >
> > > > 1) guest initiates FLR
> > > > 2) device start FLR
> > > > 3) hypervisor stop and freeze the device
> > > > 4) device is freeze
> > > > 5) hypervisor read device context A
> > > > 6) migrate device contextA
> > > > 8) migration is done
> > > > 9) FLR is done
> > > > 10) hypervisor read device context B
> > > >
> > > > So we end up with inconsistent device context, no? Dest want B or
> > > > A+B, but you give A.
> > > >
> > > Since #1 and #2 are done before #3, the device knows to finish the FLR; hence #9 is completed before #4.
> >
> > Ok, that's my understanding and that's why I'm asking, but you said freeze/stop
> > doesn't need to wait for FLR before.
> >
> The hypervisor side does not need to wait before issuing the freeze/stop command.
> The device just completes the command later if, in this corner case, an FLR was ongoing.
> I covered this in v2 now.
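
For illustration, the ordering described above (the device, as the single
recipient of both events, defers completion of the mode change command until
an in-flight FLR has finished) could be modelled roughly as below. This is
only a sketch: struct member_dev, the function names and the
single-pending-command simplification are assumptions for the example, not
text from the series.

```c
#include <assert.h>
#include <stdbool.h>

enum dev_mode { MODE_ACTIVE, MODE_STOP, MODE_FREEZE };

/* Hypothetical device-side bookkeeping for one member device. */
struct member_dev {
	enum dev_mode mode;		/* current device mode */
	bool flr_in_progress;		/* guest-initiated FLR is ongoing */
	bool mode_cmd_pending;		/* a mode change awaits completion */
	enum dev_mode pending_mode;	/* mode requested by that command */
};

/* The guest initiates FLR on the member device. */
static void flr_start(struct member_dev *d)
{
	d->flr_in_progress = true;
}

/* The owner issues a mode change command. Returns true if the command
 * completes immediately, false if completion is deferred until the
 * ongoing FLR is done. The hypervisor never has to wait before issuing. */
static bool mode_change_cmd(struct member_dev *d, enum dev_mode m)
{
	if (d->flr_in_progress) {
		d->mode_cmd_pending = true;
		d->pending_mode = m;
		return false;
	}
	d->mode = m;
	return true;
}

/* The device finishes the FLR. Returns true if a deferred mode change
 * command was completed as a result. */
static bool flr_done(struct member_dev *d)
{
	assert(d->flr_in_progress);
	d->flr_in_progress = false;
	if (d->mode_cmd_pending) {
		d->mode = d->pending_mode;
		d->mode_cmd_pending = false;
		return true;
	}
	return false;
}
```

With this ordering, a device context read issued after the freeze command
completes can only observe the post-FLR context.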
>
> > >
> > > Alternatively, in the above sequence, when the destination sees #10, it can immediately finish the FLR as the dest device is not under FLR, treating it as a no-op.
> > >
> > > Both ways to handle are fine. (and rare in practice, but yes, its possible).
> > >
> > > I will write both the options in the device requirements.
> > >
> > > > >
> > > > > > > So, the migration driver do not need to wait for FLR to complete.
> > > > > >
> > > > > > I'm confused, you said below that device context could be changed by
> > FLR.
> > > > > >
> > > > > Yes.
> > > > > > If FLR needs to clear device context, we can have a race where
> > > > > > device context is cleared when we are trying to read it?
> > > > > >
> > > > > I didn’t say clear the context.
> > > > > FLR updates the device context.
> > > >
> > > > In what sense?
> > > >
> > > Indicating a new device context and discarding the old one.
> >
> > For example, what will queue_address have after an FLR?
> >
> All zeros.
> It probably does not matter because device context to capture this FLR notion.
> device_status shows that device is under reset after the FLR.
>
> > > I am glad you asked this. I wanted to get the basic part captured before
> > adding this optimization.
> >
> > Ok.
> >
> > > Probably it is good to add it now in the v2 as we crossed this stage now.
> > >
> > > > > Device is serving the device context read write commands, serving
> > > > > FLR, answering mode change command, So device knows the best how
> > > > > to avoid
> > > > any race.
> > > >
> > > > You want to leave those details for the vendor to figure out? If
> > > > devices know everything, why do we need device normative?
> > > >
> > > Device knows its implementation.
> > > Implementation guidelines to be in the normative.
> > > I will add it to the normative.
> > >
> > > > I see issues at least for FLR, I'm pretty sure they are others. If a
> > > > design requires us to audit all the possible conflicts between
> > > > virtio facilities and transport. It's a strong hint of layer
> > > > violation and when it happens it for sure may hit a lot of problems that are
> > very hard to find or debug thus we should drop such a design.
> > > > I suggest using the RFC tag since the next version (if there is one)
> > > > as I see it is immature in many ways.
> > > >
> > > Technical committee audits the required touch points like rest of the industry
> > committees that I participated.
> > > I disagree to your above point.
> > > If you do not want to review, that is fine.
> >
> > I don't want to hold my breath if I see something that is obviously wrong. Using
> > RFC may help people to know that it is a draft that has something to be
> > improved before it can be merged.
> >
> > > We are reviewing with other members and also contributed by them.
> > >
> > > > What's more, solving races is much easier if the device
> > > > functionality is self contained. For example, for a self contained
> > > > device with the transport as the single interface, we can leverage
> > > > from transport
> > > > (PCI) for dealing with races, arbitration, ordering, QOS etc which
> > > > is probably required in the internal channel between the owner and
> > > > the member. But all of these were missed in your series and even if
> > > > you can I'm not sure it's worthwhile to reinvent all of them.
> > > >
> > > At the end there is one physical device serving owner and member devices.
> >
> > This doesn't happen yet. For example, a VF with adminq that can be isolated
> > with PASID makes some sense.
> PASID is for the process of user space.
> Kernel space do not consume a PASID.
> Depending on the PASID can work for mediation approach only.

Why? If your proposal is justified why can't it be used with the PASID?

> Last I heard is the cpu or kernel took the support out of it as some special instruction did not work for that cpu.

This is not what I heard from the CPU vendor.

>
> >
> > > So a claim like things are on the VF hence you magically get 200% QoS
> > guarantee is myth.
> >
> > That's not my point, I'm saying VF could benefit from the e.g QOS support in
> > PCIE. I'm not saying it's perfect.
> >
> It is not the QOS support in PCIE.
> It is the restrictions to respond in N usec and burden on the device to put something in always available memory for rarely used element is wasteful.

Let's leave the QOS topic aside in the future, I'm pretty sure we're
talking about different things.

>
> > >
> > > Quoting "all of these" is also incorrect.
> > >
> > > Things added gradually, first functionally with reasonable performance,
> > followed by notion and extension for QoS.
> > > By definition of PCI transport for SR-IOV there is internal channel.
> > >
> > > It is reasonably well proposal in current form.
> > > There are few race condition that you highlight are extremely rare in nature.
> >
> > It's not rare since there's no way to know what the guest is doing.
> I will be practical. It is rare, because a production environment guest is interested in running traffic and not constantly engaging in device reset.
> And if it does, taking longer time to migrate is also fine, because guest is driving the device for production apps.
> > It's actually the critical part for live migration to be correct. You are proposing
> > migration so it must cover all those cases to make sure there is no case to make
> > your proposal a dead end.
> Yes, it should be correct.
> >
> > > Suggestions are welcome to improve.
> >
> > I have given some and I will give more.
> >
> Sure, that is very helpful as usual.
>
> > > There were couple of them by Michael too, I am addressing them in the v2.
> > >
> > > > For example, for the architecture like owner/member, if the virtio
> > > > or transport facility could be controlled via device internal
> > > > channels besides the transport, such a channel may complicate the
> > synchronization a lot.
> > > Two vendors who actually make the hw sriov devices are authoring these and
> > others are also reviewing.
> > > So I am more confident that it is solid enough.
> > > Also, a similar design has been seen with other device for more than a year
> > as GPL integrated with QEMU for a year now and with upstream kernel.
> > >
> > > > The device needs to
> > > > be able to handle or synchronize requests from both PCI and owner in
> > parallel.
> > > > They are just too many possible races and most of my questions so
> > > > far come from this viewpoint. I wouldn't go further for other stuff
> > > > since I believe I've spotted sufficient issues and that's why I must
> > > > stop at this patch before looking at the rest.
> > > It is your call to stop or progress.
> > > I find your reviews useful to improve this proposal, so I will fix them.
> >
> > My point is to make the theory correct before looking at the others as I had a
> > lot of questions (as demonstrated in this thread). I think it's not hard to
> > understand as the rest of the series are based on the theory.
> >
> > >
> > > >
> > > > Admin commands are fine if it does real administrative jobs such as
> > > > provisioning since such work is beyond the core virtio functionality.
> > > >
> > > > Again, the goal of virtio spec is to have a device with sufficient
> > > > guidelines that is easy to implement but not leave the vendors to
> > > > waste their engineering resources in figuring or fuzzing the corner cases.
> > > I have not seen an industry standard spec or a software that does not have
> > corner cases.
> >
> > Corner cases are probably not accurate. I meant, for you, it's probably a corner
> > case, but for me it's kind of obvious.
> >
> > > The spec proposal is from > 1 device vendors.
> >
> > That's good but it doesn't mean it doesn't have any (major) issues.
> > E.g vendors may choose to just implement part of the PCIE capabilities so they
> > don't do audits for the rest.
> >
> > >
> > > I will focus on more practical aspects to progress and improve this spec.
> > > >
> > > > >
> > > > > > > When admin cmd freeze the VF it can expect FLR_completed VF.
> > > > > >
> > > > > > We need to explain why and how about the resume? For example, is
> > > > > > resuming required to wait for the completion of FLR, if not, why?
> > > >
> > > > This question is ignored.
> > > >
> > > I probably missed. Sorry about it.
> > > No, the driver does not need to wait for FLR to finish to issue resume
> > > command,
> >
> > Good but I want to know if stop/freeze->active requires to wait for the
> > completion of FLR. I guess the answer is yes.
> >
> Yes.
> > > as this typically done on the destination member device which should not be
> > under FLR.
> > > I will write up the requirements further.
> > >
> > > > > > In another thread you are saying that the PCI composition is
> > > > > > done by hypervisor, so passthrough is really confusing at least for me.
> > > > > >
> > > > > I explained there what vPCI composition is done there.
> > > > > PCI config space and msix side of composition is done.
> > > > > The whole virtio interface is not composed.
> > > >
> > > > You need to describe this somewhere, no? That's what I'm saying.
> > > >
> > > Mostly not. What is not done is not written.
> > >
> > > > And passthrough is misleading here.
> > > >
> > > Passthrough is mentioned in theory of operation.
> > > It is not present in requirements section.
> > > So, it is fine.
> >
> > I suggest documenting or defining the "passthrough" methodology somewhere.
> > Michael tries to define it in another thread, if it's ok, let's use that. We can't
> > require people to read VFIO code in order to know what happens in the virtio
> > spec.
> >
> Yes. I captured in v2 in assumptions section.
>
> > >
> > > > >
> > > > > > > Ok. I assume "reset flow" is clear to you now that it points to section
> > 2.4.
> > > > > > > This section is not normative section, so using an extra word
> > > > > > > like "flow" does
> > > > > > not confuse anyone.
> > > > > > > I will link to the section anyway.
> > > > > >
> > > > > > Probably, but you mention FLR flow as well.
> > > > > As I said, not repeating the PCIe spec here. The reader knows what
> > > > > FLR of the
> > > > PCIe transport.
> > > >
> > > > Ok, I'm not a native speaker, but I really don't know the difference
> > > > between "FLR" and "FLR flow".
> > > >
> > > Lets keep it simple. I will write it as FLR, as pci transport has it as FLR.
> >
> > Ok.
> >
> Done in v2.
>
> > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > and may also undergo PCI function level
> > > > > > > > > > > +reset(FLR) flow.
> > > > > > > > > >
> > > > > > > > > > Why is only FLR special here? I've asked FRS but you
> > > > > > > > > > ignore the
> > > > question.
> > > > > > > > > >
> > > > > > > > > FLR is special to bring clarity that guest owns the VF
> > > > > > > > > doing FLR, hence
> > > > > > > > hypervisor cannot mediate any registers of the VF.
> > > > > > > >
> > > > > > > > It's not about mediation at all, it's about how the device
> > > > > > > > can implement what you want here correctly.
> > > > > > > >
> > > > > > > > See my above question.
> > > > > > > >
> > > > > > > Ok. it is clear that live migration commands cannot stay on
> > > > > > > the member device
> > > > > > because the member device can undergo device reset and FLR flows
> > > > > > owned by the guest.
> > > > > >
> > > > > > I disagree, hypervisors can emulate FLR and never send FLR to real
> > devices.
> > > > > >
> > > > > That would be some other trap alternative that needs to dissect
> > > > > the device
> > > > and build infrastructure for such dissection is not desired in the listed use
> > case.
> > > >
> > > > Do you need to trap FLR or not? You're saying the hypervisor is in
> > > > charge of vPCI, how is this differ to what you proposed? If not, how
> > > > can vPCI be composed?
> > > >
> > > Live migration driver do not need to trap FLR.
> >
> > Maybe I misunderstood your vPCI composition, but it's really helpful to
> > document how it is expected to be done.
> >
> We can probably have it in the cover letter as hypervisors may evolve and do more passthrough work than done today.
>
> > >
> > > > I believe you need to document how vpci is supposed to be done,
> > > > since I believe your proposal can only work with such specific types
> > > > of PCI composition. This is one of the important things that is missed in this
> > series.
> > > >
> > > I don’t see a need to describe vpci composition as there may be more than
> > one way to do it.
> >
> > More than one way for sure, but this contradicts what you say: you said you
> > don't trap FLR ...
> >
> Don’t trap the FLR in the live migration driver.

But trapped by other?

>
> > People like me may wonder for example why FLR is mentioned, as FLR can be
> > trapped and emulated.
> It can be, that emulation requires the knowledge of device specific things.
> This may be fine in such stack which is composing the virtio device using non virtio devices.
> >
> > Another example, when a device can be saved and restored, the hypervisor may
> > schedule the device among multiple VMs, in that case, trapping FLR is a must.
> >
> I think you meant sharing when you say scheduling.
> Such can work when the data path is also trapped.
> May be one can do using PASID and queue assignment.
> But than it is not passthrough.
> So two different use cases, in that case the whole PCI config space is fully replicated and hence, FLR never reaches to the real device.

It's pretty common; e.g., a system reset issued by a vCPU never goes to the physical CPU.

>
>
> > > What I think it is worth to describe is the whole pci device is not stored in
> > device context.
> > > I will try to add a short description around it.
> > >
> > > > >
> > > > > So your disagreement is fine for non-passthrough devices.
> > > > >
> > > > > > > (and hypervisor is not involved in these two flows, hence the
> > > > > > > admin command
> > > > > > interface is designed such that it can fullfil above requirements).
> > > > > > >
> > > > > > > Theory of operation brings out this clarity. Please notice
> > > > > > > that it is in
> > > > > > introductory section with an example.
> > > > > > > Not normative line.
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > Such flows must comply to the PCI standard and also
> > > > > > > > > > > +virtio specification;
> > > > > > > > > >
> > > > > > > > > > This seems unnecessary and obvious as it applies to all
> > > > > > > > > > other PCI and virtio functionality.
> > > > > > > > > >
> > > > > > > > > Great. But your comment is contradicts.
> > > > > > > > >
> > > > > > > > > > What's more, for the things that need to be
> > > > > > > > > > synchronized, I don't see any descriptions in this patch. And if it
> > doesn't need, why?
> > > > > > > > > With which operation should it be synchronized and why?
> > > > > > > > > Can you please be specific?
> > > > > > > >
> > > > > > > > See my above question regarding FLR. And it may have others
> > > > > > > > which I haven't had time to audit.
> > > > > > > >
> > > > > > > Ok. when you get chance to audit, lets discuss that time.
> > > > > >
> > > > > > Well, I'm not the author of this series, it should be your job
> > > > > > otherwise it would be too late.
> > > > > >
> > > > > As author, what we think, I will cover. If you have specific
> > > > > points to add value,
> > > > please share, I will look into it.
> > > >
> > > > I've pointed out sufficient issues. I have a lot of others but I
> > > > don't want to have a giant thread once again.
> > > >
> > > I see following things to improve in the requirements which I will do in v2.
> > >
> > > 1. Document race around FLR and admin commands for really rare corner
> > case.
> > > 2. Some text around not migrating the pci device registers
> > > 3. Interaction with PM commands
> > >
> > > > >
> > > > > > For example, how is the power management interaction with the
> > > > freeze/stop?
> > > > > >
> > > > > Power management is owned by the guest, like any other virtio interface.
> > > > > So freeze/stop do not interfere with it.
> > > >
> > > > I don't think this is a good answer. I'm asking how the PM interacts
> > > > with freeze/stop, you answer it works well.
> > >
> > > >
> > > > I'm not obliged to design hardware for you but figuring out the bad
> > > > design for virtio. I'm not convinced with a proposal that misses a
> > > > lot of obvious critical cases and for sure it's not my job to solve them.
> > > >
> > > I am not asking you to solve.
> >
> > My point is that, it's better for you to have some investigation on the PM
> > instead of me.
> >
> Yes, I updated the device requirement in v2.
>
> > >
> > > > I've demonstrated the possible races with FLR. So did the PM. For
> > > > example, if VF is in D3cold state, can we still read its device context?
> > > I think yes, but I will double check.
> > >  If yes, is it a violation of the PCIE spec? If not, why?
> >
> > So you are emulating the state instead of a real suspension?
> >
> No. not emulating the state. It is present in the device at PCI level under guest and member device control.
> When controlling function (owner PF) administrate the member device to freeze, it put the whole device in freeze so D3->D0 cannot happen.

What I want to ask is if the PM is exposed to guest drivers, not
whether the PF can do that.

If it isn't, we lose a very important functionality in the guest.
If it is, how is e.g. the D3 state synchronized with the
freeze/stop/device context read? And if it can't be synchronized, why not?

>
> > > No, because device context is owned by the owner device and not the VF. SR-
> > PCIM interface has defined it be outside of scope of PCIe spec.
> > >
> > > > How about other states? Can the device be freezed in the middle of
> > > > PM state transitions? If yes, how can it work without migrating PCI
> > > > states?
> > > I will double check, but unlikely, it should be similar to FLR case to keep the
> > device to avoid treating it differently.
> >
> > The reason why I see it is different from FLR is that
> >
> > 1) D3cold requires the VF to be off the power
> > 2) State transition might takes more than what FLR did, PCI seems only cover
> > the minimum delay but not maximum which may have implications for
> > downtime
> >
> D3cold is not controlled by the guest driver.

How do you know that?

Spec said:

"All Functions must support the D0 and D3 states (both D3Hot and
D3Cold). The D1 and D2 states are optional."

So the function is available in the VF. And actually VFIO has started
to support that:

commit cc2742fe3660cc6500021d3da8f937d326392dbd
Author: Abhishek Sahu <abhsahu@nvidia.com>
Date:   Mon Aug 29 17:18:49 2022 +0530

    vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT

And even D3hot has very subtle aspects that need special care:

Spec said:

"If the No_Soft_Reset bit is Clear, functional context is not required
to be maintained by the Function in the D3Hot state"

So it looks like your proposal can only work with No_Soft_Reset set,
or you need to at least document that the context is maintained in this
case, since the context might be lost if the bit is clear. Otherwise you
can't migrate the device while it is in the D3hot state.
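
As an illustration of the No_Soft_Reset concern, here is a minimal
sketch that decodes a PMCSR value. It assumes the standard PCI Power
Management layout (PowerState in bits 1:0, No_Soft_Reset in bit 3). The
helper name is hypothetical and nothing here comes from the patch series.

```python
# Illustrative only: decide whether functional context may be lost,
# based on a raw PMCSR register value read by some other means.
PM_CTRL_STATE_MASK = 0x0003     # PowerState field, bits 1:0
PM_CTRL_NO_SOFT_RESET = 0x0008  # No_Soft_Reset, bit 3
D3HOT = 0x3                     # PowerState encoding for D3hot


def context_may_be_lost(pmcsr: int) -> bool:
    """Return True if the function is in D3hot with No_Soft_Reset clear.

    In that case the function is not required to keep its functional
    context, so a device context read may not reflect a resumable state.
    """
    in_d3hot = (pmcsr & PM_CTRL_STATE_MASK) == D3HOT
    no_soft_reset = bool(pmcsr & PM_CTRL_NO_SOFT_RESET)
    return in_d3hot and not no_soft_reset
```

A hypervisor-side check along these lines would tell whether a freeze
while the guest has the VF in D3hot can be expected to preserve state.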

I believe you have much more PCI knowledge than me. I just read the
spec for a few minutes and spotted the above. I'm pretty sure it's not
hard for you to find more issues and to work out whether they can be
solved. That would be more efficient than waiting for a person like me
to point out each issue. No?

Again, it's really awkward that we need to go through everywhere in
the PCIE spec in order to prove your proposal. That's really a layer
violation in the design where we know for sure there will be more
blockers in the future.

> PM register can change D0 to D3hot.
>
> > >
> > > > Well, I meant we need a more precise definition of each state
> > > > otherwise it could be ambiguous (as I pointed above).
> > > Ok. so, few things about read and other messages, I will add.
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > In "stop" mode, the device wont process descriptors.
> > > > > > > >
> > > > > > > > If the device won't process descriptors, why still allow it
> > > > > > > > to receive
> > > > > > notifications?
> > > > > > > Because notification may still arrive and if the device may
> > > > > > > update any counters as part of
> > > > > >
> > > > > > Which counters did you mean here?
> > > > > >
> > > > > The counter that Xuan is adding and any other state that device
> > > > > may have to
> > > > update as result of driver notification.
> > > > > For example caching the posted avail index in the notification.
> > > >
> > > > A link to those proposals?
> > > [1]
> > > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00048.html
> > >
> >
> > I don't see how this is related to "posted avail index" etc.
> Driver notification counters are updated.

There's no guarantee that the counters are exactly accurate in any
case. E.g. a counter could be updated in the middle of a read. I don't
think there is any software that depends on the precise value of a
statistics counter.

>
> >
> > > > If the device must depend on those cached features to work it's
> > > > really fragile. If not, we don't need to care about them.
> > > It is not dependent.
> > > It is the infrastructure to enable it.
> > > Same for other shared memory region accesses.
> > >
> > > >
> > > > >
> > > > > > > it which needs to be migrated or store the received notification.
> > > > > > >
> > > > > > > > Or does it really matter if the device can receive or not here?
> > > > > > > >
> > > > > > > From device point of view, the device is given the chance to
> > > > > > > update its device
> > > > > > context as part of notifications or access to it.
> > > > > >
> > > > > > This is in conflict with what you said above " Device cannot
> > > > > > process the queue ..."
> > > > > >
> > > > > No, it does not.
> > > > > Device context is updated within the device without accessing the
> > > > > queue
> > > > memory of the guest.
> > > >
> > > > This is not documented or explained anywhere?
> > > >
> > > Why should it be explained?
> > > device is not accessing the guest memory -> this is mentioned in stop mode.
> >
> > Isn't it hard to see the difference between the following two?
> >
> I am not following your question.
> > 1) In stop mode, device is not accessing guest memory
> > 2) device context is updated without accessing the queue memory of the guest
> >
> Device context is read/written by the owner PF, so it does not touch the guest memory.
>
> > 1) is to define the stop mode, 2) is to define the behaviour of device context
> >
> > Or are you saying device context can only be fetched after the device is
> > stopped?
> >
> No device context can be fetched in all 3 modes.
> It is weird for the hypervisor to fetch the device context while mode transition in progress.

Ok.

>
> > > Hence, there is no need to write above.
> > >
> > > > >
> > > > > > Maybe you can give a concrete example.
> > > > > >
> > > > > The above one.
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > + the member device context
> > > > > > > > > >
> > > > > > > > > > I don't think we define "device context" anywhere.
> > > > > > > > > >
> > > > > > > > > It is defined further in the description.
> > > > > > > >
> > > > > > > > Like this?
> > > > > > > >
> > > > > > > > """
> > > > > > > >  +The member device has a device context which the owner
> > > > > > > > driver can
> > > > > > > > +either read or write. The member device context consist of
> > > > > > > > +any
> > > > > > > > device  +specific data which is needed by the device to
> > > > > > > > resume its operation  +when the device mode """
> > > > > > > >
> > > > > > > Yes.
> > > > > > > Further patch-3 adds the device context and also add the link
> > > > > > > to it in the
> > > > > > theory of operation section so reader can read more detail about it.
> > > > > > >
> > > > > > > > "Any" is probably too hard for vendors to implement. And in
> > > > > > > > patch 3 I only see virtio device context. Does this mean we
> > > > > > > > don't need transport
> > > > > > > > (PCI) context at all? If yes, how can it work?
> > > > > > > >
> > > > > > > Right. PCI member device is present at source and destination
> > > > > > > with its layout,
> > > > > > only the virtio device context is transferred.
> > > > > > > Which part cannot work?
> > > > > >
> > > > > > It is explained in another thread where you are saying the PCI
> > > > > > requires mediation. I think any author should not ignore such
> > > > > > important assumptions in both the change log and the patch.
> > > > > >
> > > > > > And again, the more I review the more I see how narrow this
> > > > > > series can be
> > > > used:
> > > > > >
> > > > > I explained this before and also covered in the cover letter.
> > > > >
> > > > > > 1) Only works for SR-IOV member device like VF
> > > > > It can be extended to SIOV member device in future.
> > > > > Today these are the only type of member device virtio has.
> > > >
> > > > That is exactly what I want to say, it can only work for the
> > > > owner/member model. It can't work when the virtio device is not
> > > > structured like that. And you missed that most of the existing
> > > > virtio devices are not implemented in this model. It means they
> > > > can't be migrated with a pure virtio specific extension. For you,
> > > > SR-IOV is all but this is not true for virtio. PCI is not the only transport and
> > SR-IOV is not the only architecture in PCI.
> > > >
> > > Each transport will have its own way to handle it.
> > > When there is MMIO owner-member relationship arise, one will be able to do
> > so as well.
> > > In fact other transports will likely miss out as they have not established such
> > pace.
> > >
> > > > And I'm pretty sure the owner/member is not the only requirement,
> > > > there are a lot of other assumptions which are missed in this series.
> > > >
> > > One proposal does not do everything.
> > > It is just impractical.
> >
> > For other assumptions, I meant:
> >
> > 1) how vpci is composed, if it can be composed as vhost, why do we need to
> > mention "passthrough"
> I am lost, above you said you want to capture how vpci is composed.
> Here you say why to mention passthrough.

I mean you need to document how vpci is composed, then people can see
if it makes sense or not.

>
> > 2) the cap/bar layout, for example if a cap shares BARs with others, it can't be
> > "passthrough", no?
> Why not, the cap exposes all the things to the guest.

What if a cap shares the same page with MSI-X? You can't let guests
program MSI-X directly, no?
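
To make the layout concern concrete, here is a small sketch that checks
whether two structures in a BAR land on a common page. The 4 KiB page
size, the (bar, offset, length) tuples and the function names are
assumptions for illustration, not anything defined by the series.

```python
PAGE_SIZE = 4096  # assumed host page size


def pages(offset: int, length: int, page_size: int = PAGE_SIZE) -> range:
    """Page indexes touched by [offset, offset + length); length must be > 0."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return range(first, last + 1)


def shares_page(a, b, page_size: int = PAGE_SIZE) -> bool:
    """a and b are (bar, offset, length) tuples.

    Returns True when both regions live in the same BAR and touch at
    least one common page, i.e. they cannot be mapped to the guest
    independently of each other.
    """
    if a[0] != b[0]:
        return False
    pa = pages(a[1], a[2], page_size)
    pb = pages(b[1], b[2], page_size)
    return pa.start < pb.stop and pb.start < pa.stop
```

If a capability and the MSI-X table share a page under this check, the
page cannot be mapped straight through and accesses have to be trapped.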

>
> >
> > >
> > > > >
> > > > > > 2) Mediate PCI but not virtio which is tricky
> > > > > > 3) Can only work for a specific BAR/capability register layout
> > > > > >
> > > > > > Only 1) is described in the change log.
> > > > > >
> > > > > > The other important assumptions like 2) and 3) are not
> > > > > > documented
> > > > anywhere.
> > > > > > And this patch never explains why 2) and 3) is needed or why it
> > > > > > can be used for subsystems other than VFIO/Linux.
> > > > > >
> > > > > Since I am not mentioning vfio now, I will refrain from mentioning
> > > > > others as well. :)
> > > >
> > > > It's not about VFIO at all. It's about to let people know under
> > > > which case this proposal could work. Otherwise if a vendor develops
> > > > a BAR/cap which is not at page boundary. How could you make it work with
> > your proposal here?
> > > >
> > > Vendor is a cloud operator which is building the device, so it will always work
> > it has the matching capabilities on source and destination.
> >
> > I meant, for example, if common_cfg shares a BAR with others but doesn't own
> > a page exclusively, you need to trap, no?
> >
> For passthrough device why such restriction?

See above.

>
> > >
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > >and device configuration space may change. \\
> > > > > > > > > > > +\hline
> > > > > > > > > >
> > > > > > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > > > > > >
> > > > > > > > > All pci devices which belong to a single guest VM are not
> > > > > > > > > stopped
> > > > > > atomically.
> > > > > > > > > Hence, one device which is in freeze mode, may still
> > > > > > > > > receive driver notifications from other pci device,
> > > > > > > >
> > > > > > > > Device may choose to ignore those notifications, no?
> > > > > > > >
> > > > > > > > > or it may experience a read from the shared memory and get
> > > > > > > > > garbage
> > > > > > data.
> > > > > > > >
> > > > > > > > Could you give me an example for this?
> > > > > > > >
> > > > > > > Section 2.10 Shared Memory Regions.
> > > > > >
> > > > > > How can it experience a read in this case?
> > > > > >
> > > > > MMIO read/write can be initiated by the peer device while the
> > > > > device is in
> > > > stopped state.
> > > >
> > > > Ok, but what I want to say is how it can get the garbage data here?
> > > >
> > > If the device mode is changed to freeze while it is being read by the peer
> > device, it can get garbage data or last data.
> > > Which may not be the one that is expected.
> > > So first all the initiator devices are stopped, ensure that they do not make any
> > requests.
> > >
> > > And there are requests, which gets proper answer.
> >
> > Ok.
> >
> > >
> > > > >
> > > > > > Btw, shared regions are tricky for hardware.
> > > > > >
> > > > > > >
> > > > > > > > > And things can break.
> > > > > > > > > Hence the stop mode, ensures that all the devices get
> > > > > > > > > enough chance to stop
> > > > > > > > themselves, and later when freezed, to not change anything
> > internally.
> > > > > > > > >
> > > > > > > > > > > +0x2   & Freeze &
> > > > > > > > > > > + In this mode, the member device does not accept any
> > > > > > > > > > > +driver notifications,
> > > > > > > > > >
> > > > > > > > > > This is too vague. Is the device allowed to be freezed
> > > > > > > > > > in the middle of any virtio or PCI operations?
> > > > > > > > > >
> > > > > > > > > > For example, in the middle of feature negotiation etc.
> > > > > > > > > > It may cause implementation specific sub-states which
> > > > > > > > > > can't be
> > > > migrated easily.
> > > > > > > > > >
> > > > > > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > > > > > It is passthrough device, hence hypervisor layer do not
> > > > > > > > > get to see sub-
> > > > > > state.
> > > > > > > > >
> > > > > > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > > > > > The device context already covers this sub-state.
> > > > > > > >
> > > > > > > > 1) driver writes driver_features
> > > > > > > > 2) driver sets FEATURES_OK
> > > > > > > >
> > > > > > > > 3) device receive driver_features
> > > > > > > > 4) device validating driver_features
> > > > > > > > 5) device clears FEATURES_OK
> > > > > > > >
> > > > > > > > 6) driver read stats and realize FEATURES_OK is being
> > > > > > > > cleared
> > > > > > > >
> > > > > > > > Is it valid to be frozen of the above?
> > > > > > > No. device mode is frozen when hypervisor is sure that no more
> > > > > > > access by the
> > > > > > guest will be done.
> > > > > >
> > > > > > How, you don't trap so 1) and 2) are posted, how can hypervisor
> > > > > > know if there's inflight transactions to any registers?
> > > > > >
> > > > > Because hypervisor has stopped the vcpus which are issuing them.
> > > >
> > > > MMIO are posted. vCPU is stopped but the transactions are inflight.
> > > > How could the hypervisor/device know if there's any inflight PCIE
> > > > transactions here? So I can imagine what happens in fact is the TLP
> > > > for freezing is ordered with the TLP for posted MMIO. This is
> > > > probably guaranteed for typical PCIE setup but how about the relaxed
> > ordering?
> > >
> > > Vcpus do not generated relaxed ordering MMIOs.
> > > In pci spec: " If this bit is Set, the Function is permitted to set
> > > the Relaxed Ordering bit in the Attributes field of transactions it initiates".
> > >
> > > Function initiates RO requests, not the vcpu.
> > > Hence, it is fine.
> > >
> >
> > Ok.
> >
> > > > >
> > > > > > > What can happen between #2 and #3, is device mode may change to
> > stop.
> > > > > >
> > > > > > Why can't be freezed in this case? It's really hard to deduce
> > > > > > why it can't just from your above descriptions.
> > > > > >
> > > > > On the source hypervisor, the mode changes are active->stop->freeze.
> > > > > Hence when freeze is done, the hypervisor knows that all inflight
> > > > > has been
> > > > stopped by now.
> > > >
> > > > Ok, but how about freezing between 3) and 4). If we allow it, do we
> > > > need to migrate to this state? If yes, how can it work with your
> > > > device context? If not, shouldn't we document this?
> > > >
> > > May be, some of these are implementation details. I am not sure it belongs to
> > spec.
> >
> > The point is to make sure that your device context covers this case.
> > If it can't be covered, it's a design defect.
> >
> > > Like RSS update while packets are received.. such implementation details are
> > not part of the spec.
> >
> > This is definitely different, the driver can choose to synchronize or the end user
> > can tolerate the possible out of order packets in this case.
> >
> Right but it is not defined in the spec.
>
> > This is not the case here, if freezing between 3) and 4) is allowed, your current
> > device context can't cover this case and guests can't tolerate such kinds of
> > errors after migration for sure.
> >
> Ok. I will add the text around this in v3.
>
> > >
> > > > >
> > > > > > Even if it had, is it even possible to list all the places where
> > > > > > freezing is prohibited? We don't want to end up with a spec that
> > > > > > is hard to implement or leave the vendor to figure out those tricky parts.
> > > > > >
> > > > > The general idea is not prohibiting the freeze/stop mode.
> > > > > If the device needs more time, let device take time to do it.
> > > >
> > > > Ok, it means:
> > > >
> > > > 1) there're conditions from stop to freeze, then what are they?
> > > No, there isn’t condition.
> > > May be I didn’t follow the question.
> >
> > E.g under which condition could the device change the status from active to
> > stop etc. That's something I keep asking with a concrete example (e.g FLR).
> >
> Device mode is changed by the driver from active to stop. This is the admin mode.
> FLR do not change the mode to stop/freeze because it is guest driver controlled operational state of the device.
>
> > > > 2) how much time at most? E.g FLR takes at most 100ms.
> > > From the driver side, it is 100msec for device side it can be less too.
> > > As soon as FLR is done or enough to record it, is done, stop can continue.
> > >
> > > > 3) If it needs more time, can this time satisfy the downtime requirement?
> > > >
> > > Guest VM for all practical purposes is not busy in doing FLR, it is a corner
> > case, yet we have to cover it.
> >
> > Corner case in what sense? A loop in a simple shell script can trigger this easily.
> >
> Sure.
> Which is not the practical application of the guest VM.
> Hence, it is corner case.

That reasoning is wrong. You can't assume how the guest behaves; that
is a basic assumption for virtualization and security. Otherwise you
will end up with endless bugs that are hard to fix.

>
> > > And yes, it satisfy the downtime requirements, because VM is already not
> > interested in the packets, it is busy doing the FLR.
> >
> > Well, it has subtle differences. VM may have more than one interface, just one
> > of the interfaces is doing FLR.
> >
> Sure, it can do. But in that case the time of that VF stop is not critical.
> VM is busy in non-critical work.
>
> > >
> > > > >
> > > > >
> > > > > > > And in stop mode, device context would capture #5 or #4,
> > > > > > > depending where is
> > > > > > device at that point.
> > > > > > >
> > > > > > > > >
> > > > > > > > > > And what's more, the above state machine seems to be
> > > > > > > > > > virtio specific, but you don't explain the interaction
> > > > > > > > > > with the device status state
> > > > > > > > machine.
> > > > > > > > > First, above is not a state machine.
> > > > > > > >
> > > > > > > > So how do readers know if a state can go to another state and when?
> > > > > > > >
> > > > > > > Not sure what you mean by reader. Can you please explain.
> > > > > >
> > > > > > The people who read virtio spec.
> > > > > >
> > > > > So question is "how reader knows if a state can go to another
> > > > > state and
> > > > when"?
> > > > > It is described and listed in the table, when a mode can change.
> > > >
> > > > It's not only "if" but also "when". Your table partially answers the
> > > > "if" but not "when". I think you should know now the state
> > > > transition is conditional. So let's try our best to ease the life of the vendor.
> > > What do you mean when?
> > > I do not understand that "mode change is conditional"? it is not based on the
> > condition.
> > > [..]
> >
> > See above.
> >
> > >
> > > > > > Let's define the synchronization point first. And it
> > > > > > demonstrates at least devices need to synchronize between the
> > > > > > free/stop and virtio device status machine which is not as easy as what
> > is done in this patch.
> > > > > >
> > > > > Synchronization point = device.
> > > >
> > > > This is obvious as we can't rule stuff outside virtio, and we are
> > > > talking about devices not drivers here. But the spec needs
> > > > sufficient guidance/normative for the vendor to implement. It's more
> > > > than just saying "device is synchronization point".
> > > >
> > > The requirements are already covering what device needs to do.
> > > Some interaction points are missing, as I acked above, I will add them.
> > >
> > > [..]
> > > > > > Until virtio reset, this is how virtio works now. I've pointed
> > > > > > out that it may cause extra troubles when trying to resume, but
> > > > > > you don't tell me what's wrong to keep that?
> > > > > >
> > > > > If kept, hypervisor may not be able to decide when to change the
> > > > > mode from
> > > > active->stop.
> > > >
> > > > Why? It is simply done when mgmt requires a migration?
> > > >
> > > Mgmt is bit higher level entity. Underneath the software layers may wait until
> > the time is right to migrate.
> >
> > I don't understand, anyhow the migration request could not be sent to the
> > device directly without the assistance in hypervisor.
> >
> > > The fundamental point is, the device context is expected to return the
> > incremental value, that is changed content from last time.
> > > So once all changed content is read, its empty.
> >
> > You can't easily define an incremental value for all types of states or structures:
> >
> > 1) device with complicated states like RAM or other
> > 2) the device state has complicated data structures
> >
> What I parse is, that device context is complicated structure.
> So it will be defined incrementally as it becomes more mature.

Incremental reads only make sense when

1) The state is giant, like RAM
2) There's a requirement to live migrate the state, e.g. dirty pages or
blocks from local devices
3) The giant state can converge after a limited number of rounds

All of the above seem to be device type specific, which is not the case
for a general device context.

>
> > >
> > > > What's more important, PCI allows multiple common_cfgs. So the
> > > > hypervisor can choose to reserve one common_cfg for live migration.
> > > > In this case we don't have to read to clear semantics.
> > > Common_cfg does not serve large device context, nor it serves DMA.
> >
> > Well, I'd think e.g the address of the descriptor table is part of the device
> > context, and it can be read some common_cfg.
> It can be read but we are talking about not saving 64 VQs tables,  and RSS, flow filters, statistics all in some common config registers.

The problem is that what you define as the incremental device context
conflicts with what already exists.

You say the device context is incremental, but I gave you one example
where it isn't.

>
> >
> > >
> > > >
> > > > Or, are you saying the value read from common_cfg is not device context?
> > > The value of common config is part of the device context that represents
> > current common config.
> > >
> > > > Isn't this conflict with your vague definition of device context?
> > > >
> > > You mentioned you stop at this patch,
> >
> > Stop means stopping comment.
> >
> > > so likely you didn’t read device context patch, hence you quote it vague.
> > > So I don’t know what you mean by vague.
> >
> > So in this patch you define device context as:
> >
> > "The member device context consist of any device specific data which is needed
> > by the device to resume its operation"
> >
> > So the address of the descriptor table satisfy this definition? If not, why?
> >
> Address of the descriptor table is part of device context.

Ok, see my above reply.

>
> > > Please let me know what you additional thing you want to see in device
> > context after you reach that patch.
> > >
> > >
> > > > > We can opt for a mode where full device context is read in each
> > > > > mode
> > > > without clearing it.
> > > > > But than it can be very specific to a version of qemu, which we
> > > > > are avoiding it
> > > > here.
> > > > >
> > > > > > > 2. device context returns incremental value from the previous
> > > > > > > read. So, it
> > > > > > needs to clear it.
> > > > > >
> > > > > > I don't understand here. This is not the case for most of the devices.
> > > > > >
> > > > > Not sure which devices you mean here with "most of the devices".
> > > > > Device context functions like a write record pages (aka dirty pages).
> > > >
> > > > It's definitely different. We want to migrate dirty pages lively
> > > > which can consume a lot of bandwidth. So reporting delta makes a lot
> > > > of sense here since it would have a lot of rounds of syncing and it
> > > > doesn't result in blockers resuming.
> > > >
> > > Write records are reported as delta from the previous read.
> > >
> > > > For device context, how many rounds of syncing did you expect, and
> > > > if we have N rounds, we need to restore N rounds in order to resume?
> > > > Do you want to live migrating device states? If it's only 1 or 2 rounds, why
> > bother?
> > > >
> > > Live migrate the device context. Typically in current software using it, it is 2
> > rounds.
> >
> > If it's just 2 rounds, why bother with a delta? It is only helpful if we want to live
> > migrate some device with giant states over several rounds, and in that case we
> > should leave it as a device specific state.
> >
> The number of rounds matter less. The number of things a device needs to setup is a lot.

How many bytes of context do you expect for e.g. virtio-net?

> And a VM may have many devices so just like pre-copy of dirty pages, it is an extension for the device state (context) to pre-copy.

The hypervisor can choose to do things like pre-heat, but even in this
case a delta is not a must for a device unless it has giant state that
needs to be live migrated.
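
A rough sketch of what delta semantics imply for the destination: every
round read on the source has to be kept and replayed in order. Modelling
each round as a dict of field name to value is purely an assumption for
illustration; the field names below are hypothetical.

```python
def merge_rounds(rounds):
    """Rebuild a full context from incremental reads, oldest first.

    Later rounds override fields reported by earlier ones. Dropping any
    round loses the fields that only that round reported.
    """
    ctx = {}
    for delta in rounds:
        ctx.update(delta)
    return ctx
```

With a small context that is read in full each time, only the last read
is needed and this bookkeeping disappears.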

>
> > > The interface is generic that if needed more rounds are possible.
> > >
> > > Even device for most practical purpose will implement 2 rounds.
> > >
> > > > And for the delta, how do you know you can easily define deltas for
> > > > every type of device, especially the ones with complicated internal
> > > > states? Defining states has already been demonstrated as a
> > > > complicated task for some devices like virtio-FS and you want to complicate
> > it furtherly?
> > > >
> > > What is your question? If you say virtio-fs is complicated state, may be it
> > should not have existed itself in the virtio spec as first place.
> >
> > We have just more than FS that can't work for live migration. Crypto and GPU
> > are two other examples, and I'm pretty sure we have more.
> >
> Industry has migrated gpu, rdma, nvme, virtio-net and virtio-blk devices, and suspended/resumed gpu devices too.
> So I can imagine it will happen as wider devices adopt the device context.

How many of the devices you mentioned above have an open standard? Do
you know how many types of devices virtio supports? We are talking
about different things. Once you implement virtio-(gp)gpu/rdma/nvme,
then we can talk here.

What I want to say here is, from the virtio point of view, it's not
promising if you can just migrate net/blk. You know, we only need very
little extension then we can migrate net and block.

>
> > Until we figure out how they can, we can't say a device context work for all
> > types. No?
> >
> No.

Why?

> > 1) Trying to define a format that works for all types of devices
> > 2) Leave the states to be defined by individual device types
> >
> > Which method is easy?
> >
> Best of both. i.e. generic fields for generic virtio items like vq config, common config, device config area.
> And device specific context fields like rss, flow filters, counters.

This conflicts with your definition of device context, and that is
the source of the misunderstanding.

>
> > > But I differ to think that.
> > > Virtio-fs guest side state wont be changed as part of it.
> > > Virtio-fs is the first device which has considered and listed to migrate the
> > device state.
> > > So it should be possible.
> >
> > I won't repeat the discussion of virtio-FS migration here, you can search the
> > archives for more details.
> >
> > But the point is obvious, it's really hard to say a simple device context can work
> > for all types of devices. We should allow device specific state definitions. This
> > seems to be agreed by Michael and LingShan.
> >
> Sure. I covered this in v2 at [2].
> Device specific state definitions will be able to grow.

Great.

Thanks



>
> [2] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html
>
> > >
> > > > What is proposed in this series is an ad-hoc optimization for a
> > > > specific device type within a specific subsystem (e.g. VFIO) in a
> > > > specific operating system, which is not general.
> > > >
> > > Oh now you mention vfio. Not me. :)
> > >
> > > I am not going to comment on this. It is not ad-hoc.
> >
> > You need to justify how it is not. Based on the current discussion, you have
> > demonstrated a lot of assumptions in order to make your proposal work.
> >
> Listed in v2 now.
>
> > > It uses similar dirty page tracking like technique present in cpu hw and other
> > devices.
> > >
> > > > As demonstrated many times, starting from something simple and stupid
> > > > is the easiest way.
> > > >
> > >
> > > > > Whatever is already returned is/should not be repeated in
> > > > > subsequent reads,
> > > > though device can choose to do so.
> > > > >
> > > > > > >
> > > > > > > > > And which software stack may find this useful?
> > > > > > > > > Is there any existing software that can utilize it?
> > > > > > > >
> > > > > > > > Libvirt.
> > > > > > > >
> > > > > > > Does libvirt restore on migration failure?
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > Ok. the device will be able to resume when it is marked active.
> > > > > The device context returned  is the incremental delta as explained above.
> > > >
> > > > I disagree, see my above reply.
> > > I replied above.
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > > > Why that device context present with the software
> > > > > > > > > vanished, in your
> > > > > > > > assumption, if it is?
> > > > > > > > >
> > > > > > > > > > > Typically, on
> > > > > > > > > > > +the source hypervisor, the owner driver reads the
> > > > > > > > > > > +device context once when the device is in
> > > > > > > > > > > +\field{Active} or \field{Stop} mode and later once
> > > > > > > > > > > +the member device is in
> > > > > > \field{Freeze} mode.
> > > > > > > > > >
> > > > > > > > > > Why need the read while device context could be changed?
> > > > > > > > > > Or is the dirty page part of the device context?
> > > > > > > > > >
> > > > > > > > > It is not part of the dirty page.
> > > > > > > > > It needs to read in the active/stop mode, so that it can
> > > > > > > > > be shared with
> > > > > > > > destination hypervisor, which will pre-setup the complex
> > > > > > > > context of the device, while it is still running on the source side.
> > > > > > > >
> > > > > > > > Is such a method used by any hypervisor?
> > > > > > > Yes. qemu which uses vfio interface uses it.
> > > > > >
> > > > > > Ok, such software technology could be used for all types of
> > > > > > devices, I don't see any advantages to mention it here unless it's unique
> > to virtio.
> > > > > >
> > > > > It is theory of operation that brings the clarity and rationale.
> > > >
> > > > I think it's not. Since it's not something that is unique to virtio.
> > > >
> > > > > So I will keep it.
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > > +
> > > > > > > > > > > +Typically, the device context is read and written one
> > > > > > > > > > > +time on the source and the destination hypervisor
> > > > > > > > > > > +respectively once the device is in \field{Freeze} mode.
> > > > > > > > > > > +On the destination hypervisor, after writing the
> > > > > > > > > > > +device context, when the device mode set to
> > > > > > > > > > > +\field{Active}, the device uses the most recently set
> > > > > > > > > > > +device context and resumes the device
> > > > > > > > > > operation.
> > > > > > > > > >
> > > > > > > > > > There's no context sequence, so this is obvious. It's
> > > > > > > > > > the semantic of all other existing interfaces.
> > > > > > > > > >
> > > > > > > > > Can you please say which existing interfaces you mean here?
> > > > > > > >
> > > > > > > > For any common cfg member. E.g queue_addr.
> > > > > > > >
> > > > > > > > The driver wrote 100 different values to queue_addr and the
> > > > > > > > device used the value written last time.
> > > > > > > >
> > > > > > > o.k. I don’t see any problem in stating what is done, which is
> > > > > > > less vague. 😊
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > +
> > > > > > > > > > > +In an alternative flow, on the source hypervisor the
> > > > > > > > > > > +owner driver may choose to read the device context
> > > > > > > > > > > +first time while the device is in \field{Active} mode
> > > > > > > > > > > +and second time once the device is in \field{Freeze}
> > > > > > > > > > mode.
> > > > > > > > > >
> > > > > > > > > > Who is going to synchronize the device context with
> > > > > > > > > > possible configuration from the driver?
> > > > > > > > > >
> > > > > > > > > Not sure I understand the question.
> > > > > > > > > If I understand you right, do you mean that, When
> > > > > > > > > configuration change is done by the guest driver, how does
> > > > > > > > > device
> > > > context change?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes.
> > > > > > > >
> > > > > > > > > If so, device context reading will reflect the new configuration.
> > > > > > > >
> > > > > > > > How do you do that? For example:
> > > > > > > >
> > > > > > > > > static inline void vp_iowrite64_twopart(u64 val,
> > > > > > > > >                                         __le32 __iomem *lo,
> > > > > > > > >                                         __le32 __iomem *hi)
> > > > > > > > > {
> > > > > > > > >         vp_iowrite32((u32)val, lo);
> > > > > > > > >         vp_iowrite32(val >> 32, hi);
> > > > > > > > > }
> > > > > > > >
> > > > > > > > Is it ok to be frozen in the middle of the two vp_iowrite32() calls?
> > > > > > > >
> > > > > > > Yes. the device context
> > VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> > > > > > section captures the partial value.
> > > > > >
> > > > > > There's no way for the device to know whether or not it's a
> > > > > > partial value or
> > > > not.
> > > > > > No?
> > > > > >
> > > > > Device does not need to know, because when the guest vm and the
> > > > > device are resumed on the destination, the guest vm will continue
> > > > > with writing the 2nd part.
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > Similarly, on the
> > > > > > > > > > > +destination hypervisor writes the device context
> > > > > > > > > > > +first time while the device is still running in
> > > > > > > > > > > +\field{Active} mode on the source hypervisor and
> > > > > > > > > > > +writes the device context second time while the
> > > > > > > > > > > +device is in
> > > > > > > > > > \field{Freeze} mode.
> > > > > > > > > > > +This flow may result in very short setup time as the
> > > > > > > > > > > +device context likely have minimal changes from the
> > > > > > > > > > > +previously written device
> > > > > > > > context.
> > > > > > > > > >
> > > > > > > > > > Is the hypervisor who is in charge of doing the
> > > > > > > > > > comparison and writing only the delta?
> > > > > > > > > >
> > > > > > > > > The spec commands allow to do so. So possibility exists
> > > > > > > > > from spec
> > > > wise.
> > > > > > > >
> > > > > > > > There are various optimizations for migration for sure, I
> > > > > > > > don't think mentioning any specific one is good.
> > > > > > > >
> > > > > > > The text is informative text similar to,
> > > > > > >
> > > > > > > " However, some devices benefit from the ability to find out
> > > > > > > the amount of available data in the queue without accessing
> > > > > > > the virtqueue in
> > > > > > memory"
> > > > > > >
> > > > > > > " To help with these optimizations, when
> > > > > > > VIRTIO_F_NOTIFICATION_DATA has
> > > > > > been negotiated".
> > > > > > >
> > > > > > > Is this the only optimization in virtio? No, but we still
> > > > > > > mention the rationale of
> > > > > > why it exists.
> > > > > >
> > > > > > The above is a good example as it explain
> > > > > > VIRTIO_F_NOTIFICATION_DATA is the only way without accessing the
> > > > > > virtqueue. But this is not the case of
> > > > migration.
> > > > > > You said it's just a possibility but not a must which is not the
> > > > > > case for VIRTIO_F_NOTIFICATION_DATA.
> > > > > >
> > > > > It is one of the optimization apart. The comparison is of
> > > > > one_of_example or
> > > > not.
> > > >
> > > > I don't get this.
> > > Theory of operation is describing a flow how things are done and how the
> > constructs are helpful to achieve it.
> >
> > Immature optimization doesn't belong in the theory, for sure. I see your delta
> > reporting as immature in many ways. That's the point.
> >
> > Thanks
> >
> > > And it is not the end of the list.
> > > That does not mean one should not write those.
>
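
For readers following the vp_iowrite64_twopart() exchange quoted above, here
is a minimal C sketch of the claim being debated: that a raw snapshot of the
two 32-bit halves lets the guest finish the write on the destination. The
names cfg_regs, save_context and migrate_mid_write are hypothetical and are
not taken from the spec or the patches. The model only copies register
contents; it says nothing about whether a device may act on a half-written
value, which is the concern raised in the quoted reply.

```c
#include <stdint.h>

/* Hypothetical model: a 64-bit config field exposed as two 32-bit registers. */
struct cfg_regs {
    uint32_t lo;
    uint32_t hi;
};

/* The two halves of a split write, mirroring vp_iowrite64_twopart(). */
static void write_lo(struct cfg_regs *r, uint64_t val) { r->lo = (uint32_t)val; }
static void write_hi(struct cfg_regs *r, uint64_t val) { r->hi = (uint32_t)(val >> 32); }

/* Assumed behaviour: the saved context is a raw copy of the registers. */
static struct cfg_regs save_context(const struct cfg_regs *src) { return *src; }

static uint64_t read_field(const struct cfg_regs *r)
{
    return ((uint64_t)r->hi << 32) | r->lo;
}

/* Freeze between the two halves, restore on the destination, finish there. */
static uint64_t migrate_mid_write(uint64_t old_val, uint64_t new_val)
{
    struct cfg_regs src = { (uint32_t)old_val, (uint32_t)(old_val >> 32) };
    struct cfg_regs dst;

    write_lo(&src, new_val);      /* guest writes the low half on the source */
    dst = save_context(&src);     /* context carries the partial value */
    write_hi(&dst, new_val);      /* guest resumes and writes the high half */
    return read_field(&dst);
}
```

In this model the destination ends up with the full new value, but between
the two writes the field holds a mix of old and new halves.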


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-18  4:30                                 ` Parav Pandit
  2023-10-18  6:14                                   ` Michael S. Tsirkin
@ 2023-10-19  2:41                                   ` Jason Wang
  2023-10-19  4:29                                     ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-19  2:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 12:30 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, October 18, 2023 6:23 AM
>
> > Why can suspend go directly from guest to device then?
> >
> Because all the virtio registers are treated equally by the live migration driver.
> So why not?

Good, so you agree that if a new status bit is introduced then it can
work with passthrough? So would a register for indices.

> As explained device synchronizing all the operations which are not mediated by VMM.
>
> If somehow you claim that all the synchronization is possible _only_ in software in various mediation layers,
> and that it is impossible in a single place in the device, then I do not agree.
> V2 listed most of the synchronization points of the device.
>
> > > When such device reset is done, it does not reset the device context, nor it
> > clears the dirty page records, because they are done by the controlling function.
> > >
> > > > I'm not convinced that the scalability is broken by just having 2 or
> > > > 3 more registers. We all know MSIX requires much more than this.
> > > >
> > > Saying there is some other high resource consumer method exists, so lets
> > consume more in new interface we do now, is not good approach.
> >
> > This is self-contradictory and double standard. You allow #MSI-X vectors to
> > grow but not config?
> >
>
>
> > > MSI-X on its v2 is underway. Hopefully it will be finished this year, which is
> > already cutting down O(N) resources.
> >
> > Why couldn't such a method be applied to config registers and others?
> >
> Because config registers are not de-duplicating type.
> Meaning each config register is unique in nature.

Unique in nature from the view of the device logic, but not from that of a transport like PCIe.

> For such work a queue/dma approach is taken.
> So it is applied to config registers and others to not place as _always_available registers.
> We just need to do that in virtio too.
> All the recent work of flow filters, counters no longer rely on the config registers as we both agreed in discussion [1] with your comment

Yes, but it serves a different purpose. CVQ can't work before probing,
no? Or are you saying you want to trap CVQ for migration?

>
> "Adding cvq is much easier than inventing(duplicating) the work of a transport."
>
> [1] https://lore.kernel.org/virtio-comment/CACGkMEseZeT4VX8Ut-7KraxLKNOMKOgFDNxqKofXSFT8yHfg-w@mail.gmail.com/#t
> > >
> > > > You can't solve all the issues in one series. As stated many times,
> > > I am not solving all issues in one series. This series builds the infrastructure.
> >
> > It's you that is raising the scalability issue, and what you've ignored is that such
> > an "issue" has existed for many years and various hardware has been built on
> > top of that.
> >
> That does not mean one should continue with such issue.

Then how many MSI-X vectors do you expect to have for each VF? Why did
you choose that value? Why forbid a vendor to have more than that?

> And that hardware consumes more power and memory that results in overall device inefficiency.
> You should have objected the IMS patches in Linux kernel, you should also object new MSI-X proposal and say just use registers.

Actually the reverse, why can't we use something similar to IMS? IMS
allows storage in arbitrary places, no?

>
> > Again, we should make sure the function is correct before we can talk about
> > others, otherwise it would be an endless discussion.
> >
> And using registers is not the way to make it correct.

Correct in what sense? You are actually

1) doing a complete NACK of the existing PCI transport design
2) preventing new features from being developed based on the maturity of a
software device implementation
3) tying all new features to the owner/member structure, which doesn't
exist and is unnecessary for transports other than PCI

> Lets make sure that basic function for the member device to first level is correct.

Isn't this what I'm currently doing?

>
>
> > >
> > > > if you really
> > > > care about the scalability, the correct way is to behave like a real
> > > > transport through virtqueue instead of trying to duplicate the
> > > > functionality of transport virtqueue slowly.
> > > >
> > > This will be impossible because no device will transport driver notifications
> > using a virtqueue.
> > > Therefore, virtqueue is not some generic transport that does everything - as
> > simple as that. hence there is no transport virtqueue.
> >
> > You won't get such a wrong conclusion if you read that proposal.
> I have read those 4 or 5 patches posted by Lingshan and showed you that time that driver notifications are not coming via virtqueue.
> And if I missed it, and if they are coming via virtqueue, it does not meet the performance and "efficiency principle" from the paper you pointed.

Good, and if you read the patches, you should know they allow transport
specific notification. Or did you mean:

MMIO device notification is not coming via the MMIO, so "MMIO"
transport is not efficient?

>
> >
> > >
> > > And virtqueue for bulk data transfer exists so no need to invent yet another
> > thing without a good reason.
> >
> > I don't understand why this is related to transport virtqueue anyhow, it's also a
> > queue interface, no?
> Transport virtqueue is diversion of unrelated topic here.

You're talking about scalability, and you are saying registers are the
major blocker, but you stick to registers for guest to use and think
it is scalable.

How can I understand your point from the self-contradictory statement above?

My point is simple, if your complaint about registers is true and if
you really care about scalability, you should not use any register for
virtio at all, and then you will end up with the transport virtqueue.

>
> A guest vm driver must be able to talk to the member device for all queue configuration etc through its own channel not mediated by the hypervisor.

You just invented a legacy tunnel which requires mediation, no?

> Otherwise such plumbing does not work for any confidential compute workload.

I'm pretty sure your conclusion here is wrong. Let's not keep raising
unrelated issues here, unless you don't want this series to converge. Let's
open a new thread with the confidential computing (CC) folks if you wish.

> Hence, I wouldn’t discuss transport virtqueue for now.

Again, if you don't want to talk about transport virtqueue, that's
fine. But let's leave the scalability issue aside as well.

Thanks



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  5:28                                   ` Parav Pandit
@ 2023-10-19  2:41                                     ` Jason Wang
  0 siblings, 0 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-19  2:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 1:28 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, October 18, 2023 6:23 AM
> >
> > On Tue, Oct 17, 2023 at 11:46 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, October 17, 2023 7:41 AM
> > > >
> > > > On Fri, Oct 13, 2023 at 2:40 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Friday, October 13, 2023 6:48 AM
> > > > > >
> > > > > > On Thu, Oct 12, 2023 at 7:37 PM Parav Pandit <parav@nvidia.com>
> > > > > > wrote:>
> > > > > > > As Michael said, software based nesting is used..
> > > > > >
> > > > > > I've pointed out in another thread when hardware has less
> > > > > > abstraction level than nesting, trap/emulation is a must.
> > > > > >
> > > > > > > See if actual hw based devices can implement it or not. Many
> > > > > > > components of
> > > > > > cpu cannot do N level nesting either, but may be virtio can.
> > > > > > > I don’t know how yet.
> > > > > >
> > > > > > I would not repeat the lessons given by Gerald J. Popek and Robert P.
> > > > > > Goldberg[1] in 1976, but I think you miss a lot of fundamental
> > > > > > things in the methodology of virtualization.
> > > > > Weekend is coming. I will read it.
> > > > >
> > > > > > For example, nesting is a very important criteria to examine
> > > > > > whether an architecture is well designed for virtualization.
> > > > > >
> > > > >
> > > > > In my reading of a leading OS vendor's documentation, I learnt that the
> > > > > OS vendor does not recommend nested virtualization for production at [1].
> > > > > Snippet:
> > > > > "In addition, Red Hat does not recommend using nested
> > > > > virtualization in
> > > > production user environments, due to various limitations in functionality.
> > > > Instead, nested virtualization is primarily intended for development
> > > > and testing scenarios."
> > > > >
> > > > > [1]
> > > > > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_virtualization/creating-nested-virtual-machines_configuring-and-managing-virtualization
> > > > >
> > > > > 2nd leading hypervisor listed nested virtualization to be not used
> > > > > for
> > > > "performance sensitive applications".
> > > >
> > > > Another concept shift.
> > > >
> > > > I'm not going to comment on the choice for individual distros. But
> > > > the points are whether we can deploy a nesting virtualization easily
> > > > under a specific hardware architecture. In this regard, the above is a good
> > example.
> > > >
> > > And most of such nesting seems for non production use, helpful for
> > debugging and more.
> >
> > I'm asking you to google, but you refuse to spend 1 minute to do that but
> > are spending several days debating this fact:
> >
> > https://cloud.google.com/compute/docs/instances/nested-virtualization/overview
> >
> > Please don't waste the time of both of us.
>
> I showed the link of Redhat and another one is Hyper-V.
> You showed link of google cloud.

The reason is that Red Hat is not offering IaaS, so an IaaS vendor
should be more convincing.

And,

Firstly, it's not about whether or not it is used in a production
environment. It usually takes many years before new technologies land
in production. For example, we all know it may take time for adminq to
be production ready, no?
Secondly, the link you showed me is for RHEL8, which is at the end of its
life cycle? It is supported in RHEL9 AFAIK, and on more than
just x86.
Lastly, where is the Hyper-V link you've mentioned above? Google gives
me a good link for Azure:
https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/user-guide/nested-virtualization

>
> There are no representatives from Google and Microsoft here to support nested here in the discussion.

Shouldn't we trust the link published from the official website of a
cloud vendor? We can't involve people from every aspect here.

I meant, why not just play with a nested VM? It won't take you too much
time and you will see how seamless it is now (and actually it
has been there for many years).

>
> I assume you as part of Redhat show some production use, but in public documentation of Redhat it said non production.
>
> Regardless, I want to emphasize that I am not against the use case of nested.

This is not what I read from you. Or why do you keep wanting me to
prove it has real use but refuse to do a simple Google search?

>
> I am highlighting that any L2 nesting involves today hw emulation in the ecosystem.

How do you define emulation? Does emulation mean it needs to emulate
everything? If yes, why don't you start using L1?

> If this is incorrect, please point to the datasheet. (not user documentation at high level).
>
> And for L2 nesting, virtio doing hw emulation is fine to me.
> And one wants to improve that too, lets have the proper nested VF.
> Lets discuss in other thread, where you have many questions.

While I don't see why we need to discuss it here, the nested issues
were raised for this series in this thread.

>
>
> >
> > >
> > > And the nesting is not working without trap + emulation for > 2 level of
> > nesting outside of virtio as far as I understand.
> >
> > Read the above link.
> >
> > > Like Intel PML. How many levels of nesting is done by hw for PML?
> > >
> > > > Again, just a simple google will tell you the instances that support
> > > > nesting have been available for almost all the major cloud vendors for a
> > while.
> > > >
> > > From cpu data sheets, it does not appear that hw is able to do such nesting.
> >
> > For PML, it's up to the CPU vendor to consider a good way to be self virtualized.
> > If it's not, it's a design defect. This is not the place to discuss the design choice
> > of a specific CPU vendor, if you are really interested in this, you can go back in
> > the archive to figure out why AMD nesting is done much earlier than Intel.
>
> In the google link you posted, I read "VMs powered by AMD processors are not supported".
> I wish they should have been able to utilize it.

How is it related to the discussion here? A good design may simplify
nesting a lot, that's the point. A specific vendor hardware may choose
to do something to complicate at their will and at their risk. But it
is not the case for virtio. We have a way to achieve nesting easily
already.

>
> >
> > >
> > > > >
> > > > > I want to repeat and emphasize that I am not ignoring the nested case.
> > > > >
> > > > > An extension for nesting would be the VF presented to the guest
> > > > > itself with
> > > > SR-IOV capability can work as_is as proposed here.
> > > >
> > > > How can a VF have the SR-IOV capability?
> > > >
> > > One option is by trap + emulation.
> >
> > Great.
> >
> > > Second is having it actually on the VF, which will follow the true definition of
> > nesting.
> >
> > How is VF allowed to have SR-IOV capability by the spec?
> >
> To support nesting, PCI-SIG can extend it.

How many layers do you want then?

>
> > >
> > > > > Michael presented the idea of the dummy PF, which is to represent
> > > > > the VF as
> > > > dummy PF which can do the SR-IOV with one VF.
> > > >
> > > > Why do we need the complicated SR-IOV emulation at the nesting level?
> > > You have to complicate one way or the other.
> >
> > How? I've demonstrated that you won't end up with such complications if
> > everything is self contained.
> The primary problem with self-contained is it is not fitting the requirements of passthrough.

Why not?

> How can we do self-contained interface without mediation where device context, dirty pages are lost, when device reset/flr occurs?

Michael demonstrated one possible method via admin commands, and I
pointed out a per-VF adminq in another thread. Why can't they work?

> Also the dma occurs in the guest.
> We need facility like PML where PML logs the pages in the VMM level, in virtio case to the owner PF.

Pages do not belong to a device, so it's fine to depend on other
entities. That's also my point: dirty page tracking is optional in
the migration.

>
> >
> > > And here it does not look complicated because it uses all existing defined
> > constructs available at VMM and GVM level.
> > > It follows both the principles you listed in the paper, i.e. (a) efficiency and (b)
> > equivalence property.
> >
> > In order to achieve (b), you need to have many PFs and many levels which is an
> > obvious unnecessary complication.
> >
> This is what you wanted to follow the paper.
> It does not need many PFs, at L0 there is one PF and N VFs.
> At L1, one VF is given with emulated config space that consist of SR-IOV capability.

How can you use this "PF" in L1 to migrate the VF in L2?

> This L1 VF allows creating new VF, one of the VF will be passed to L2.
>
> > >
> > > > How can you make sure such a design can result in a live migration
> > > > to be done at any levels?
> > > >
> > > I will propose design that is practical and has some use case.
> > > I will not propose theoretical work that no one will implement.
> >
> > Again, it's only a matter if you want to do everything in a passthrough mode,
> > this is not to the methodology proven by [1]. It's not a matter if you stick to
> > trapping.
> >
> I didn’t understand, but I don’t see a point of discussing passthrough vs non_passthrough.

Well, my point is, trap/emulation has been well studied but
passthrough has not. For passthrough, we need to know if it can work or not.
For example, do we need to disable all the sensitive instructions like
PM/FLR etc.?

>
>
> > >
> > > > E.g in LN, you had a PF and a VF. How to migrate the PF to this level?
> > > > You want two PFs in the L(N-1) level?
> > > >
> > > Likely yes as dummy PF with emulated caps.
> >
> > Ok, so you will have N PFs in L0 which is unrealistic. Not only because of the
> > limitation of the resources but also because there's no way for the hypervisor to
> > know how many levels of nesting are being used.
> >
> Only one PF in L0. Emulated PF in L1. Similar to how rest of the eco-system platform components are doing it.

See my above question.

> When whole platform commit to do N level nesting, it make sense for virtio to align.
> For example cpu vendors to commit to do N level nested page table traversal on pci read/writes, posted interrupts at N level, PML logging at N level.
> At that point virtio for N level nesting make sense.
>
> > >
> > > > > You need the support from the platform too, I guess TC can extend it.
> > > > > May be a different interface more suitable for nested case which
> > > > > do not have
> > > > performance needs.
> > > >
> > > > I disagree, it's about if the performance can satisfy the requirement at N
> > level.
> > > >
> > > > >
> > > > > How about a nested user to have AQ located on the VF so that
> > > > > mediation sw
> > > > can operate admin commands over self?
> > > >
> > > > I would go with such complicated architecture.
> > > >
> > > You like meant, you wouldn't, Right?
> >
> > Right.
> >
> > >
> > > Also, following your paper which clearly highlights, "execution of privileged
> > instruction in vm occurs, which would have effect of changing machine
> > resources".
> > > In the passthrough case it is not the privileged instruction because the
> > resource is not composed by the the machine, it is already done by the device".
> >
> > How do you know that? With save/load of a device state, you can
> > schedule/share a VF among multiple VMs. Then you still want to pass through
> > everything?
> You cannot share a VF among multiple VMs as each VM has its own isolated memory boundary, isolated by the IOMMU and MMU.
> PCI incoming requests of a specific RID cannot split to two different guest VMs.

All of the modern IOMMUs have a table to map RID to the I/O page
table, do you know why?
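
As a rough illustration of the RID question, here is a simplified C model of
the lookup: the requester ID of an incoming request selects an I/O page
table, and that binding can be reprogrammed. The names io_page_table,
rid_table, rid_attach and rid_owner are hypothetical, the flat array is an
assumption for brevity (real IOMMUs use multi-level structures), and the
bus/device/function split ignores ARI.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-VM I/O page table. */
struct io_page_table {
    int owner_vm;
};

/* A PCI requester ID is 16 bits: bus (8), device (5), function (3). */
#define RID_COUNT 65536
static struct io_page_table *rid_table[RID_COUNT];

static uint16_t make_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* Bind (or rebind) a requester ID to an I/O page table. */
static void rid_attach(uint16_t rid, struct io_page_table *pt)
{
    rid_table[rid] = pt;
}

/* Which VM's table would translate DMA from this RID; -1 if unbound. */
static int rid_owner(uint16_t rid)
{
    return rid_table[rid] != NULL ? rid_table[rid]->owner_vm : -1;
}
```

Under this model the same VF RID can point at one VM's table now and at
another VM's table later, which is one reading of how a VF could be
scheduled among VMs once its state can be saved and loaded.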

>
> > Let's just not invent a mechanism that can only work for a very
> > limited use case.
> >
> The use case you are quoting as limited is common one for passthrough users.

No, see my above question. It is just one case that is beyond what you
can have. There are for sure a lot of others.

>
> > > Hence for such cvq operation trap is not to be done for member virtio device.
> > >
> > > It would make sense to trap cvq for non virtio device, where cvq is composed
> > as part of the machine resource.
> > >
> > > > > Device mode commands will not be applicable there, instead some
> > > > > other
> > > > things to be done.
> > > > > So non passthrough mode software possibly can make use of it?
> > > >
> > > > It would be a great burden if you
> > > >
> > > > 1) use passthrough in L0
> > > > 2) use trap/emulation in L(N+1)
> > > >
> > > How is this different than Intel PML hw?
> >
> > Let me clarify my points, I meant.
> >
> > You can't simply use pass through in order to live migrate at any level. So what
> > you can do is:
> >
> > 1) using passthrough to VF in L0
> > 2) using trap/emulation for PF/VF in L1 and LN
> >
> > Isn't this much more complicated than simply having a self contained device for
> > VF, then you don't need the composition of PF in any level.
> > No?
> >
> The problem in self-contained is it is not able to do even #1.

Well, you know that at least CPU virtualization is self-contained? If
it can do that, why can't simple virtio?

>
> > >
> > > > >
> > > > > > That is to say for any CPU/hypervisor vendors, the architecture
> > > > > > should be designed to run any levels of nesting instead of just
> > > > > > an awkward 2 levels (but what you proposed can not work for even 2).
> > > > > Huh, some missing text for corner case as making claim,
> > > > > _not_working in not a
> > > > healthy discussion.
> > > > >
> > > > > > For x86 and KVM, any level of
> > > > > > nesting has been done for about 10 years ago.
> > > > > >
> > > > > I didn’t find hw for PML support in x86 for N or 3 level nesting. Did I miss?
> > > > > I didn’t find hw for nested page tables upto N level walking on
> > > > > the PCIe
> > > > read/writes in any cpu. Did I miss?
> > > >
> > > > You need first asking why it is a must to achieve nested
> > > > virtualization. All of those obstacles come only if you want to use
> > "passthrough" for any levels.
> > > >
> > > > > Have you seen nesting in hw works at N level?
> > > >
> > > > Again, hardware can't have endless resources for endless levels.
> > > Can you please list two or 3 hw features that are in hw, for > 2 levels?
> >
> > Why do I need to do this? What I'm saying is that hardware doesn't need to be
> > designed for N levels. What it needs to make sure to satisfy the requirement
> > proved by [1].
> >
> You need it because you want to follow the 3 principles listed in the paper, i.e. efficiency, equivalency and resource control.

Let's not shift concepts again.

Using the terminology of the paper, it says:

"The efficiency property. All innocuous instructions are executed by
the hardware directly, with no intervention at all on the part of the
control program."

The current trap/emulation model satisfies the requirement as it
treats the virtio/PCI control path as sensitive, so efficiency is not
a requirement.

Looking at your proposal, you keep saying you have good performance
but you still fail to prove why, in your proposal, e.g. the PM
instruction is innocuous (or can run natively).

That's the point.
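
As a rough illustration of the efficiency property quoted above, here is a
minimal trap-and-emulate dispatch sketch. The operation names, the SENSITIVE
set and the helper functions are hypothetical, chosen only for illustration;
they come neither from the paper nor from the virtio spec. The set holds the
two items that both sides of this thread treat as intercepted (PCI config
space and MSI-X); where other accesses belong is the point under discussion.

```python
# Hypothetical set of operations that the control program intercepts.
SENSITIVE = {"pci_config_write", "msix_table_write"}


def run_native(op):
    """Stand-in for the hardware executing an innocuous operation directly."""
    return f"hw:{op}"


def emulate(op):
    """Stand-in for the control program (VMM) emulating a sensitive operation."""
    return f"vmm:{op}"


def dispatch(op, native=run_native, trap=emulate):
    """Innocuous operations run natively; sensitive ones trap to the VMM."""
    if op in SENSITIVE:
        return ("trapped", trap(op))
    return ("native", native(op))
```

Under this sketch, claiming that an operation may be passed through amounts to
arguing that it can safely stay out of SENSITIVE.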

>
> > >
> > > > Trap and
> > > > emulation is a must for achieving nesting virtualization. If you try
> > > > to invent a passthrough method that can work for any level, you will
> > > > probably fail
> > >
> > > It at least follows the design principle of the paper you suggested.
> >
> > I don't see it this way, see the above reply. The paper is for trap and emulation
> > for sure but you propose to pass through everything.
> >
> > > I don’t see a point of designing something for N level nesting in first go when
> > rest eco system is not there to support it at hw level.
> >
> > Your design complicates the nesting a lot. We have hands-on methodology
> > which has been well studied since the 1970s where you refuse to start with.
> > Then you may end up with a lot of issues.
> >
> I don’t think so. When the hw eco-system is built for nesting, it make sense for virtio to do nesting acceleration.
> Otherwise method done in other nesting is enough for virtio.
>
> > What's more, your design is incomplete as it can't be used for migrating:
> >
> > 1) owner
> Owner migration is not requirement. That is just silly.

Why? VFIO allows you to assign PF, is it also silly?

> If one wants to migrate owner, an admin virtio device can be present outside of owner to migrate.

Then we end up with a recursion, no?

>
> > 2) virtio devices that doesn't structure as owner/member
> >
> As with spec 1.3. they are structured for PCI SR-IOV group type.

1.3 should be transport independent, no?

Thanks






> MMIO transport is just missing out on the advancement happening on the PCI transport.
> If there is user interest, one will do for MMIO too.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  2:41                                   ` Jason Wang
@ 2023-10-19  4:29                                     ` Parav Pandit
  2023-10-19  4:44                                       ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  4:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, October 19, 2023 8:11 AM
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Wednesday, October 18, 2023 6:23 AM
> >
> > > Why can suspend go directly from guest to device then?
> > >
> > Because all the virtio registers are treated equally by the live migration driver.
> > So why not?
> 
> Good, so you agree that if a new status bit is introduced then it can work with
> passthrough? So did the register for indices.
>
No. I replied to Michael about the use cases where the status bit method can work for non_passthrough.
 
> > As explained device synchronizing all the operations which are not mediated
> by VMM.
> >
> > If somehow you claim that all the synchronization is possible _only_
> > in software in various mediation layer, And it is impossible in single place in
> device, than I do not agree.
> > V2 listed most of the synchronization points of the device.
> >
> > > > When such device reset is done, it does not reset the device
> > > > context, nor it
> > > clears the dirty page records, because they are done by the controlling
> function.
> > > >
> > > > > I'm not convinced that the scalability is broken by just having
> > > > > 2 or
> > > > > 3 more registers. We all know MSIX requires much more than this.
> > > > >
> > > > Saying there is some other high resource consumer method exists,
> > > > so lets
> > > consume more in new interface we do now, is not good approach.
> > >
> > > This is self-contradictory and double standard. You allow #MSI-X
> > > vectors to grow but not config?
> > >
> >
> >
> > > > MSI-X on its v2 is underway. Hopefully it will be finished this
> > > > year, which is
> > > already cutting down O(N) resources.
> > >
> > > Why couldn't such a method be applied to config registers and others?
> > >
> > Because config registers are not de-duplicating type.
> > Meaning each config register is unique in nature.
> 
> In nature from the view of the device logic but not the transport like PCIE.
Does not matter.

> 
> > For such work a queue/dma approach is taken.
> > So it is applied to config registers and others to not place as _always_available
> registers.
> > We just need to do that in virtio too.
> > All the recent work of flow filters, counters no longer rely on the
> > config registers as we both agreed in discussion [1] with your comment
> 
> Yes but it serves a different purpose CVQ can't work before probing, no? Or are
> you saying you want to trap CVQ for migration?
>
For passthrough, nothing of the virtio interface (config, cvq, data vqs) is trapped.
For non_passthrough, the cvq, config registers, etc. need to be software composed.
 
> >
> > "Adding cvq is much easier than inventing(duplicating) the work of a
> transport."
If that cvq is on the PF, yes, which is the aq.
A second cvq cannot be placed on the VF, which the hypervisor does not have access to.

> >
> > [1]
> > https://lore.kernel.org/virtio-comment/CACGkMEseZeT4VX8Ut-7KraxLKNOMKOgFDNxqKofXSFT8yHfg-w@mail.gmail.com/#t
> > > >
> > > > > You can't solve all the issues in one series. As stated many
> > > > > times,
> > > > I am not solving all issues in one series. This series builds the
> infrastructure.
> > >
> > > It's you that is raising the scalability issue, and what you've
> > > ignored is that such an "issue" has existed for many years and
> > > various hardware has been built on top of that.
> > >
> > That does not mean one should continue with such issue.
> 
> Then how many MSI-X vectors do you expect to have for each VF? Why did you
> choose that value? Why forbid a vendor to have more than that?
>
The device has a finite number of MSI-X vectors. Cloud operators running VMs provision MSI-X vectors to the VF depending on its vcpus, queues, etc., per the SLA config.
 
> > And that hardware consumes more power and memory that results in overall
> device inefficiency.
> > You should have objected the IMS patches in Linux kernel, you should also
> object new MSI-X proposal and say just use registers.
> 
> Actually the reverse, why can't we use something similar to IMS? IMS allows
> storage in arbitrary places, no?
> 
One wants to stay away from registers. You didn’t object to the non_register interface there with an argument that "registers are done for years, so why IMS, why MSI-X improvement?".
Here, you are objecting to the non_register interface.

> >
> > > Again, we should make sure the function is correct before we can
> > > talk about others, otherwise it would be an endless discussion.
> > >
> > And using registers is not the way to make it correct.
> 
> Correct in what sense? You are actually
> 
> 1) doing a complete NACK of the existing PCI transport design

There is no such design existing.

> 2) prevent new features from developing based on the mature of a software
> device implementation

> 3) tie all new features to the owner/member structure which doesn't exist and
> unnecessary for transport other than PCI
>
That other transports have missed the notion of owner and member is purely their lack.
Once they improve, it will work fine.
 
> > Lets make sure that basic function for the member device to first level is
> correct.
> 

> Isn't this what I'm currently doing?
>
You are doing only for non_passthrough mode.
We are doing for first level for passthrough mode.

Your claim, as I see it, is that the only way to do this is non_passthrough via registers, because you suggest that "cvq is easy etc".
And this is where the main disagreement lies.

Let me ask a basic question.
Do you agree that there are two use cases?
1. device passthrough that does not trap virtio interface (= config space, cvq, data vqs, flow filter vqs, flr and more)

2. only data path is in device, and rest is software composed.

> >
> >
> > > >
> > > > > if you really
> > > > > care about the scalability, the correct way is to behave like a
> > > > > real transport through virtqueue instead of trying to duplicate
> > > > > the functionality of transport virtqueue slowly.
> > > > >
> > > > This will be impossible because no device will transport driver
> > > > notifications
> > > using a virtqueue.
> > > > Therefore, virtqueue is not some generic transport that does
> > > > everything - as
> > > simple as that. hence there is no transport virtqueue.
> > >
> > > You won't get such a wrong conclusion if you read that proposal.
> > I have read those 4 or 5 patches posted by Lingshan and showed you that
> time that driver notifications are not coming via virtqueue.
> > And if I missed it, and if they are coming via virtqueue, it does not meet the
> performance and "efficiency principle" from the paper you pointed.
> 
> Good and if you read the patches, you should know it allows transport specific
> notification or you meant:
> 
If you claim vq as transport, driver notifications must be coming using vq descriptors.
In the transport vq proposal, that is not the case.

> MMIO device notification is not coming via the MMIO, so "MMIO"
> transport is not efficient?
> 
> >
> > >
> > > >
> > > > And virtqueue for bulk data transfer exists so no need to invent
> > > > yet another
> > > thing without a good reason.
> > >
> > > I don't understand why this is related to transport virtqueue
> > > anyhow, it's also a queue interface, no?
> > Transport virtqueue is diversion of unrelated topic here.
> 
> You're talking about scalability, and you are saying registers are the major
> blocker, but you stick to registers for guest to use and think it is scalable.
>
We cannot change the past of what is already present in the spec.
Which new registers?
We are not introducing any new registers in this proposal.

> How can I understand your point from the self contradictory statement above?
> 
> My point is simple, if your complaint about registers is true and if you really care
> about scalability, you should not use any register for virtio at all, and then you
> will end up with the transport virtqueue.
Perfect, we are not introducing any new registers; what is there is there, and that cannot be changed.
The SIOV proposal that we are building will not have any new registers other than those needed at init time.

> 
> >
> > A guest vm driver must be able to talk to the member device for all queue
> configuration etc through its own channel not mediated by the hypervisor.
> 
> You just invented a legacy tunnel which requires mediation, no?
> 
It is for legacy.
Nothing is done for new interface.

> > Otherwise such plumbing does not work for any confidential compute
> workload.
> 
> I'm pretty sure your conclusion here is wrong. Let's not keep raising unrelated
> issues here or you don't want to converge this series. Let's open a new thread
> with CC guys if you wish.
> 
> > Hence, I wouldn’t discuss transport virtqueue for now.
> 
> Again, if you don't want to talk about transport virtqueue, that's fine. But let's
> leave the scalability issue aside as well.
>
Registers are related to functionality and scale.

Let's first agree on the use cases, which I asked about above, before the design.

I will wait to respond to any other emails until we agree on use case requirements.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  4:29                                     ` Parav Pandit
@ 2023-10-19  4:44                                       ` Jason Wang
  2023-10-19  5:31                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-19  4:44 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

> > Again, if you don't want to talk about transport virtqueue, that's fine. But let's
> > leave the scalability issue aside as well.
> >
> Registers are related for functionality and scale.
>
> Lets first agree on use case before the design, that I asked above.
>
> I will wait to respond to any other emails until we agree on use case requirements.

It is not just me who wants you to define "passthrough" first, which
you refuse to respond to.

How could we reach any agreement without an accurate definition of
"passthrough", which is key to understanding each other?

Thanks



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  4:44                                       ` Jason Wang
@ 2023-10-19  5:31                                         ` Parav Pandit
  2023-10-19  6:35                                           ` Michael S. Tsirkin
  2023-10-23  3:45                                           ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  5:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-open.org>
> On Behalf Of Jason Wang
> Sent: Thursday, October 19, 2023 10:15 AM
> 
> > > Again, if you don't want to talk about transport virtqueue, that's
> > > fine. But let's leave the scalability issue aside as well.
> > >
> > Registers are related for functionality and scale.
> >
> > Lets first agree on use case before the design, that I asked above.
> >
> > I will wait to respond to any other emails until we agree on use case
> requirements.
> 
> There are more than just me who want you to define "passthrough" first where
> you refuse to respond.
> 
Totally disagree.
In the previous email itself, I wrote what passthrough is.
So let's try one more time.
Either you can re-read the last email or, better, read below and see whether it is understood.

> How could we make any agreement without an accurate the definition of
> "passthrough" who is a key to understand each other?

I replied a few times in past emails, but since those email threads are so long, it is easy to miss.

Passthrough definition:
a. the virtio member device is mapped to the guest vm
b. only the pci config space and msix of a member device are intercepted by the hypervisor.
c. the virtio config space, virtio cvqs and data vqs of a member device are directly accessed by the guest vm without being intercepted by the hypervisor.

(Why b? No grand reason; it is how the hypervisors into which the virtio member device is integrated work today.)


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  5:31                                         ` Parav Pandit
@ 2023-10-19  6:35                                           ` Michael S. Tsirkin
  2023-10-19  7:30                                             ` Parav Pandit
  2023-10-23  3:45                                           ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  6:35 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 05:31:37AM +0000, Parav Pandit wrote:
> > How could we make any agreement without an accurate the definition of
> > "passthrough" who is a key to understand each other?
> 
> I replied few times in past emails but since those email threads are so long, it is easy to miss out.
> 
> Passthrough definition:
> a. virtio member device mapped to the guest vm
> b. only pci config space and msix of a member device is intercepted by hypervisor.
> c. virtio config space, virtio cvqs, data vqs of a member device is directly accessed by the guest vm without intercepted by the hypervisor.
> 
> (Why b?, no grand reason, it is how the hypervisors are working where to integrate the virtio member device to).

I think it's a reasonable use-case, though of course not at all the only
way to design a system. Some more ways:
2- intercept everything except data vqs and cvqs
	I think this is a reasonable way to build the system and has a bunch
	of advantages short term. The main disadvantage as compared to
	passthrough is the need to keep config space coherent with
	device operation - the way to do it is device specific and
	might get fragile.

3- intercept everything except data vqs
	Here we get another problem in isolating some vqs but not
	others. The problem becomes bigger in that you also
	need to communicate control vq to the device.

also, with both of the above options, we have a question of how
are we communicating with the device to keep control path
and data path in sync when device's dma is mapped to guest.
using PASIDs for isolation might work but again, support is
far from universal so we can't really assume it as
the only way in the spec.

Absent PASID the popular way seems to be shadow vq which basically does

4- software intercept for everything
       clearly that's a lot of CPU overhead, I do not think we can focus on that
       as the only way in the spec, though some hypervisors might
       already have a lot of migration overhead to the point where
       virtio can afford any amount of overhead and it won't be
       measurable.


I also note some or all of the intercepts can always come and go.  For
example, a common setup is that if target VCPUs are running then IOMMU
will inject interrupts directly into guest - if not you generally trap
to hypervisor. Similarly, shadow vq might be active just temporarily.

Which approach is best? I feel ideally virtio would find ways to support
them all rather than deciding on a policy in the spec.
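
To keep the options straight, here is a minimal sketch of the modes discussed
above, expressed as the set of interfaces the hypervisor intercepts in each.
The names (Iface, INTERCEPTED, is_trapped and the mode labels) are made up for
illustration and are not spec terms; the passthrough entry follows the
definition quoted at the top of this mail, and the other three follow the
list above.

```python
from enum import Flag, auto


class Iface(Flag):
    """Interfaces of a member device that a hypervisor could intercept."""
    PCI_CONFIG = auto()
    MSIX = auto()
    VIRTIO_CONFIG = auto()
    CVQ = auto()
    DATA_VQ = auto()


ALL = (Iface.PCI_CONFIG | Iface.MSIX | Iface.VIRTIO_CONFIG
       | Iface.CVQ | Iface.DATA_VQ)

# Hypothetical mode labels mapped to what the hypervisor intercepts.
INTERCEPTED = {
    # only pci config space and msix are intercepted
    "passthrough": Iface.PCI_CONFIG | Iface.MSIX,
    # intercept everything except data vqs and cvqs
    "all_but_data_vqs_and_cvqs": ALL & ~(Iface.CVQ | Iface.DATA_VQ),
    # intercept everything except data vqs
    "all_but_data_vqs": ALL & ~Iface.DATA_VQ,
    # software intercept for everything (e.g. shadow vq)
    "full_software_intercept": ALL,
}


def is_trapped(mode, iface):
    """Return True if the hypervisor intercepts `iface` in the given mode."""
    return bool(INTERCEPTED[mode] & iface)
```

Since intercepts can come and go, a real system would treat the active mask as
runtime state rather than a fixed property of the device; the static table here
is only for comparison.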

-- 
MST



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  6:35                                           ` Michael S. Tsirkin
@ 2023-10-19  7:30                                             ` Parav Pandit
  2023-10-19  8:31                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  7:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 19, 2023 12:05 PM
> 
> On Thu, Oct 19, 2023 at 05:31:37AM +0000, Parav Pandit wrote:
> > > How could we make any agreement without an accurate the definition
> > > of "passthrough" who is a key to understand each other?
> >
> > I replied few times in past emails but since those email threads are so long, it
> is easy to miss out.
> >
> > Passthrough definition:
> > a. virtio member device mapped to the guest vm b. only pci config
> > space and msix of a member device is intercepted by hypervisor.
> > c. virtio config space, virtio cvqs, data vqs of a member device is directly
> accessed by the guest vm without intercepted by the hypervisor.
> >
> > (Why b?, no grand reason, it is how the hypervisors are working where to
> integrate the virtio member device to).
> 
> I think it's a reasonable use-case, though of course not at all the only way to
> design a system. 
Sure, there are more ways to bisect the device, especially when the underlying device is not a virtio device.
But one can continue bisecting virtio as well, as you listed below.
> Some more ways:
> 2- intercept everything except data vqs and cvqs
> 	I think this is a reasonable way to build the system and has a bunch
> 	of advantages short term. The main disadvantage as compared to
> 	passthrough is the need to keep config space coherent with
> 	device operation - the way to do it is device specific and
> 	might get fragile.
> 
Yes, I agree it has short term advantages.
This is not future proof as you listed.

> 3- intercept everything except data vqs
> 	Here we get another problem in isolating some vqs but not
>         others. the problem becomes bigger is that you also
> 	need to communicate control vq to the device.
> 
Yes. For non-virtio devices, vendors have an easy way to support this.
We supported this for mlx5 devices.

> also, with both of the above options, we have a question of how are we
> communicating with the device to keep control path and data path in sync when
> device's dma is mapped to guest.
> using PASIDs for isolation might work but again, support is far from universal so
> we can't really assume it as the only way in the spec.
> 
Right.

> Absent PASID the popular way seems to be shadow vq which basically does
> 
> 4- software intercept for everything
>        clearly that's a lot of CPU overhead, I do not think we can focus on that
>        as the only way in the spec, though some hypervisors might
>        already have a lot of migration overhead to the point where
>        virtio can afford any amount of overhead and it won't be
>        measureable.
> 
> 
> I also note some or all of the intercepts can always come and go.  For example,
> a common setup is that if target VCPUs are running then IOMMU will inject
> interrupts directly into guest - if not you generally trap to hypervisor. Similarly,
> shadow vq might be active just temporarily.
> 
> Which approach is best? I feel ideally virtio would find ways to support them all
> rather than deciding on a policy in the spec.

Cooking all the modes seems frankly very daunting to me, especially when there is no existing software stack to consume all modes and no device vendor to sign off on _all_ variations.

To me, two stacks are practical and common to target at the beginning, i.e.:
1. passthrough mode

2. #2 above.
I had real technical difficulty making #2 work in practice, building a scalable device, and having a converged API with #1.
The option we explored, placing the admin command in some register of the VF specifically for #2, is partially fine when targeted at use case #2 only.

As a variation of that, the member device has an owner device, hence admin commands on the AQ can be used.

If we can converge on a common virtio interface between #1 and #2, great.
If we cannot, due to technical issues, we shouldn't step on each other's toes; instead we should build two interfaces for the two different use cases, each overcoming its own technical challenges.

And when, in the future, someone wants to implement a different kind of bisection, they can propose the extensions.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18  9:48                                           ` Parav Pandit
  2023-10-18  9:56                                             ` Michael S. Tsirkin
@ 2023-10-19  8:15                                             ` Zhu, Lingshan
  2023-10-19  9:01                                               ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-19  8:15 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/18/2023 5:48 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Wednesday, October 18, 2023 2:13 PM
>>
>> On 10/18/2023 3:20 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Wednesday, October 18, 2023 12:22 PM
>>>>
>>>> On 10/18/2023 2:41 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Wednesday, October 18, 2023 12:06 PM
>>>>>>
>>>>>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu, Lingshan
>>>>>>>> Sent: Monday, October 16, 2023 3:18 PM
>>>>>>>>
>>>>>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Friday, October 13, 2023 3:14 PM
>>>>>>>>>>>>>> How do you transfer the ownership?
>>>>>>>>>>>>> An additional ownership deletgation by a new admin command.
>>>>>>>>>>>> if you think this can work, do you want to cook a patch to
>>>>>>>>>>>> implement this before you submitting this live migration series?
>>>>>>>>>>> I answered this already above.
>>>>>>>>>> talk is cheap, show me your patch
>>>>>>>>> Huh. We presented the infrastructure that migrates, 30+ device
>>>>>>>>> types,
>>>>>>>> covering device context ideas from Oracle.
>>>>>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
>>>>>>>>>
>>>>>>>>> Please have some respect for other members who covered more
>>>>>>>>> ground than
>>>>>>>> your series.
>>>>>>>>> What more? Apply the same nested concept on the member device as
>>>>>>>> Michael suggested, it is nested virtualization maintain exact
>>>>>>>> same
>>>> semantics.
>>>>>>>>> So a VF is mapped as PF to the L1 guest.
>>>>>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
>>>>>>>>>
>>>>>>>>> This nested work can be extended in future, once first level
>>>>>>>>> nesting is
>>>>>>>> covered.
>>>>>>>>>> Answer all questions above, if you think a management VF can
>>>>>>>>>> work, please show me your patch.
>>>>>>>>> The idea evolves from technical debate then pointing fingers
>>>>>>>>> like your
>>>>>>>> comment.
>>>>>>>>> I think a positive discussion with Michael and a pointer to the
>>>>>>>>> paper from
>>>>>>>> Jason gave a good direction of doing _right_ nesting that follows
>>>>>>>> two
>>>>>> principles.
>>>>>>>>> a. efficiency property
>>>>>>>>> b. equivalence property
>>>>>>>>>
>>>>>>>>> (c. resource control is natural already)
>>>>>>>>>
>>>>>>>>> Both apply at VMM and at VM level enabling recursive
>>>>>>>>> virtualization, by
>>>>>>>> having VF that can act as PF inside the guest.
>>>>>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
>>>>>>>> Please just show me your patch resolving these opens, how about
>>>>>>>> start from defining virito-fs device context and your management VF?
>>>>>>> As answered, device context infrastructure is done, per device
>>>>>>> specific device-
>>>>>> context will be defined incrementally.
>>>>>>> I will not be including virtio-fs in this series. It will be done
>>>>>>> incrementally in
>>>>>> future utilizing the infrastructure build in this series.
>>>>>> Done? How do you conclude this? You just tell me what is the full
>>>>>> set of virito-fs device context now and how to migrate them.
>>>>>>
>>>>>> You cant? you refuse or you don't? Do you expect the HW designer to
>>>>>> figure out by themself?
>>>>> I won't be able to tell now, as I don't think it is necessary for this series.
>>>>> If one out of 30 devices cannot migrate because an unimaginable amount of complexity has been placed there, maybe one will not implement it as a member device.
>>>>> From experience with migratable complex GPU devices and RDMA devices (stateful, having a hundred thousand stateful QPs), my understanding is that the complex state of virtio-fs can be defined and migrated.
>>>>> The mlx5 driver consists of 150,000 lines of code, and that device is migratable with complex state.
>>>>> So I am optimistic that virtio-fs can be migratable too.
>>>>> It does not have to be limited by my limited creativity of 2023.
>>>>> Maybe I am wrong; in that case one will not implement a passthrough virtio-fs device.
>>>> Your series wants to migrate device context, but doesn't define
>>>> device context; does this sound reasonable?
>>> Device generic context is defined at [1] and also the infrastructure for defining the device context in parallel by multiple people can be done post the work of [1].
>>> Per each device type context will be defined incrementally post this work.
>>>
>>> [1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html
>> This should not come after the work; you should define them before you use them in this series.
>>
> I don't agree to boil the ocean in this patch series.
> No practical spec development community does it.
> As long as we feel comfortable that the device context framework is extensible, it is fine.
> If virtio-fs seems very hard, maybe one will come up with a new lightweight FS device. I really don't know.
So you want to migrate device context, but refuse to define it?
>
>> And you need to prove why admin vq are better than registers solution if you
>> want a merge.
> Michael already responded on the practical aspects.
> Since you may claim I didn't answer, below are the technical details.
>
> Admin commands and the AQ are better for the reasons below, in my view:
>
> Functionally better:
> 1. When the live migration registers are located on the VF itself, the VMM does not have control of them.
> These registers are reset on FLR and device reset, because they are virtio registers of the device.
> Hence, the VMM loses the state for the job that the VMM was supposed to do.
> Therefore, passthrough mode cannot depend on these registers.
>
> 2. Any bulk data transfer of device context and dirty page tracking requires DMA.
> Hence that DMA must happen on a device which is different from the VF itself.
> If it is on the VF itself, it has two problems.
> 2.a. VF device reset and FLR will clear them, and the device context is lost.
>
> 2.b. The DMA occurs at the PCI RID level.
> The IOMMU cannot bifurcate the DMA of one RID into two different address spaces, of guest and hypervisor.
> This requires PASID support.
> Using PASID has the following problems.
> 2.b.1 PASID is typically not used by kernel software. It is only meant for user processes.
> Hence reserving a PASID for kernel work won't be acceptable to the upstream kernel.
> 2.b.2 If this is somehow done, then when the VF itself supports PASID, it now requires vPASID support.
> This is again not where the industry is going in other forums that I am part of. Hence, it will be a failure for virtio, and I do not recommend the vPASID route.
> 2.b.3 One of the widely used CPUs seems to have dropped the support due to a limitation of an instruction around PASID.
> So it cannot be used there; this further limits virtio passthrough users.
>
> Even if 2.b.2 and 2.b.3 are somehow overcome in theory, #1 and 2.a are functional problems.
>
> Scale wise better:
> 3. Admin commands and the admin vq are used _only_ when one issues a device migration command.
> One does not migrate VMs every few msec.
> Hence such functionality is better done in a way that is efficient for performance but does not consume on-chip memory.
> Admin commands and the admin vq satisfy those.
>
> 4. Once the software matures further, admin commands would prefer a completion interrupt instead of polling.
> How to get a notification/interrupt? Well, virtqueue defines this already.
> Should we replicate that in some PF registers?
> It can be. But once you put all the functionality of admin commands and the AQ in registers, the whole thing becomes yet another register_q.
>
> 5. Can these registers be placed in the PF to overcome #1 and #2 for passthrough?
> In theory yes.
> In practice, no, as there are many commands that flow, which need to scale to a reasonable number of VFs.
> Admin commands over the admin vq provide this generic facility.
>
> 6. Most modern devices that attempt to scale cut down their register footprint; registers are used only for main bootstrap and init-time config work.
> Even in virtio spec, one can read:
> "Device configuration space is generally used for rarely changing or initialization-time parameters."
>
> Adding some additional registers to a PF device config space for non-init-time parameters does not make sense.
>
> 7. Additionally, nested virtualization should be done by truly nesting the device at the right abstraction point of the owner-member relationship.
> This follows the two principles of (a) efficiency and (b) equivalence described in the paper Jason pointed to.
> And if we ask for a nested VF extension, we will get our guidance from PCI-SIG on whether it should be done and whether it matches the rest of the ecosystem components that support or don't support nesting.
If they are true, shall we refactor virtio-pci common cfg
functionalities to use admin vq?
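
Points 1 and 2.a of the quoted list can be illustrated with a toy model. This is only a sketch of the argument as stated above, not anything from the patch series or the virtio specification: the class names, the `migration_reg` field, and the "stop" mode string are all hypothetical. It shows migration state that lives in the VF's own registers being cleared by a guest-triggered FLR, while state the owner device tracks through an admin command is untouched.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MemberVF:
    """Toy member device: everything stored here is cleared by FLR."""
    device_status: int = 0
    migration_reg: int = 0  # hypothetical VF-resident migration register

    def function_level_reset(self) -> None:
        # FLR returns the function's own registers to their reset values.
        self.device_status = 0
        self.migration_reg = 0


@dataclass
class OwnerPF:
    """Toy owner device: per-member state is set through admin commands."""
    member_mode: Dict[int, str] = field(default_factory=dict)

    def admin_set_mode(self, vf_id: int, mode: str) -> None:
        # Hypothetical stand-in for an admin command addressed to one member.
        self.member_mode[vf_id] = mode


def migrate_while_guest_resets(pf: OwnerPF, vf: MemberVF, vf_id: int) -> None:
    # The VMM records its intent in both places, then the guest issues FLR.
    vf.migration_reg = 1
    pf.admin_set_mode(vf_id, "stop")
    vf.function_level_reset()
```

After `migrate_while_guest_resets(pf, vf, 7)`, `vf.migration_reg` is back to 0 while `pf.member_mode[7]` still holds the value the VMM wrote, which is the asymmetry the argument relies on.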


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18 10:47                                                 ` Michael S. Tsirkin
  2023-10-18 10:57                                                   ` Parav Pandit
@ 2023-10-19  8:18                                                   ` Zhu, Lingshan
  2023-10-19  8:37                                                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-19  8:18 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/18/2023 6:47 PM, Michael S. Tsirkin wrote:
> On Wed, Oct 18, 2023 at 10:22:57AM +0000, Parav Pandit wrote:
>>> From: Michael S. Tsirkin <mst@redhat.com>
>>> Sent: Wednesday, October 18, 2023 3:26 PM
>>> For completeness, and to shorten the thread, can you please list known
>>> issues/use cases that are addressed by the status bit interface and how you plan
>>> for them to be addressed?
>> I will avoid listing known issues for a moment for status bit in this email.
>>
>> Status bit interface helps in following good ways.
>> 1. suspend/resume the device fully by the guest by negotiating the new feature.
>> This can be useful in the guest-controlled PM flows of suspend/resume.
>> I still think for this, only feature bit is necessary, and device_status modification is not needed.
>> D0->D3 and D3->D0 transition of the pci can suspend and resume the device which can preserve the last device_status value before entering D3.
>> (Like preserving all rest of the fields of common and other device config).
>> This is orthogonal and needed regardless of device migration.
>>
>> 2. If one does not want to passthrough a member device, but build a mediation-based device on top of existing virtio device,
>> It can be useful with mediating software.
>> Here the mediating software has ample duplicated knowledge of what the member device already has.
>> This can fulfil the nested requirement differently provided a platform support it.
>> (PASID limitation will be practical blocker here).
>>
>> How do I plan to address the above two?
>> a. #1 to be addressed by having the _F_PM bit, when the bit is negotiated PCI PM drives the state.
>> This will work orthogonal to VMM side migration and will co-exist with VMM based device migration.
> OK that sounds kind of reasonable. Lingshan, Jason are you interested in
> suspend/resume? Want to start a thread on best way to support that?
suspend/resume a device through PM? why? is the status bit in my series 
better?
>
>> b. nested use case:
>> L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
>> L1 guest to enable SR-IOV and mapping the VF to L2 guest.
>> Consulting industry ecosystem to support nested outside of virtio.
> Can't say I like this much, *a lot* of things to implement,
> and burning up a VF for control path is not nice.
> As an alternative, I suggest a new admin command pci capability
> with basically a PA and a valid bit. Easy to emulate and add to
> a VF. And maybe some way to suggest a safe place for it that
> won't conflict with anything? Still trying to figure out if
> we should add PASID in there, or what. Maybe optionally?
> If actual hardware does it we'd be burning up 20 bits,
> but for a software implementation it's free.
>
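
The alternative suggested in the quoted text (a capability holding "a PA and a valid bit", optionally with a PASID) can be sketched as a pair of register values. The layout below is purely an assumption made for illustration: the flag positions, `PASID_SHIFT`, and the `AdminCmdCap` name do not come from the thread or from the virtio specification. Only the 20-bit PASID width matches the "20 bits" mentioned above.

```python
from typing import NamedTuple, Optional, Tuple

PASID_BITS = 20                       # PASID width, the "20 bits" in the text
PASID_MASK = (1 << PASID_BITS) - 1

# Hypothetical control-word layout, chosen only for this sketch.
FLAG_VALID = 1 << 0                   # PA field holds a usable address
FLAG_PASID_VALID = 1 << 1             # PASID field is meaningful
PASID_SHIFT = 12                      # PASID occupies bits 12..31


class AdminCmdCap(NamedTuple):
    pa: int                           # physical address of the command buffer
    valid: bool
    pasid: Optional[int] = None       # None means "no PASID supplied"


def encode(cap: AdminCmdCap) -> Tuple[int, int]:
    """Pack the capability into a 64-bit PA and a 32-bit control word."""
    if not 0 <= cap.pa < (1 << 64):
        raise ValueError("pa must fit in 64 bits")
    ctrl = FLAG_VALID if cap.valid else 0
    if cap.pasid is not None:
        if not 0 <= cap.pasid <= PASID_MASK:
            raise ValueError("pasid must fit in 20 bits")
        ctrl |= FLAG_PASID_VALID | (cap.pasid << PASID_SHIFT)
    return cap.pa, ctrl


def decode(pa: int, ctrl: int) -> AdminCmdCap:
    """Inverse of encode()."""
    pasid = None
    if ctrl & FLAG_PASID_VALID:
        pasid = (ctrl >> PASID_SHIFT) & PASID_MASK
    return AdminCmdCap(pa, bool(ctrl & FLAG_VALID), pasid)
```

A software implementation can leave `pasid` as `None` at no cost, whereas a hardware implementation of this hypothetical layout would have to reserve the 20 PASID bits whether or not they are used.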



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  7:30                                             ` Parav Pandit
@ 2023-10-19  8:31                                               ` Michael S. Tsirkin
  2023-10-19  8:58                                                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  8:31 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 07:30:09AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 19, 2023 12:05 PM
> > 
> > On Thu, Oct 19, 2023 at 05:31:37AM +0000, Parav Pandit wrote:
> > > > How could we make any agreement without an accurate the definition
> > > > of "passthrough" who is a key to understand each other?
> > >
> > > I replied few times in past emails but since those email threads are so long, it
> > is easy to miss out.
> > >
> > > Passthrough definition:
> > > a. virtio member device mapped to the guest vm
> > > b. only pci config space and msix of a member device is intercepted by hypervisor.
> > > c. virtio config space, virtio cvqs, data vqs of a member device is directly accessed by the guest vm without intercepted by the hypervisor.
> > >
> > > (Why b?, no grand reason, it is how the hypervisors are working where to integrate the virtio member device to).
> > 
> > I think it's a reasonable use-case, though of course not at all the only way to
> > design a system. 
> Sure, there are more ways to bisect the device, specially when underlying device is not a virtio device.
> But one can continue bisecting virtio as well as you listed below.
> > Some more ways:
> > 2- intercept everything except data vqs and cvqs
> > 	I think this is a reasonable way to build the system and has a bunch
> > 	of advantages short term. The main disadvantage as compared to
> > 	passthrough is the need to keep config space coherent with
> > 	device operation - the way to do it is device specific and
> > 	might get fragile.
> > 
> Yes, I agree it has short term advantages.
> This is not future proof as you listed.
> 
> > 4- intercept everything except data vqs
> > 	Here we get another problem in isolating some vqs but not
> >         others. the problem becomes bigger is that you also
> > 	need to communicate control vq to the device.
> > 
> Yes. for non virtio device vendors have easy way to support.
> We supported this for mlx5 devices.
> 
> > also, with both of the above options, we have a question of how are we
> > communicating with the device to keep control path and data path in sync when
> > device's dma is mapped to guest.
> > using PASIDs for isolation might work but again, support is far from universal so
> > we can't really assume it as the only way in the spec.
> > 
> Right.
> 
> > Absent PASID the popular way seems to be shadow vq which basically does
> > 
> > 4- software intercept for everything
> >        clearly that's a lot of CPU overhead, I do not think we can focus on that
> >        as the only way in the spec, though some hypervisors might
> >        already have a lot of migration overhead to the point where
> >        virtio can afford any amount of overhead and it won't be
> >        measureable.
> > 
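
The passthrough definition and the alternative interception modes listed above can be summarised as a small table of which device components the hypervisor traps. This is a reading of the quoted text, not a definition from the specification; the mode names and component labels are made up for this sketch.

```python
# Components of a member device as named in the discussion above.
COMPONENTS = frozenset(
    {"pci_config", "msix", "virtio_config", "cvq", "data_vq"}
)

# What the hypervisor intercepts in each mode (labels are hypothetical).
INTERCEPTED = {
    # Passthrough as defined above: only PCI config space and MSI-X.
    "passthrough": frozenset({"pci_config", "msix"}),
    # "intercept everything except data vqs and cvqs"
    "all_but_cvq_and_data_vq": COMPONENTS - {"cvq", "data_vq"},
    # "intercept everything except data vqs"
    "all_but_data_vq": COMPONENTS - {"data_vq"},
    # "software intercept for everything" (shadow vq)
    "software_everything": COMPONENTS,
}


def direct_access(mode: str) -> frozenset:
    """Components the guest reaches without a hypervisor trap."""
    return COMPONENTS - INTERCEPTED[mode]
```

For example, `direct_access("passthrough")` contains the virtio config space, the control vqs and the data vqs, while `direct_access("software_everything")` is empty.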
> > 
> > I also note some or all of the intercepts can always come and go.  For example,
> > a common setup is that if target VCPUs are running then IOMMU will inject
> > interrupts directly into guest - if not you generally trap to hypervisor. Similarly,
> > shadow vq might be active just temporarily.
> > 
> > Which approach is best? I feel ideally virtio would find ways to support them all
> > rather than deciding on a policy in the spec.
> 
> Cooking all the modes seems frankly very daunting to me specially when
> there is no existing software stack to consume all modes and no device
> vendor to sign of for _all_ variations.

Not addressing all the modes.  We are building components not stacks.
Components need to be reusable not stack specific.

Was the whole admin command interface with its levels of indirection
a design mistake then? It was designed exactly to support
all kind of models.
> 
> To me, two stacks are practical and common to target at beginning.
> i.e.
> 1. passthrough mode 
> 
> 2. #2 above,
> I had real technical difficulty to make #2 practically work and build a scalable device and have converged api with #1.
> The option we explored to have admin command in some register of the VF specific for #2 is partially fine targeted for use case #2 only.

Right. So - a way to send admin commands to a VF directly, perhaps in
config space? Do we need more than PA+PASID+some flags?
Want to try to write something like this up?

> A variation of that for the member device, there is owner device, hence admin command on the AQ can be used.
> 
> If we can converge on common virtio interface between #1 and #2, great.
> If we cannot be due to technical issues, we shouldn't step on each other's toes, instead build the two interfaces for two different use cases overcoming its own technical challenges.
> 
> And when in future, someone want to implement different kind of bisections, they can propose the extensions.

Not good at all, this means the interface is very narrow.
Your "propose an extension" just doesn't work practically.
It takes years for things to be widely deployed in the field,
by the time they are there are more use-cases.
We need something universal and admin commands were supposed to be
just this.

-- 
MST



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  8:18                                                   ` Zhu, Lingshan
@ 2023-10-19  8:37                                                     ` Michael S. Tsirkin
  2023-10-19  8:49                                                       ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  8:37 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 04:18:42PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 10/18/2023 6:47 PM, Michael S. Tsirkin wrote:
> > On Wed, Oct 18, 2023 at 10:22:57AM +0000, Parav Pandit wrote:
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Wednesday, October 18, 2023 3:26 PM
> > > > For completeness, and to shorten the thread, can you please list known
> > > > issues/use cases that are addressed by the status bit interface and how you plan
> > > > for them to be addressed?
> > > I will avoid listing known issues for a moment for status bit in this email.
> > > 
> > > Status bit interface helps in following good ways.
> > > 1. suspend/resume the device fully by the guest by negotiating the new feature.
> > > This can be useful in the guest-controlled PM flows of suspend/resume.
> > > I still think for this, only feature bit is necessary, and device_status modification is not needed.
> > > D0->D3 and D3->D0 transition of the pci can suspend and resume the device which can preserve the last device_status value before entering D3.
> > > (Like preserving all rest of the fields of common and other device config).
> > > This is orthogonal and needed regardless of device migration.
> > > 
> > > 2. If one does not want to passthrough a member device, but build a mediation-based device on top of existing virtio device,
> > > It can be useful with mediating software.
> > > Here the mediating software has ample duplicated knowledge of what the member device already has.
> > > This can fulfil the nested requirement differently provided a platform support it.
> > > (PASID limitation will be practical blocker here).
> > > 
> > > How do I plan to address the above two?
> > > a. #1 to be addressed by having the _F_PM bit, when the bit is negotiated PCI PM drives the state.
> > > This will work orthogonal to VMM side migration and will co-exist with VMM based device migration.
> > OK that sounds kind of reasonable. Lingshan, Jason are you interested in
> > suspend/resume? Want to start a thread on best way to support that?
> suspend/resume a device through PM? why? is the status bit in my series
> better?

I don't know. If we can please stop discussing suspend in live migration
thread and have one with focus on suspend then maybe that's exactly what
is needed.  Is there even a problem we need to solve? Do we need to stop
vqs or is reset as done currently good enough?
Let's focus on that and then hopefully these flamewars can finally stop?

> > 
> > > b. nested use case:
> > > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> > > Consulting industry ecosystem to support nested outside of virtio.
> > Can't say I like this much, *a lot* of things to implement,
> > and burning up a VF for control path is not nice.
> > As an alternative, I suggest a new admin command pci capability
> > with basically a PA and a valid bit. Easy to emulate and add to
> > a VF. And maybe some way to suggest a safe place for it that
> > won't conflict with anything? Still trying to figure out if
> > we should add PASID in there, or what. Maybe optionally?
> > If actual hardware does it we'd be burning up 20 bits,
> > but for a software implementation it's free.
> > 



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  8:37                                                     ` Michael S. Tsirkin
@ 2023-10-19  8:49                                                       ` Zhu, Lingshan
  2023-10-19  8:55                                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-19  8:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/19/2023 4:37 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 19, 2023 at 04:18:42PM +0800, Zhu, Lingshan wrote:
>>
>> On 10/18/2023 6:47 PM, Michael S. Tsirkin wrote:
>>> On Wed, Oct 18, 2023 at 10:22:57AM +0000, Parav Pandit wrote:
>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>> Sent: Wednesday, October 18, 2023 3:26 PM
>>>>> For completeness, and to shorten the thread, can you please list known
>>>>> issues/use cases that are addressed by the status bit interface and how you plan
>>>>> for them to be addressed?
>>>> I will avoid listing known issues for a moment for status bit in this email.
>>>>
>>>> Status bit interface helps in following good ways.
>>>> 1. suspend/resume the device fully by the guest by negotiating the new feature.
>>>> This can be useful in the guest-controlled PM flows of suspend/resume.
>>>> I still think for this, only feature bit is necessary, and device_status modification is not needed.
>>>> D0->D3 and D3->D0 transition of the pci can suspend and resume the device which can preserve the last device_status value before entering D3.
>>>> (Like preserving all rest of the fields of common and other device config).
>>>> This is orthogonal and needed regardless of device migration.
>>>>
>>>> 2. If one does not want to passthrough a member device, but build a mediation-based device on top of existing virtio device,
>>>> It can be useful with mediating software.
>>>> Here the mediating software has ample duplicated knowledge of what the member device already has.
>>>> This can fulfil the nested requirement differently provided a platform support it.
>>>> (PASID limitation will be practical blocker here).
>>>>
>>>> How do I plan to address the above two?
>>>> a. #1 to be addressed by having the _F_PM bit, when the bit is negotiated PCI PM drives the state.
>>>> This will work orthogonal to VMM side migration and will co-exist with VMM based device migration.
>>> OK that sounds kind of reasonable. Lingshan, Jason are you interested in
>>> suspend/resume? Want to start a thread on best way to support that?
>> suspend/resume a device through PM? why? is the status bit in my series
>> better?
> I don't know. If we can please stop discussing suspend in live migration
> thread and have one with focus on suspend then maybe that's exactly what
> is needed.  Is there even a problem we need to solve? Do we need to stop
> vqs or is reset as done currently good enough?
I have been asked the question above, so I raise my concern, nothing more.
> Let's focus on that and then hopefully these flamewars can finally stop?
OK
>
>>>> b. nested use case:
>>>> L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
>>>> L1 guest to enable SR-IOV and mapping the VF to L2 guest.
>>>> Consulting industry ecosystem to support nested outside of virtio.
>>> Can't say I like this much, *a lot* of things to implement,
>>> and burning up a VF for control path is not nice.
>>> As an alternative, I suggest a new admin command pci capability
>>> with basically a PA and a valid bit. Easy to emulate and add to
>>> a VF. And maybe some way to suggest a safe place for it that
>>> won't conflict with anything? Still trying to figure out if
>>> we should add PASID in there, or what. Maybe optionally?
>>> If actual hardware does it we'd be burning up 20 bits,
>>> but for a software implementation it's free.
>>>
>

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  8:49                                                       ` Zhu, Lingshan
@ 2023-10-19  8:55                                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  8:55 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 04:49:21PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 10/19/2023 4:37 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 19, 2023 at 04:18:42PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 10/18/2023 6:47 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Oct 18, 2023 at 10:22:57AM +0000, Parav Pandit wrote:
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Wednesday, October 18, 2023 3:26 PM
> > > > > > For completeness, and to shorten the thread, can you please list known
> > > > > > issues/use cases that are addressed by the status bit interface and how you plan
> > > > > > for them to be addressed?
> > > > > I will avoid listing known issues for a moment for status bit in this email.
> > > > > 
> > > > > Status bit interface helps in following good ways.
> > > > > 1. suspend/resume the device fully by the guest by negotiating the new feature.
> > > > > This can be useful in the guest-controlled PM flows of suspend/resume.
> > > > > I still think for this, only feature bit is necessary, and device_status modification is not needed.
> > > > > D0->D3 and D3->D0 transition of the pci can suspend and resume the device which can preserve the last device_status value before entering D3.
> > > > > (Like preserving all rest of the fields of common and other device config).
> > > > > This is orthogonal and needed regardless of device migration.
> > > > > 
> > > > > 2. If one does not want to passthrough a member device, but build a mediation-based device on top of existing virtio device,
> > > > > It can be useful with mediating software.
> > > > > Here the mediating software has ample duplicated knowledge of what the member device already has.
> > > > > This can fulfil the nested requirement differently provided a platform support it.
> > > > > (PASID limitation will be practical blocker here).
> > > > > 
> > > > > How do I plan to address the above two?
> > > > > a. #1 to be addressed by having the _F_PM bit, when the bit is negotiated PCI PM drives the state.
> > > > > This will work orthogonal to VMM side migration and will co-exist with VMM based device migration.
> > > > OK that sounds kind of reasonable. Lingshan, Jason are you interested in
> > > > suspend/resume? Want to start a thread on best way to support that?
> > > suspend/resume a device through PM? why? is the status bit in my series
> > > better?
> > I don't know. If we can please stop discussing suspend in live migration
> > thread and have one with focus on suspend then maybe that's exactly what
> > is needed.  Is there even a problem we need to solve? Do we need to stop
> > vqs or is reset as done currently good enough?
> I have been asked the question above, so I raise my concern, nothing more.
> > Let's focus on that and then hopefully these flamewars can finally stop?
> OK

OK as in there will be a thread for addressing VM suspend or OK as in
it's not really of interest?

-- 
MST



* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  8:31                                               ` Michael S. Tsirkin
@ 2023-10-19  8:58                                                 ` Parav Pandit
  2023-10-19  9:11                                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  8:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 19, 2023 2:01 PM
> 
> On Thu, Oct 19, 2023 at 07:30:09AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, October 19, 2023 12:05 PM
> > >
> > > On Thu, Oct 19, 2023 at 05:31:37AM +0000, Parav Pandit wrote:
> > > > > How could we make any agreement without an accurate the
> > > > > definition of "passthrough" who is a key to understand each other?
> > > >
> > > > I replied few times in past emails but since those email threads
> > > > are so long, it
> > > is easy to miss out.
> > > >
> > > > Passthrough definition:
> > > > a. virtio member device mapped to the guest vm b. only pci config
> > > > space and msix of a member device is intercepted by hypervisor.
> > > > c. virtio config space, virtio cvqs, data vqs of a member device
> > > > is directly
> > > accessed by the guest vm without intercepted by the hypervisor.
> > > >
> > > > (Why b?, no grand reason, it is how the hypervisors are working
> > > > where to
> > > integrate the virtio member device to).
> > >
> > > I think it's a reasonable use-case, though of course not at all the
> > > only way to design a system.
> > Sure, there are more ways to bisect the device, specially when underlying
> device is not a virtio device.
> > But one can continue bisecting virtio as well as you listed below.
> > > Some more ways:
> > > 2- intercept everything except data vqs and cvqs
> > > 	I think this is a reasonable way to build the system and has a bunch
> > > 	of advantages short term. The main disadvantage as compared to
> > > 	passthrough is the need to keep config space coherent with
> > > 	device operation - the way to do it is device specific and
> > > 	might get fragile.
> > >
> > Yes, I agree it has short term advantages.
> > This is not future proof as you listed.
> >
> > > 4- intercept everything except data vqs
> > > 	Here we get another problem in isolating some vqs but not
> > >         others. the problem becomes bigger is that you also
> > > 	need to communicate control vq to the device.
> > >
> > Yes. for non virtio device vendors have easy way to support.
> > We supported this for mlx5 devices.
> >
> > > also, with both of the above options, we have a question of how are
> > > we communicating with the device to keep control path and data path
> > > in sync when device's dma is mapped to guest.
> > > using PASIDs for isolation might work but again, support is far from
> > > universal so we can't really assume it as the only way in the spec.
> > >
> > Right.
> >
> > > Absent PASID the popular way seems to be shadow vq which basically
> > > does
> > >
> > > 4- software intercept for everything
> > >        clearly that's a lot of CPU overhead, I do not think we can focus on that
> > >        as the only way in the spec, though some hypervisors might
> > >        already have a lot of migration overhead to the point where
> > >        virtio can afford any amount of overhead and it won't be
> > >        measureable.
> > >
> > >
> > > I also note some or all of the intercepts can always come and go.
> > > For example, a common setup is that if target VCPUs are running then
> > > IOMMU will inject interrupts directly into guest - if not you
> > > generally trap to hypervisor. Similarly, shadow vq might be active just
> temporarily.
> > >
> > > Which approach is best? I feel ideally virtio would find ways to
> > > support them all rather than deciding on a policy in the spec.
> >
> > Cooking all the modes seems frankly very daunting to me specially when
> > there is no existing software stack to consume all modes and no device
> > vendor to sign of for _all_ variations.
> 
> Not addressing all the modes.  We are building components not stacks.
> Components need to be reusable not stack specific.
> 
> Was the whole admin command interface with its levels of indirection a design
> mistake then? It was designed exactly to support all kind of models.
The admin vq serving multiple use cases, including device migration, demonstrates that it is a good fit.
SR-IOV and SIOV will be able to use it for device migration, provisioning, legacy support and more.
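
As a rough illustration of why one command format can serve several group types, here is a minimal sketch that encodes the device-readable header of an admin command addressed to one member device. The field layout (opcode, group_type, 12 reserved bytes, group_member_id, all little-endian) and the SR-IOV group type value follow my reading of the admin command framework in the spec. The opcode value and all `example_`/`EXAMPLE_` names are hypothetical and exist only for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Group type for SR-IOV groups in the admin command framework. */
#define VIRTIO_ADMIN_GROUP_TYPE_SRIOV 0x1

/* HYPOTHETICAL opcode, used only for illustration; not taken from the spec. */
#define EXAMPLE_ADMIN_CMD_MIGRATION_OP 0x100

/* opcode (2) + group_type (2) + reserved (12) + group_member_id (8) */
#define EXAMPLE_ADMIN_CMD_HDR_LEN 24

/* Store the low n bytes of v at dst in little-endian order. */
static void example_put_le(uint8_t *dst, uint64_t v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Encode the device-readable header of an admin command.
 * For an SR-IOV group, member_id is assumed to be the VF number (1-based).
 * Returns the number of bytes written.
 */
static size_t example_admin_cmd_encode_hdr(uint8_t *buf, uint16_t opcode,
                                           uint16_t group_type,
                                           uint64_t member_id)
{
    memset(buf, 0, EXAMPLE_ADMIN_CMD_HDR_LEN);  /* reserved bytes stay zero */
    example_put_le(buf + 0, opcode, 2);
    example_put_le(buf + 2, group_type, 2);
    example_put_le(buf + 16, member_id, 8);
    return EXAMPLE_ADMIN_CMD_HDR_LEN;
}
```

Only the group_type and group_member_id fields change when the same command targets a different kind of group member, which is the reuse argument made above; the command-specific payload that would follow the header is left out here.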

> >
> > To me, two stacks are practical and common to target at beginning.
> > i.e.
> > 1. passthrough mode
> >
> > 2. #2 above,
> > I had real technical difficulty to make #2 practically work and build a scalable
> device and have converged api with #1.
> > The option we explored to have admin command in some register of the VF
> specific for #2 is partially fine targeted for use case #2 only.
> 
> Right. So - a way to send admin commands to a VF directly, perhaps in config
> space? Do we need more than PA+PASID+some flags?
> Want to try to write something like this up?
> 
It cannot be in the PCI 4K config space for sure.
It must reside in the virtio config space.

I am sure that this is used for passthrough mode of #1.
So, can you please confirm to write this up for mode #2 only?

> > A variation of that for the member device, there is owner device, hence
> admin command on the AQ can be used.
> >
> > If we can converge on common virtio interface between #1 and #2, great.
> > If we cannot be due to technical issues, we shouldn't step on each other's
> toes, instead build the two interfaces for two different use cases overcoming its
> own technical challenges.
> >
> > And when in future, someone want to implement different kind of bisections,
> they can propose the extensions.
> 
> Not good at all, this means the interface is very narrow.
> Your "propose an extension" just doesn't work practically.
> It takes years for things to be widely deployed in the field, by the time they are
> there are more use-cases.

We usually see features getting deployed in under a year at the pace the new spec is advancing.
Building something for an unreasonable amount of time without a use case results in missing the immediate deployments that happen in the 2024 to 2027 time frame of the 1.4 spec.

> We need something universal and admin commands were supposed to be just
> this.
I don't see a universal solution for all problems for above #1 and #2.

Solving #2 above will cover a large part of the deployments that users are doing.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  8:15                                             ` Zhu, Lingshan
@ 2023-10-19  9:01                                               ` Parav Pandit
  2023-10-19  9:09                                                 ` Zhu, Lingshan
  2023-10-19  9:13                                                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  9:01 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, October 19, 2023 1:45 PM
> 
> On 10/18/2023 5:48 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Wednesday, October 18, 2023 2:13 PM
> >>
> >> On 10/18/2023 3:20 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Wednesday, October 18, 2023 12:22 PM
> >>>>
> >>>> On 10/18/2023 2:41 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Wednesday, October 18, 2023 12:06 PM
> >>>>>>
> >>>>>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
> >>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu,
> >>>>>>>> Lingshan
> >>>>>>>> Sent: Monday, October 16, 2023 3:18 PM
> >>>>>>>>
> >>>>>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
> >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> Sent: Friday, October 13, 2023 3:14 PM
> >>>>>>>>>>>>>> How do you transfer the ownership?
> >>>>>>>>>>>>> An additional ownership deletgation by a new admin
> command.
> >>>>>>>>>>>> if you think this can work, do you want to cook a patch to
> >>>>>>>>>>>> implement this before you submitting this live migration series?
> >>>>>>>>>>> I answered this already above.
> >>>>>>>>>> talk is cheap, show me your patch
> >>>>>>>>> Huh. We presented the infrastructure that migrates, 30+ device
> >>>>>>>>> types,
> >>>>>>>> covering device context ideas from Oracle.
> >>>>>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
> >>>>>>>>>
> >>>>>>>>> Please have some respect for other members who covered more
> >>>>>>>>> ground than
> >>>>>>>> your series.
> >>>>>>>>> What more? Apply the same nested concept on the member device
> >>>>>>>>> as
> >>>>>>>> Michael suggested, it is nested virtualization maintain exact
> >>>>>>>> same
> >>>> semantics.
> >>>>>>>>> So a VF is mapped as PF to the L1 guest.
> >>>>>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
> >>>>>>>>>
> >>>>>>>>> This nested work can be extended in future, once first level
> >>>>>>>>> nesting is
> >>>>>>>> covered.
> >>>>>>>>>> Answer all questions above, if you think a management VF can
> >>>>>>>>>> work, please show me your patch.
> >>>>>>>>> The idea evolves from technical debate then pointing fingers
> >>>>>>>>> like your
> >>>>>>>> comment.
> >>>>>>>>> I think a positive discussion with Michael and a pointer to
> >>>>>>>>> the paper from
> >>>>>>>> Jason gave a good direction of doing _right_ nesting that
> >>>>>>>> follows two
> >>>>>> principles.
> >>>>>>>>> a. efficiency property
> >>>>>>>>> b. equivalence property
> >>>>>>>>>
> >>>>>>>>> (c. resource control is natural already)
> >>>>>>>>>
> >>>>>>>>> Both apply at VMM and at VM level enabling recursive
> >>>>>>>>> virtualization, by
> >>>>>>>> having VF that can act as PF inside the guest.
> >>>>>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
> >>>>>>>> Please just show me your patch resolving these opens, how about
> >>>>>>>> start from defining virito-fs device context and your management VF?
> >>>>>>> As answered, device context infrastructure is done, per device
> >>>>>>> specific device-
> >>>>>> context will be defined incrementally.
> >>>>>>> I will not be including virtio-fs in this series. It will be
> >>>>>>> done incrementally in
> >>>>>> future utilizing the infrastructure build in this series.
> >>>>>> Done? How do you conclude this? You just tell me what is the full
> >>>>>> set of virito-fs device context now and how to migrate them.
> >>>>>>
> >>>>>> You cant? you refuse or you don't? Do you expect the HW designer
> >>>>>> to figure out by themself?
> >>>>> I wont be able to tell now as I don’t think it is necessary for this series.
> >>>>> If one out of 30 devices cannot migrate because of unimaginable
> >>>>> amount of
> >>>> complexity has been placed there, may be one will not implement it
> >>>> as member device.
> >>>>>    From experience of migratable complex gpu devices, rdma devices
> >>>>> (stateful
> >>>> having hundred thousand of stateful QPs), my understanding is
> >>>> complex state of virtio-fs can be defined and migratable.
> >>>>> Mlx5 driver consist of 150,000 lines of code and that device is
> >>>>> migratable
> >>>> with complex state.
> >>>>> So I am optimistic that virtio-fs can be migratable too.
> >>>>> It does not have to limited by my limited creativity of 2023.
> >>>>> May be I am wrong, in that case one will not implement passthrough
> >>>>> virtio-fs
> >>>> device.
> >>>> your series wants to migrate device context, but doesn't define
> >>>> device context, does this sounds reasonable?
> >>> Device generic context is defined at [1] and also the infrastructure
> >>> for defining
> >> the device context in parallel by multiple people can be done post
> >> the work of [1].
> >>> Per each device type context will be defined incrementally post this work.
> >>>
> >>> [1]
> >>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html
> >> This is not post of the work, you should define them before you use
> >> them in this series.
> >>
> > I don’t agree to cook ocean in this patch series.
> > No practical spec devel community does it.
> > As long as we feel comfortable that device context framework is extendible, it
> is fine.
> > If virtio-fs seems very hard, may be one will come with a new light weight FS
> device. I really don’t know.
> so you want to migrate device context, but refuse to define them?
> >
> >> And you need to prove why admin vq are better than registers solution
> >> if you want a merge.
> > Michael already responded the practical aspects.
> > Since you may claim, I didn’t answer, below is the technical details.
> >
> > Why admin commands and aq is better is because of below reasons in my
> view:
> >
> > Functionally better:
> > 1. When the live migration registers are located on the VF itself, VMM does
> not have control of it.
> > These registers reset, on FLR and device reset because these are virtio
> registers of the device.
> > Hence, VMM lost the state for the job that VMM was supposed to do.
> > Therefore, passthrough mode cannot depend on these registers.
> >
> > 2. Any bulk data transfer of device context and dirty page tracking requires
> DMA.
> > Hence those DMA must happen to the device which is different than VF itself.
> > If it is on the VF itself, it has two problems.
> > 2.a. VF device reset and FLR will clear them, and device context is lost.
> >
> > 2.b. the DMA occurs at the PCI RID level.
> > IOMMU cannot bifurcate the DMA of one RID to two different address space
> of guest and hypervisor.
> > This requires PASID support.
> > Using PASID has following problems.
> > 2.b.1 PASID typically not used by the kernel software. It is only meant for the
> user processes.
> > Hence for kernel work a reserving PASID won't be acceptable upstream
> kernel.
> > 2.b.2 Somehow if this is done, When the VF itself supports PASID, it required
> now vPASID support.
> > This is again not where industry is going in other forums where I am part of.
> Hence, it will be failure for virtio. Hence, I do not recommend vPASID route.
> > 2.b.3 One of the widely used cpu seems to have dropped the support due to
> limitation of an instruction around PASID.
> > So it cannot be used there, this further limits virtio passthrough users.
> >
> > Even if somehow 2.b.2 and 2.b.3 is overcome in theory, #1 and 2.a is
> functional problems.
> >
> > Scale wise better:
> > 3. Admin command and admin vq are used _only_ when one does device
> migration command.
> > One does not migrate VMs every few msec.
> > Hence such functionality to be better be done which is efficient for
> performance, but without consuming on-chip memory.
> > Admin command and admin vq satisfy those.
> >
> > 4. Once the software matures further, admin command would prefer
> completion interrupt, instead of poll.
> > How to get notification/interrupt? Well, virtqueue defines this already.
> > Should we replicate that in some PF registers?
> > It can be. But once you put all the functionalities of admin command and aq
> in registers the whole thing becomes yet another register_q.
> >
> > 5. Can these registers be placed in the PF to overcome #1 and #2 for
> passthrough?
> > In theory yes.
> > In practice, no, as there are many commands that flow, which needs to scale
> to reasonable number of VFs.
> > Admin commands over admin vq provides this generic facility.
> >
> > 6. Most modern devices who attempts to scale, cut down their register
> footprint, registers are used only for main bootstap, init time config work.
> > Even in virtio spec, one can read:
> > "Device configuration space is generally used for rarely changing or
> initialization-time parameters."
> >
> > Adding some additional registers to a PF device config space for non init time
> parameters does not make sense.
> >
> > 7. Additionally, a nested virtualization should be done by truly nesting the
> device at right abstraction point of owner-member relationship.
> > This follows two principles of (a) efficiency and (b) equivalency of what Jason
> paper pointed.
> > And we ask for nested VF extension we will get our guidance from PCI-SIG, of
> why it should be done if it is matching with rest of the ecosystem components
> that support/don’t support the nesting.
> It they are true, shall we refactor virtio-pci common cfg functionalities to use
> admin vq?
For a non-backward-compatible SIOV device of the future, yes: the virtio-pci common config (non-init registers) should be moved to a vq located directly on the member device.
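
Points 1 and 2.a quoted above can be pictured with a toy model: migration control kept in the member's own registers is wiped by a function level reset, while state kept on the owner side is untouched. This is a sketch under stated assumptions, not an interface proposal; every register, struct and function name below is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define EXAMPLE_NUM_VFS 4

/* HYPOTHETICAL member (VF) registers; migration_ctl is not a real virtio register. */
struct example_vf_regs {
    uint8_t device_status;
    uint8_t migration_ctl;
};

/* HYPOTHETICAL owner (PF) side bookkeeping, indexed by VF number - 1. */
struct example_owner_state {
    uint8_t member_mode[EXAMPLE_NUM_VFS];
};

/* A function level reset returns the member's own registers to their defaults. */
static void example_vf_flr(struct example_vf_regs *vf)
{
    memset(vf, 0, sizeof(*vf));
}

/* Option A: the migration mode is written to a register on the VF itself. */
static void example_set_mode_on_vf(struct example_vf_regs *vf, uint8_t mode)
{
    vf->migration_ctl = mode;
}

/*
 * Option B: the migration mode is recorded by the owner on behalf of the
 * member, as an admin command handled by the owner device would do.
 * Returns 0 on success, -1 if vf_number is out of range.
 */
static int example_set_mode_on_owner(struct example_owner_state *owner,
                                     unsigned int vf_number, uint8_t mode)
{
    if (vf_number < 1 || vf_number > EXAMPLE_NUM_VFS)
        return -1;
    owner->member_mode[vf_number - 1] = mode;
    return 0;
}
```

If the guest triggers an FLR after both options have set a mode, the value written through option A reads back as zero, while the value recorded through option B is still there for the hypervisor to continue the migration.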

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  9:01                                               ` Parav Pandit
@ 2023-10-19  9:09                                                 ` Zhu, Lingshan
  2023-10-19  9:13                                                   ` Parav Pandit
  2023-10-19  9:13                                                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-19  9:09 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/19/2023 5:01 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, October 19, 2023 1:45 PM
>>
>> On 10/18/2023 5:48 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Wednesday, October 18, 2023 2:13 PM
>>>>
>>>> On 10/18/2023 3:20 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Wednesday, October 18, 2023 12:22 PM
>>>>>>
>>>>>> On 10/18/2023 2:41 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Wednesday, October 18, 2023 12:06 PM
>>>>>>>>
>>>>>>>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
>>>>>>>>>> Lingshan
>>>>>>>>>> Sent: Monday, October 16, 2023 3:18 PM
>>>>>>>>>>
>>>>>>>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> Sent: Friday, October 13, 2023 3:14 PM
>>>>>>>>>>>>>>>> How do you transfer the ownership?
>>>>>>>>>>>>>>> An additional ownership deletgation by a new admin
>> command.
>>>>>>>>>>>>>> if you think this can work, do you want to cook a patch to
>>>>>>>>>>>>>> implement this before you submitting this live migration series?
>>>>>>>>>>>>> I answered this already above.
>>>>>>>>>>>> talk is cheap, show me your patch
>>>>>>>>>>> Huh. We presented the infrastructure that migrates, 30+ device
>>>>>>>>>>> types,
>>>>>>>>>> covering device context ideas from Oracle.
>>>>>>>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
>>>>>>>>>>>
>>>>>>>>>>> Please have some respect for other members who covered more
>>>>>>>>>>> ground than
>>>>>>>>>> your series.
>>>>>>>>>>> What more? Apply the same nested concept on the member device
>>>>>>>>>>> as
>>>>>>>>>> Michael suggested, it is nested virtualization maintain exact
>>>>>>>>>> same
>>>>>> semantics.
>>>>>>>>>>> So a VF is mapped as PF to the L1 guest.
>>>>>>>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
>>>>>>>>>>>
>>>>>>>>>>> This nested work can be extended in future, once first level
>>>>>>>>>>> nesting is
>>>>>>>>>> covered.
>>>>>>>>>>>> Answer all questions above, if you think a management VF can
>>>>>>>>>>>> work, please show me your patch.
>>>>>>>>>>> The idea evolves from technical debate then pointing fingers
>>>>>>>>>>> like your
>>>>>>>>>> comment.
>>>>>>>>>>> I think a positive discussion with Michael and a pointer to
>>>>>>>>>>> the paper from
>>>>>>>>>> Jason gave a good direction of doing _right_ nesting that
>>>>>>>>>> follows two
>>>>>>>> principles.
>>>>>>>>>>> a. efficiency property
>>>>>>>>>>> b. equivalence property
>>>>>>>>>>>
>>>>>>>>>>> (c. resource control is natural already)
>>>>>>>>>>>
>>>>>>>>>>> Both apply at VMM and at VM level enabling recursive
>>>>>>>>>>> virtualization, by
>>>>>>>>>> having VF that can act as PF inside the guest.
>>>>>>>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
>>>>>>>>>> Please just show me your patch resolving these opens, how about
>>>>>>>>>> start from defining virito-fs device context and your management VF?
>>>>>>>>> As answered, device context infrastructure is done, per device
>>>>>>>>> specific device-
>>>>>>>> context will be defined incrementally.
>>>>>>>>> I will not be including virtio-fs in this series. It will be
>>>>>>>>> done incrementally in
>>>>>>>> future utilizing the infrastructure build in this series.
>>>>>>>> Done? How do you conclude this? You just tell me what is the full
>>>>>>>> set of virito-fs device context now and how to migrate them.
>>>>>>>>
>>>>>>>> You cant? you refuse or you don't? Do you expect the HW designer
>>>>>>>> to figure out by themself?
>>>>>>> I wont be able to tell now as I don’t think it is necessary for this series.
>>>>>>> If one out of 30 devices cannot migrate because of unimaginable
>>>>>>> amount of
>>>>>> complexity has been placed there, may be one will not implement it
>>>>>> as member device.
>>>>>>>     From experience of migratable complex gpu devices, rdma devices
>>>>>>> (stateful
>>>>>> having hundred thousand of stateful QPs), my understanding is
>>>>>> complex state of virtio-fs can be defined and migratable.
>>>>>>> Mlx5 driver consist of 150,000 lines of code and that device is
>>>>>>> migratable
>>>>>> with complex state.
>>>>>>> So I am optimistic that virtio-fs can be migratable too.
>>>>>>> It does not have to limited by my limited creativity of 2023.
>>>>>>> May be I am wrong, in that case one will not implement passthrough
>>>>>>> virtio-fs
>>>>>> device.
>>>>>> your series wants to migrate device context, but doesn't define
>>>>>> device context, does this sounds reasonable?
>>>>> Device generic context is defined at [1] and also the infrastructure
>>>>> for defining
>>>> the device context in parallel by multiple people can be done post
>>>> the work of [1].
>>>>> Per each device type context will be defined incrementally post this work.
>>>>>
>>>>> [1]
>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html
>>>> This is not post of the work, you should define them before you use
>>>> them in this series.
>>>>
>>> I don’t agree to cook ocean in this patch series.
>>> No practical spec devel community does it.
>>> As long as we feel comfortable that device context framework is extendible, it
>> is fine.
>>> If virtio-fs seems very hard, may be one will come with a new light weight FS
>> device. I really don’t know.
>> so you want to migrate device context, but refuse to define them?
>>>> And you need to prove why admin vq are better than registers solution
>>>> if you want a merge.
>>> Michael already responded the practical aspects.
>>> Since you may claim, I didn’t answer, below is the technical details.
>>>
>>> Why admin commands and aq is better is because of below reasons in my
>> view:
>>> Functionally better:
>>> 1. When the live migration registers are located on the VF itself, VMM does
>> not have control of it.
>>> These registers reset, on FLR and device reset because these are virtio
>> registers of the device.
>>> Hence, VMM lost the state for the job that VMM was supposed to do.
>>> Therefore, passthrough mode cannot depend on these registers.
>>>
>>> 2. Any bulk data transfer of device context and dirty page tracking requires
>> DMA.
>>> Hence those DMA must happen to the device which is different than VF itself.
>>> If it is on the VF itself, it has two problems.
>>> 2.a. VF device reset and FLR will clear them, and device context is lost.
>>>
>>> 2.b. the DMA occurs at the PCI RID level.
>>> IOMMU cannot bifurcate the DMA of one RID to two different address space
>> of guest and hypervisor.
>>> This requires PASID support.
>>> Using PASID has following problems.
>>> 2.b.1 PASID typically not used by the kernel software. It is only meant for the
>> user processes.
>>> Hence for kernel work a reserving PASID won't be acceptable upstream
>> kernel.
>>> 2.b.2 Somehow if this is done, When the VF itself supports PASID, it required
>> now vPASID support.
>>> This is again not where industry is going in other forums where I am part of.
>> Hence, it will be failure for virtio. Hence, I do not recommend vPASID route.
>>> 2.b.3 One of the widely used cpu seems to have dropped the support due to
>> limitation of an instruction around PASID.
>>> So it cannot be used there, this further limits virtio passthrough users.
>>>
>>> Even if somehow 2.b.2 and 2.b.3 is overcome in theory, #1 and 2.a is
>> functional problems.
>>> Scale wise better:
>>> 3. Admin command and admin vq are used _only_ when one does device
>> migration command.
>>> One does not migrate VMs every few msec.
>>> Hence such functionality to be better be done which is efficient for
>> performance, but without consuming on-chip memory.
>>> Admin command and admin vq satisfy those.
>>>
>>> 4. Once the software matures further, admin command would prefer
>> completion interrupt, instead of poll.
>>> How to get notification/interrupt? Well, virtqueue defines this already.
>>> Should we replicate that in some PF registers?
>>> It can be. But once you put all the functionalities of admin command and aq
>> in registers the whole thing becomes yet another register_q.
>>> 5. Can these registers be placed in the PF to overcome #1 and #2 for
>> passthrough?
>>> In theory yes.
>>> In practice, no, as there are many commands that flow, which needs to scale
>> to reasonable number of VFs.
>>> Admin commands over admin vq provides this generic facility.
>>>
>>> 6. Most modern devices who attempts to scale, cut down their register
>> footprint, registers are used only for main bootstap, init time config work.
>>> Even in virtio spec, one can read:
>>> "Device configuration space is generally used for rarely changing or
>> initialization-time parameters."
>>> Adding some additional registers to a PF device config space for non init time
>> parameters does not make sense.
>>> 7. Additionally, a nested virtualization should be done by truly nesting the
>> device at right abstraction point of owner-member relationship.
>>> This follows two principles of (a) efficiency and (b) equivalency of what Jason
>> paper pointed.
>>> And we ask for nested VF extension we will get our guidance from PCI-SIG, of
>> why it should be done if it is matching with rest of the ecosystem components
>> that support/don’t support the nesting.
>> It they are true, shall we refactor virtio-pci common cfg functionalities to use
>> admin vq?
> For non-backward compatible SIOV device of the future, yes, virtio-pci common config (non init registers) should be moved to a vq, located on the member device directly.
Oh, really? Quite interesting. Do you want to move all config space
fields in the VF to the admin vq? Do you have a plan?


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  8:58                                                 ` Parav Pandit
@ 2023-10-19  9:11                                                   ` Michael S. Tsirkin
  2023-10-19  9:20                                                     ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  9:11 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 08:58:10AM +0000, Parav Pandit wrote:
> It cannot be in the PCI 4K config space for sure.
> It must reside in the virtio config space.

Why? My concern with virtio config space would be
that it's not orthogonal to other things in the
config space. E.g. you need to look at feature bits
to discover presence. How does this work while you are
documenting that device can undergo reset at any time?
With a capability you can discover it without poking at features.
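
A minimal sketch of what capability-based discovery looks like, assuming a 256-byte snapshot of PCI configuration space: the walk needs only the standard capability list and the cfg_type byte of the virtio vendor-specific capability, and never touches feature bits or device status. The cfg_type value for a hypothetical admin command window is invented for this example, as are the `example_`/`EXAMPLE_` names.

```c
#include <stdint.h>

#define PCI_STATUS           0x06  /* status register, low byte */
#define PCI_STATUS_CAP_LIST  0x10  /* capability list present */
#define PCI_CAPABILITY_LIST  0x34  /* offset of first capability pointer */
#define PCI_CAP_ID_VNDR      0x09  /* vendor-specific capability ID */

/* HYPOTHETICAL cfg_type for an admin command window; not defined by the spec. */
#define EXAMPLE_VIRTIO_PCI_CAP_ADMIN_WINDOW 0x40

/*
 * Walk the capability list in a 256-byte config space snapshot and return
 * the offset of the first vendor-specific capability whose cfg_type byte
 * (offset 3 in struct virtio_pci_cap) matches, or -1 if there is none.
 */
static int example_find_virtio_cap(const uint8_t *cfg, uint8_t cfg_type)
{
    int ttl = 48;  /* bound the walk in case the list is malformed or loops */
    uint8_t pos;

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return -1;

    pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
    while (pos >= 0x40 && ttl-- > 0) {
        uint8_t id = cfg[pos];
        uint8_t next = cfg[pos + 1] & 0xfc;

        if (id == PCI_CAP_ID_VNDR && cfg[pos + 3] == cfg_type)
            return pos;
        pos = next;
    }
    return -1;
}
```

Because the lookup depends only on PCI configuration space, it gives the same answer before feature negotiation and after a device reset, which is the orthogonality argument made above.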

> I am sure that this is used for passthrough mode of #1.
> So, can you please confirm to write this up for mode #2 only?

To me it sounds like a generally useful capability that could be
used as basis e.g. for admin command transport.

> > > A variation of that for the member device, there is owner device, hence
> > admin command on the AQ can be used.
> > >
> > > If we can converge on common virtio interface between #1 and #2, great.
> > > If we cannot be due to technical issues, we shouldn't step on each other's
> > toes, instead build the two interfaces for two different use cases overcoming its
> > own technical challenges.
> > >
> > > And when in future, someone want to implement different kind of bisections,
> > they can propose the extensions.
> > 
> > Not good at all, this means the interface is very narrow.
> > Your "propose an extension" just doesn't work practically.
> > It takes years for things to be widely deployed in the field, by the time they are
> > there are more use-cases.
> 
> We usually see it getting deployed in < 1 year time with new spec advancement pace for many features.
> Building something for unreasonable amount of time without use case results in missing the immediate deployments that happens in 2024 to 2027 of 1.4 spec time frame.
> 
> > We need something universal and admin commands were supposed to be just
> > this.
> I don't see a universal solution for all problems for above #1 and #2.
> 
> Solving above #2 will cover large part of deployments that users are doing.

OK. But additionally, if an interface can cover a couple of use-cases
we can be reasonably sure it's going to cover more going forward.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  9:09                                                 ` Zhu, Lingshan
@ 2023-10-19  9:13                                                   ` Parav Pandit
  2023-10-19  9:14                                                     ` Michael S. Tsirkin
  2023-10-19  9:16                                                     ` Zhu, Lingshan
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  9:13 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, October 19, 2023 2:40 PM
> 
> On 10/19/2023 5:01 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Thursday, October 19, 2023 1:45 PM
> >>
> >> On 10/18/2023 5:48 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Wednesday, October 18, 2023 2:13 PM
> >>>>
> >>>> On 10/18/2023 3:20 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Wednesday, October 18, 2023 12:22 PM
> >>>>>>
> >>>>>> On 10/18/2023 2:41 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Wednesday, October 18, 2023 12:06 PM
> >>>>>>>>
> >>>>>>>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
> >>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
> >>>>>>>>>> Lingshan
> >>>>>>>>>> Sent: Monday, October 16, 2023 3:18 PM
> >>>>>>>>>>
> >>>>>>>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
> >>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> Sent: Friday, October 13, 2023 3:14 PM
> >>>>>>>>>>>>>>>> How do you transfer the ownership?
> >>>>>>>>>>>>>>> An additional ownership deletgation by a new admin
> >> command.
> >>>>>>>>>>>>>> if you think this can work, do you want to cook a patch
> >>>>>>>>>>>>>> to implement this before you submitting this live migration
> series?
> >>>>>>>>>>>>> I answered this already above.
> >>>>>>>>>>>> talk is cheap, show me your patch
> >>>>>>>>>>> Huh. We presented the infrastructure that migrates, 30+
> >>>>>>>>>>> device types,
> >>>>>>>>>> covering device context ideas from Oracle.
> >>>>>>>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
> >>>>>>>>>>>
> >>>>>>>>>>> Please have some respect for other members who covered more
> >>>>>>>>>>> ground than
> >>>>>>>>>> your series.
> >>>>>>>>>>> What more? Apply the same nested concept on the member
> >>>>>>>>>>> device as
> >>>>>>>>>> Michael suggested, it is nested virtualization maintain exact
> >>>>>>>>>> same
> >>>>>> semantics.
> >>>>>>>>>>> So a VF is mapped as PF to the L1 guest.
> >>>>>>>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
> >>>>>>>>>>>
> >>>>>>>>>>> This nested work can be extended in future, once first level
> >>>>>>>>>>> nesting is
> >>>>>>>>>> covered.
> >>>>>>>>>>>> Answer all questions above, if you think a management VF
> >>>>>>>>>>>> can work, please show me your patch.
> >>>>>>>>>>> The idea evolves from technical debate then pointing fingers
> >>>>>>>>>>> like your
> >>>>>>>>>> comment.
> >>>>>>>>>>> I think a positive discussion with Michael and a pointer to
> >>>>>>>>>>> the paper from
> >>>>>>>>>> Jason gave a good direction of doing _right_ nesting that
> >>>>>>>>>> follows two
> >>>>>>>> principles.
> >>>>>>>>>>> a. efficiency property
> >>>>>>>>>>> b. equivalence property
> >>>>>>>>>>>
> >>>>>>>>>>> (c. resource control is natural already)
> >>>>>>>>>>>
> >>>>>>>>>>> Both apply at VMM and at VM level enabling recursive
> >>>>>>>>>>> virtualization, by
> >>>>>>>>>> having VF that can act as PF inside the guest.
> >>>>>>>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
> >>>>>>>>>> Please just show me your patch resolving these opens, how
> >>>>>>>>>> about start from defining virito-fs device context and your
> management VF?
> >>>>>>>>> As answered, device context infrastructure is done, per device
> >>>>>>>>> specific device-
> >>>>>>>> context will be defined incrementally.
> >>>>>>>>> I will not be including virtio-fs in this series. It will be
> >>>>>>>>> done incrementally in
> >>>>>>>> future utilizing the infrastructure build in this series.
> >>>>>>>> Done? How do you conclude this? You just tell me what is the
> >>>>>>>> full set of virito-fs device context now and how to migrate them.
> >>>>>>>>
> >>>>>>>> You cant? you refuse or you don't? Do you expect the HW
> >>>>>>>> designer to figure out by themself?
> >>>>>>> I wont be able to tell now as I don’t think it is necessary for this series.
> >>>>>>> If one out of 30 devices cannot migrate because of unimaginable
> >>>>>>> amount of
> >>>>>> complexity has been placed there, may be one will not implement
> >>>>>> it as member device.
> >>>>>>>     From experience of migratable complex gpu devices, rdma
> >>>>>>> devices (stateful
> >>>>>> having hundred thousand of stateful QPs), my understanding is
> >>>>>> complex state of virtio-fs can be defined and migratable.
> >>>>>>> Mlx5 driver consist of 150,000 lines of code and that device is
> >>>>>>> migratable
> >>>>>> with complex state.
> >>>>>>> So I am optimistic that virtio-fs can be migratable too.
> >>>>>>> It does not have to limited by my limited creativity of 2023.
> >>>>>>> May be I am wrong, in that case one will not implement
> >>>>>>> passthrough virtio-fs
> >>>>>> device.
> >>>>>> your series wants to migrate device context, but doesn't define
> >>>>>> device context, does this sounds reasonable?
> >>>>> Device generic context is defined at [1] and also the
> >>>>> infrastructure for defining
> >>>> the device context in parallel by multiple people can be done post
> >>>> the work of [1].
> >>>>> Per each device type context will be defined incrementally post this
> work.
> >>>>>
> >>>>> [1]
> >>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg001
> >>>>> 90
> >>>>> .h
> >>>>> tml
> >>>> This is not post of the work, you should define them before you use
> >>>> them in this series.
> >>>>
> >>> I don’t agree to cook ocean in this patch series.
> >>> No practical spec devel community does it.
> >>> As long as we feel comfortable that device context framework is
> >>> extendible, it
> >> is fine.
> >>> If virtio-fs seems very hard, may be one will come with a new light
> >>> weight FS
> >> device. I really don’t know.
> >> so you want to migrate device context, but refuse to define them?
> >>>> And you need to prove why admin vq are better than registers
> >>>> solution if you want a merge.
> >>> Michael already responded the practical aspects.
> >>> Since you may claim, I didn’t answer, below is the technical details.
> >>>
> >>> Why admin commands and aq is better is because of below reasons in
> >>> my
> >> view:
> >>> Functionally better:
> >>> 1. When the live migration registers are located on the VF itself,
> >>> VMM does
> >> not have control of it.
> >>> These registers reset, on FLR and device reset because these are
> >>> virtio
> >> registers of the device.
> >>> Hence, VMM lost the state for the job that VMM was supposed to do.
> >>> Therefore, passthrough mode cannot depend on these registers.
> >>>
> >>> 2. Any bulk data transfer of device context and dirty page tracking
> >>> requires
> >> DMA.
> >>> Hence those DMA must happen to the device which is different than VF
> itself.
> >>> If it is on the VF itself, it has two problems.
> >>> 2.a. VF device reset and FLR will clear them, and device context is lost.
> >>>
> >>> 2.b. the DMA occurs at the PCI RID level.
> >>> IOMMU cannot bifurcate the DMA of one RID to two different address
> >>> space
> >> of guest and hypervisor.
> >>> This requires PASID support.
> >>> Using PASID has following problems.
> >>> 2.b.1 PASID typically not used by the kernel software. It is only
> >>> meant for the
> >> user processes.
> >>> Hence for kernel work a reserving PASID won't be acceptable upstream
> >> kernel.
> >>> 2.b.2 Somehow if this is done, When the VF itself supports PASID, it
> >>> required
> >> now vPASID support.
> >>> This is again not where industry is going in other forums where I am part of.
> >> Hence, it will be failure for virtio. Hence, I do not recommend vPASID route.
> >>> 2.b.3 One of the widely used cpu seems to have dropped the support
> >>> due to
> >> limitation of an instruction around PASID.
> >>> So it cannot be used there, this further limits virtio passthrough users.
> >>>
> >>> Even if somehow 2.b.2 and 2.b.3 is overcome in theory, #1 and 2.a is
> >> functional problems.
> >>> Scale wise better:
> >>> 3. Admin command and admin vq are used _only_ when one does device
> >> migration command.
> >>> One does not migrate VMs every few msec.
> >>> Hence such functionality to be better be done which is efficient for
> >> performance, but without consuming on-chip memory.
> >>> Admin command and admin vq satisfy those.
> >>>
> >>> 4. Once the software matures further, admin command would prefer
> >> completion interrupt, instead of poll.
> >>> How to get notification/interrupt? Well, virtqueue defines this already.
> >>> Should we replicate that in some PF registers?
> >>> It can be. But once you put all the functionalities of admin command
> >>> and aq
> >> in registers the whole thing becomes yet another register_q.
> >>> 5. Can these registers be placed in the PF to overcome #1 and #2 for
> >> passthrough?
> >>> In theory yes.
> >>> In practice, no, as there are many commands that flow, which needs
> >>> to scale
> >> to reasonable number of VFs.
> >>> Admin commands over admin vq provides this generic facility.
> >>>
> >>> 6. Most modern devices who attempts to scale, cut down their
> >>> register
> >> footprint, registers are used only for main bootstap, init time config work.
> >>> Even in virtio spec, one can read:
> >>> "Device configuration space is generally used for rarely changing or
> >> initialization-time parameters."
> >>> Adding some additional registers to a PF device config space for non
> >>> init time
> >> parameters does not make sense.
> >>> 7. Additionally, a nested virtualization should be done by truly
> >>> nesting the
> >> device at right abstraction point of owner-member relationship.
> >>> This follows two principles of (a) efficiency and (b) equivalency of
> >>> what Jason
> >> paper pointed.
> >>> And we ask for nested VF extension we will get our guidance from
> >>> PCI-SIG, of
> >> why it should be done if it is matching with rest of the ecosystem
> >> components that support/don’t support the nesting.
> >> It they are true, shall we refactor virtio-pci common cfg
> >> functionalities to use admin vq?
> > For non-backward compatible SIOV device of the future, yes, virtio-pci
> common config (non init registers) should be moved to a vq, located on the
> member device directly.
> Oh, really? Quite interesting, do you want to move all config space fields in VF
> to admin vq? Have a plan?
Not in my plan for the spec 1.4 time frame.
I do not want to divert the discussion; I would like to focus on the device migration phases.
Let's please discuss this in a separate, dedicated thread.
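
For readers following point 2.b in the quoted text above, the RID
limitation can be pictured with a toy model. This is only a sketch under
simplifying assumptions: the class, the RID value, and the domain names are
hypothetical, and a real IOMMU is far more involved. It shows that a lookup
keyed by RID alone yields one address space per VF, while a lookup keyed by
(RID, PASID) can separate guest and hypervisor DMA.

```python
class ToyIommu:
    """Maps a DMA requester to an address space (translation domain)."""

    def __init__(self, use_pasid=False):
        self.use_pasid = use_pasid
        self.domains = {}

    def _key(self, rid, pasid):
        # Without PASID the requester ID is the only selector.
        return (rid, pasid) if self.use_pasid else rid

    def attach(self, rid, domain, pasid=None):
        self.domains[self._key(rid, pasid)] = domain

    def domain_for_dma(self, rid, pasid=None):
        return self.domains[self._key(rid, pasid)]


VF_RID = 0x0310  # hypothetical bus/device/function of the passthrough VF

# RID-only translation: the second attach replaces the first, so guest
# and hypervisor DMA from the same VF cannot use different address spaces.
rid_only = ToyIommu(use_pasid=False)
rid_only.attach(VF_RID, "guest")
rid_only.attach(VF_RID, "hypervisor")

# PASID-aware translation: the same RID can select two address spaces.
with_pasid = ToyIommu(use_pasid=True)
with_pasid.attach(VF_RID, "guest")
with_pasid.attach(VF_RID, "hypervisor", pasid=1)
```

In the RID-only model every DMA from the VF resolves to a single domain,
which is why the argument above places the migration DMA on a different
device (the owner PF) instead.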

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  9:01                                               ` Parav Pandit
  2023-10-19  9:09                                                 ` Zhu, Lingshan
@ 2023-10-19  9:13                                                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  9:13 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 09:01:10AM +0000, Parav Pandit wrote:
> > > 7. Additionally, a nested virtualization should be done by truly nesting the
> > device at right abstraction point of owner-member relationship.
> > > This follows two principles of (a) efficiency and (b) equivalency of what Jason
> > paper pointed.
> > > And we ask for nested VF extension we will get our guidance from PCI-SIG, of
> > why it should be done if it is matching with rest of the ecosystem components
> > that support/don’t support the nesting.
> > It they are true, shall we refactor virtio-pci common cfg functionalities to use
> > admin vq?
> For non-backward compatible SIOV device of the future, yes, virtio-pci common config (non init registers) should be moved to a vq, located on the member device directly.

I doubt it can be a vq though. This is why I'm asking for a simple way
to send admin commands to the device. Then non-backward compatible SIOV can
hopefully be built on top of that.
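
As a reference point for what "sending an admin command" involves,
whatever transport carries it, here is a minimal Python sketch of the
admin command header and status layout as I understand it from the virtio
1.3 specification (opcode, group_type, 12 reserved bytes, group_member_id,
then status, status_qualifier and 4 reserved bytes). The opcode value
below is a hypothetical placeholder and not one defined by the spec or by
this series; the helper names are also made up.

```python
import struct

VIRTIO_ADMIN_GROUP_TYPE_SRIOV = 1    # SR-IOV group type
HYPOTHETICAL_OPCODE_SUSPEND = 0x100  # placeholder, not from any spec

# Device-readable part: le16 opcode, le16 group_type,
# u8 reserved1[12], le64 group_member_id
_HDR = struct.Struct("<HH12xQ")
# Device-writable part: le16 status, le16 status_qualifier, u8 reserved2[4]
_STATUS = struct.Struct("<HH4x")


def build_admin_cmd(opcode, member_id, payload=b""):
    """Pack a command aimed at one member (for SR-IOV, the VF number)."""
    hdr = _HDR.pack(opcode, VIRTIO_ADMIN_GROUP_TYPE_SRIOV, member_id)
    return hdr + payload


def parse_admin_status(buf):
    """Return (status, status_qualifier) from the device-writable part."""
    return _STATUS.unpack_from(buf)


cmd = build_admin_cmd(HYPOTHETICAL_OPCODE_SUSPEND, member_id=3)
```

The same bytes could in principle be carried by an admin virtqueue on the
owner device or by some other transport to the device itself; the sketch
only fixes the command format, not how it is delivered.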




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  9:13                                                   ` Parav Pandit
@ 2023-10-19  9:14                                                     ` Michael S. Tsirkin
  2023-10-19  9:18                                                       ` Zhu, Lingshan
  2023-10-19  9:16                                                     ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  9:14 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> > Oh, really? Quite interesting, do you want to move all config space fields in VF
> > to admin vq? Have a plan?
> Not in my plan for spec 1.4 time frame.
> I do not want to divert the discussion, would like to focus on device migration phases.
> Lets please discuss in some other dedicated thread.

Possibly, if there's a way to send admin commands to the VF itself, then
Lingshan will be happy?

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  9:13                                                   ` Parav Pandit
  2023-10-19  9:14                                                     ` Michael S. Tsirkin
@ 2023-10-19  9:16                                                     ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-19  9:16 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



On 10/19/2023 5:13 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, October 19, 2023 2:40 PM
>>
>> On 10/19/2023 5:01 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Thursday, October 19, 2023 1:45 PM
>>>>
>>>> On 10/18/2023 5:48 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Wednesday, October 18, 2023 2:13 PM
>>>>>>
>>>>>> On 10/18/2023 3:20 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Wednesday, October 18, 2023 12:22 PM
>>>>>>>>
>>>>>>>> On 10/18/2023 2:41 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Wednesday, October 18, 2023 12:06 PM
>>>>>>>>>>
>>>>>>>>>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
>>>>>>>>>>>> Lingshan
>>>>>>>>>>>> Sent: Monday, October 16, 2023 3:18 PM
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
>>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>> Sent: Friday, October 13, 2023 3:14 PM
>>>>>>>>>>>>>>>>>> How do you transfer the ownership?
>>>>>>>>>>>>>>>>> An additional ownership deletgation by a new admin
>>>> command.
>>>>>>>>>>>>>>>> if you think this can work, do you want to cook a patch
>>>>>>>>>>>>>>>> to implement this before you submitting this live migration
>> series?
>>>>>>>>>>>>>>> I answered this already above.
>>>>>>>>>>>>>> talk is cheap, show me your patch
>>>>>>>>>>>>> Huh. We presented the infrastructure that migrates, 30+
>>>>>>>>>>>>> device types,
>>>>>>>>>>>> covering device context ideas from Oracle.
>>>>>>>>>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please have some respect for other members who covered more
>>>>>>>>>>>>> ground than
>>>>>>>>>>>> your series.
>>>>>>>>>>>>> What more? Apply the same nested concept on the member
>>>>>>>>>>>>> device as
>>>>>>>>>>>> Michael suggested, it is nested virtualization maintain exact
>>>>>>>>>>>> same
>>>>>>>> semantics.
>>>>>>>>>>>>> So a VF is mapped as PF to the L1 guest.
>>>>>>>>>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This nested work can be extended in future, once first level
>>>>>>>>>>>>> nesting is
>>>>>>>>>>>> covered.
>>>>>>>>>>>>>> Answer all questions above, if you think a management VF
>>>>>>>>>>>>>> can work, please show me your patch.
>>>>>>>>>>>>> The idea evolves from technical debate then pointing fingers
>>>>>>>>>>>>> like your
>>>>>>>>>>>> comment.
>>>>>>>>>>>>> I think a positive discussion with Michael and a pointer to
>>>>>>>>>>>>> the paper from
>>>>>>>>>>>> Jason gave a good direction of doing _right_ nesting that
>>>>>>>>>>>> follows two
>>>>>>>>>> principles.
>>>>>>>>>>>>> a. efficiency property
>>>>>>>>>>>>> b. equivalence property
>>>>>>>>>>>>>
>>>>>>>>>>>>> (c. resource control is natural already)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Both apply at VMM and at VM level enabling recursive
>>>>>>>>>>>>> virtualization, by
>>>>>>>>>>>> having VF that can act as PF inside the guest.
>>>>>>>>>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
>>>>>>>>>>>> Please just show me your patch resolving these opens, how
>>>>>>>>>>>> about start from defining virito-fs device context and your
>> management VF?
>>>>>>>>>>> As answered, device context infrastructure is done, per device
>>>>>>>>>>> specific device-
>>>>>>>>>> context will be defined incrementally.
>>>>>>>>>>> I will not be including virtio-fs in this series. It will be
>>>>>>>>>>> done incrementally in
>>>>>>>>>> future utilizing the infrastructure build in this series.
>>>>>>>>>> Done? How do you conclude this? You just tell me what is the
>>>>>>>>>> full set of virito-fs device context now and how to migrate them.
>>>>>>>>>>
>>>>>>>>>> You cant? you refuse or you don't? Do you expect the HW
>>>>>>>>>> designer to figure out by themself?
>>>>>>>>> I wont be able to tell now as I don’t think it is necessary for this series.
>>>>>>>>> If one out of 30 devices cannot migrate because of unimaginable
>>>>>>>>> amount of
>>>>>>>> complexity has been placed there, may be one will not implement
>>>>>>>> it as member device.
>>>>>>>>>      From experience of migratable complex gpu devices, rdma
>>>>>>>>> devices (stateful
>>>>>>>> having hundred thousand of stateful QPs), my understanding is
>>>>>>>> complex state of virtio-fs can be defined and migratable.
>>>>>>>>> Mlx5 driver consist of 150,000 lines of code and that device is
>>>>>>>>> migratable
>>>>>>>> with complex state.
>>>>>>>>> So I am optimistic that virtio-fs can be migratable too.
>>>>>>>>> It does not have to limited by my limited creativity of 2023.
>>>>>>>>> May be I am wrong, in that case one will not implement
>>>>>>>>> passthrough virtio-fs
>>>>>>>> device.
>>>>>>>> your series wants to migrate device context, but doesn't define
>>>>>>>> device context, does this sounds reasonable?
>>>>>>> Device generic context is defined at [1] and also the
>>>>>>> infrastructure for defining
>>>>>> the device context in parallel by multiple people can be done post
>>>>>> the work of [1].
>>>>>>> Per each device type context will be defined incrementally post this
>> work.
>>>>>>> [1]
>>>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg001
>>>>>>> 90
>>>>>>> .h
>>>>>>> tml
>>>>>> This is not post of the work, you should define them before you use
>>>>>> them in this series.
>>>>>>
>>>>> I don’t agree to cook ocean in this patch series.
>>>>> No practical spec devel community does it.
>>>>> As long as we feel comfortable that device context framework is
>>>>> extendible, it
>>>> is fine.
>>>>> If virtio-fs seems very hard, may be one will come with a new light
>>>>> weight FS
>>>> device. I really don’t know.
>>>> so you want to migrate device context, but refuse to define them?
>>>>>> And you need to prove why admin vq are better than registers
>>>>>> solution if you want a merge.
>>>>> Michael already responded the practical aspects.
>>>>> Since you may claim, I didn’t answer, below is the technical details.
>>>>>
>>>>> Why admin commands and aq is better is because of below reasons in
>>>>> my
>>>> view:
>>>>> Functionally better:
>>>>> 1. When the live migration registers are located on the VF itself,
>>>>> VMM does
>>>> not have control of it.
>>>>> These registers reset, on FLR and device reset because these are
>>>>> virtio
>>>> registers of the device.
>>>>> Hence, VMM lost the state for the job that VMM was supposed to do.
>>>>> Therefore, passthrough mode cannot depend on these registers.
>>>>>
>>>>> 2. Any bulk data transfer of device context and dirty page tracking
>>>>> requires
>>>> DMA.
>>>>> Hence those DMA must happen to the device which is different than VF
>> itself.
>>>>> If it is on the VF itself, it has two problems.
>>>>> 2.a. VF device reset and FLR will clear them, and device context is lost.
>>>>>
>>>>> 2.b. the DMA occurs at the PCI RID level.
>>>>> IOMMU cannot bifurcate the DMA of one RID to two different address
>>>>> space
>>>> of guest and hypervisor.
>>>>> This requires PASID support.
>>>>> Using PASID has following problems.
>>>>> 2.b.1 PASID typically not used by the kernel software. It is only
>>>>> meant for the
>>>> user processes.
>>>>> Hence for kernel work a reserving PASID won't be acceptable upstream
>>>> kernel.
>>>>> 2.b.2 Somehow if this is done, When the VF itself supports PASID, it
>>>>> required
>>>> now vPASID support.
>>>>> This is again not where industry is going in other forums where I am part of.
>>>> Hence, it will be failure for virtio. Hence, I do not recommend vPASID route.
>>>>> 2.b.3 One of the widely used cpu seems to have dropped the support
>>>>> due to
>>>> limitation of an instruction around PASID.
>>>>> So it cannot be used there, this further limits virtio passthrough users.
>>>>>
>>>>> Even if somehow 2.b.2 and 2.b.3 is overcome in theory, #1 and 2.a is
>>>> functional problems.
>>>>> Scale wise better:
>>>>> 3. Admin command and admin vq are used _only_ when one does device
>>>> migration command.
>>>>> One does not migrate VMs every few msec.
>>>>> Hence such functionality to be better be done which is efficient for
>>>> performance, but without consuming on-chip memory.
>>>>> Admin command and admin vq satisfy those.
>>>>>
>>>>> 4. Once the software matures further, admin command would prefer
>>>> completion interrupt, instead of poll.
>>>>> How to get notification/interrupt? Well, virtqueue defines this already.
>>>>> Should we replicate that in some PF registers?
>>>>> It can be. But once you put all the functionalities of admin command
>>>>> and aq
>>>> in registers the whole thing becomes yet another register_q.
>>>>> 5. Can these registers be placed in the PF to overcome #1 and #2 for
>>>> passthrough?
>>>>> In theory yes.
>>>>> In practice, no, as there are many commands that flow, which needs
>>>>> to scale
>>>> to reasonable number of VFs.
>>>>> Admin commands over admin vq provides this generic facility.
>>>>>
>>>>> 6. Most modern devices who attempts to scale, cut down their
>>>>> register
>>>> footprint, registers are used only for main bootstap, init time config work.
>>>>> Even in virtio spec, one can read:
>>>>> "Device configuration space is generally used for rarely changing or
>>>> initialization-time parameters."
>>>>> Adding some additional registers to a PF device config space for non
>>>>> init time
>>>> parameters does not make sense.
>>>>> 7. Additionally, a nested virtualization should be done by truly
>>>>> nesting the
>>>> device at right abstraction point of owner-member relationship.
>>>>> This follows two principles of (a) efficiency and (b) equivalency of
>>>>> what Jason
>>>> paper pointed.
>>>>> And we ask for nested VF extension we will get our guidance from
>>>>> PCI-SIG, of
>>>> why it should be done if it is matching with rest of the ecosystem
>>>> components that support/don’t support the nesting.
>>>> It they are true, shall we refactor virtio-pci common cfg
>>>> functionalities to use admin vq?
>>> For non-backward compatible SIOV device of the future, yes, virtio-pci
>> common config (non init registers) should be moved to a vq, located on the
>> member device directly.
>> Oh, really? Quite interesting, do you want to move all config space fields in VF
>> to admin vq? Have a plan?
> Not in my plan for spec 1.4 time frame.
> I do not want to divert the discussion, would like to focus on device migration phases.
> Lets please discuss in some other dedicated thread.
OK, but don't say admin vq is better than registers.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  9:14                                                     ` Michael S. Tsirkin
@ 2023-10-19  9:18                                                       ` Zhu, Lingshan
  2023-10-19 10:33                                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-19  9:18 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>> Oh, really? Quite interesting, do you want to move all config space fields in VF
>>> to admin vq? Have a plan?
>> Not in my plan for spec 1.4 time frame.
>> I do not want to divert the discussion, would like to focus on device migration phases.
>> Lets please discuss in some other dedicated thread.
> Possibly, if there's a way to send admin commands to vf itself then
> Lingshan will be happy?
You still need to prove why admin commands are better than registers.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:11                                                   ` Michael S. Tsirkin
@ 2023-10-19  9:20                                                     ` Parav Pandit
  2023-10-19  9:26                                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  9:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 19, 2023 2:41 PM
> 
> On Thu, Oct 19, 2023 at 08:58:10AM +0000, Parav Pandit wrote:
> > It cannot be in the PCI 4K config space for sure.
> > It must reside in the virtio config space.
> 
> Why? 
Because the PCI spec clearly calls out not to place any device-specific registers there.

Citation from the PCI spec:
"It is strongly recommended that PCI Express devices place no registers in Configuration Space other than those in
headers or Capability structures architected by applicable PCI specifications."

> My concern with virtio config space would be that it's not orthogonal to
> other things in the config space. E.g. you need to look at feature bits to discover
> presence. How does this work while you are documenting that device can
> undergo reset at any time?
This is why such a solution cannot work for passthrough. It can only fulfil the #2-based approach.

> With a capability you can discover it without poking at features.
> 
It is discouraged by the PCI spec.
PCI caps are mostly for very small, init-time configuration,
not for running frequent commands.

> > I am sure that this is used for passthrough mode of #1.
> > So, can you please confirm to write this up for mode #2 only?
> 
> To me it sounds like a generally useful capability that could be used as basis e.g.
> for admin command transport.
> 
Unfortunately, it cannot be a PCI capability.
It needs to stay in the virtio area and only fulfill use case #2.

> > > > A variation of that for the member device, there is owner device,
> > > > hence admin command on the AQ can be used.
> > > >
> > > > If we can converge on common virtio interface between #1 and #2, great.
> > > > If we cannot be due to technical issues, we shouldn't step on each
> > > > other's toes, instead build the two interfaces for two different use
> > > > cases overcoming its own technical challenges.
> > > >
> > > > And when in future, someone want to implement different kind of
> > > > bisections, they can propose the extensions.
> > >
> > > Not good at all, this means the interface is very narrow.
> > > Your "propose an extension" just doesn't work practically.
> > > It takes years for things to be widely deployed in the field, by the
> > > time they are there are more use-cases.
> >
> > We usually see it getting deployed in < 1 year time with new spec
> > advancement pace for many features.
> > Building something for unreasonable amount of time without use case results
> > in missing the immediate deployments that happens in 2024 to 2027 of 1.4 spec
> > time frame.
> >
> > > We need something universal and admin commands were supposed to be
> > > just this.
> > I don't see a universal solution for all problems for above #1 and #2.
> >
> > Solving above #2 will cover large part of deployments that users are doing.
> 
> OK. But additionally, if an interface can cover a couple of use-cases we can be
> reasonably sure it's going to cover more going forward.
Maybe.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:20                                                     ` Parav Pandit
@ 2023-10-19  9:26                                                       ` Michael S. Tsirkin
  2023-10-19  9:33                                                         ` Michael S. Tsirkin
  2023-10-19  9:39                                                         ` Parav Pandit
  0 siblings, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  9:26 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 09:20:22AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 19, 2023 2:41 PM
> > 
> > On Thu, Oct 19, 2023 at 08:58:10AM +0000, Parav Pandit wrote:
> > > It cannot be in the PCI 4K config space for sure.
> > > It must reside in the virtio config space.
> > 
> > Why? 
> Because pci spec has clearly called out to not place any device specific things in there.
> 
> Citation in pci spec
> " It is strongly recommended that PCI Express devices place no registers in Configuration Space other than those in
> headers or Capability structures architected by applicable PCI specifications."

But of course, we'd place them inside a vendor specific capability.
We just need a version of virtio_pci_cap64 that's suitable for express.
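
For reference, a minimal C sketch of the existing virtio_pci_cap64 layout from the virtio PCI transport, which today lives in a conventional vendor-specific capability. It only shows what an express-suitable variant would be derived from and does not define one. The helper names cap64_offset() and cap64_length() are illustrative, not spec names, and the sketch assumes the fields have already been converted from little-endian to host order.

```c
#include <assert.h>
#include <stdint.h>

/* Existing layouts from the virtio PCI transport. Multi-byte fields
 * are little-endian in config space; this sketch assumes they have
 * already been converted to host order. */
struct virtio_pci_cap {
    uint8_t  cap_vndr;    /* PCI_CAP_ID_VNDR (0x09) */
    uint8_t  cap_next;    /* offset of next capability, 0 if last */
    uint8_t  cap_len;     /* length of this capability */
    uint8_t  cfg_type;    /* which virtio structure this describes */
    uint8_t  bar;         /* BAR holding the structure */
    uint8_t  id;          /* distinguishes multiple caps of one type */
    uint8_t  padding[2];
    uint32_t offset;      /* low 32 bits of the offset within the BAR */
    uint32_t length;      /* low 32 bits of the structure length */
};

struct virtio_pci_cap64 {
    struct virtio_pci_cap cap;
    uint32_t offset_hi;   /* high 32 bits of the offset */
    uint32_t length_hi;   /* high 32 bits of the length */
};

_Static_assert(sizeof(struct virtio_pci_cap) == 16, "unexpected padding");
_Static_assert(sizeof(struct virtio_pci_cap64) == 24, "unexpected padding");

/* Illustrative helpers (not spec names): join the 32-bit halves. */
static inline uint64_t cap64_offset(const struct virtio_pci_cap64 *c)
{
    return ((uint64_t)c->offset_hi << 32) | c->cap.offset;
}

static inline uint64_t cap64_length(const struct virtio_pci_cap64 *c)
{
    return ((uint64_t)c->length_hi << 32) | c->cap.length;
}
```

The capability itself stays small and read-only; it only names a BAR plus a 64-bit offset and length, so whatever it points at lives outside config space.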


> > My concern with virtio config space would be that it's not orthogonal to
> > other things in the config space. E.g. you need to look at feature bits to discover
> > presence. How does this work while you are documenting that device can
> > undergo reset at any time?
> This is why such solution cannot work for passthrough. It can only fulfil #2 based approach.
> 
> > With a capability you can discover it without poking at features.
> > 
> It is discouraged by pci spec.
> Pci caps for mostly doing very small init time sort of config.
> Not to run frequent commands.

Interesting. Where in the spec exactly?

> > > I am sure that this is used for passthrough mode of #1.
> > > So, can you please confirm to write this up for mode #2 only?
> > 
> > To me it sounds like a generally useful capability that could be used as basis e.g.
> > for admin command transport.
> > 
> Unfortunately, it cannot be pci capability. 
> It needs to stay in virtio area and only fulfill use case of #2.
> 
> > > > > A variation of that for the member device, there is owner device,
> > > > > hence admin command on the AQ can be used.
> > > > >
> > > > > If we can converge on common virtio interface between #1 and #2, great.
> > > > > If we cannot be due to technical issues, we shouldn't step on each
> > > > > other's toes, instead build the two interfaces for two different use
> > > > > cases overcoming its own technical challenges.
> > > > >
> > > > > And when in future, someone want to implement different kind of
> > > > > bisections, they can propose the extensions.
> > > >
> > > > Not good at all, this means the interface is very narrow.
> > > > Your "propose an extension" just doesn't work practically.
> > > > It takes years for things to be widely deployed in the field, by the
> > > > time they are there are more use-cases.
> > >
> > > We usually see it getting deployed in < 1 year time with new spec
> > > advancement pace for many features.
> > > Building something for unreasonable amount of time without use case results
> > > in missing the immediate deployments that happens in 2024 to 2027 of 1.4 spec
> > > time frame.
> > >
> > > > We need something universal and admin commands were supposed to be
> > > > just this.
> > > I don't see a universal solution for all problems for above #1 and #2.
> > >
> > > Solving above #2 will cover large part of deployments that users are doing.
> > 
> > OK. But additionally, if an interface can cover a couple of use-cases we can be
> > reasonably sure it's going to cover more going forward.
> May be.

Yes, hard to be sure. But if it can't, then that's a good sign it's
problematic.

-- 
MST



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:26                                                       ` Michael S. Tsirkin
@ 2023-10-19  9:33                                                         ` Michael S. Tsirkin
  2023-10-19  9:41                                                           ` Parav Pandit
  2023-10-19  9:39                                                         ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  9:33 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 05:26:54AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 19, 2023 at 09:20:22AM +0000, Parav Pandit wrote:
> > 
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, October 19, 2023 2:41 PM
> > > 
> > > On Thu, Oct 19, 2023 at 08:58:10AM +0000, Parav Pandit wrote:
> > > > It cannot be in the PCI 4K config space for sure.
> > > > It must reside in the virtio config space.
> > > 
> > > Why? 
> > Because pci spec has clearly called out to not place any device specific things in there.
> > 
> > Citation in pci spec
> > " It is strongly recommended that PCI Express devices place no registers in Configuration Space other than those in
> > headers or Capability structures architected by applicable PCI specifications."
> 
> But of course, we'd place them inside a vendor specific capability.
> We just need a version of virtio_pci_cap64 that's suitable for express.
> 
> 
> > > My concern with virtio config space would be that it's not orthogonal to
> > > other things in the config space. E.g. you need to look at feature bits to discover
> > > presence. How does this work while you are documenting that device can
> > > undergo reset at any time?
> > This is why such solution cannot work for passthrough. It can only fulfil #2 based approach.
> > 
> > > With a capability you can discover it without poking at features.
> > > 
> > It is discouraged by pci spec.
> > Pci caps for mostly doing very small init time sort of config.
> > Not to run frequent commands.
> 
> Interesting. where in the spec exactly?

And the reason I ask is because I'd like to understand the exact
limitation.

In any case, we should still maybe look for ways to separate it from config.
Maybe a capability points at a BAR and that is where we
have this stuff?
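
A rough sketch of what discovery could look like under that idea: walk the standard PCI capability list in a snapshot of config space, find a vendor-specific capability of a given virtio cfg_type, and read out which BAR region it points at. The register offsets are the standard PCI ones and the field positions follow struct virtio_pci_cap. HYPOTHETICAL_CFG_TYPE, find_virtio_cap(), cap_region() and struct cap_region are made up for illustration and are not a proposed layout.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STATUS            0x06  /* 16-bit status register */
#define PCI_STATUS_CAP_LIST   0x10  /* capability list is present */
#define PCI_CAPABILITY_LIST   0x34  /* pointer to first capability */
#define PCI_CAP_ID_VNDR       0x09  /* vendor-specific capability */
#define HYPOTHETICAL_CFG_TYPE 0xF0  /* invented value for this sketch */

/* Hypothetical result type: where in BAR space the structure lives. */
struct cap_region {
    uint8_t  bar;
    uint32_t offset;
    uint32_t length;
};

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t le32_at(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Return the config-space offset of the first vendor-specific
 * capability whose cfg_type matches, or -1 if there is none. */
static int find_virtio_cap(const uint8_t cfg[256], uint8_t cfg_type)
{
    int ttl = 48;  /* bound the walk in case the list loops */
    uint8_t pos;

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return -1;

    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;
    /* Capabilities start after the 64-byte header; require room
     * for a full 16-byte virtio_pci_cap. */
    while (pos >= 0x40 && pos <= 256 - 16 && ttl-- > 0) {
        if (cfg[pos] == PCI_CAP_ID_VNDR && cfg[pos + 3] == cfg_type)
            return pos;
        pos = cfg[pos + 1] & 0xFC;
    }
    return -1;
}

/* Decode bar/offset/length using the virtio_pci_cap field positions. */
static struct cap_region cap_region(const uint8_t cfg[256], int pos)
{
    struct cap_region r;

    r.bar    = cfg[pos + 4];
    r.offset = le32_at(&cfg[pos + 8]);
    r.length = le32_at(&cfg[pos + 12]);
    return r;
}
```

Discovery here needs only config reads and never touches feature bits, which is the property of interest; the registers themselves would sit in the BAR region the capability names.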

 
> > > > I am sure that this is used for passthrough mode of #1.
> > > > So, can you please confirm to write this up for mode #2 only?
> > > 
> > > To me it sounds like a generally useful capability that could be used as basis e.g.
> > > for admin command transport.
> > > 
> > Unfortunately, it cannot be pci capability. 
> > It needs to stay in virtio area and only fulfill use case of #2.


> > > > > > A variation of that for the member device, there is owner device,
> > > > > > hence admin command on the AQ can be used.
> > > > > >
> > > > > > If we can converge on common virtio interface between #1 and #2, great.
> > > > > > If we cannot be due to technical issues, we shouldn't step on each
> > > > > > other's toes, instead build the two interfaces for two different use
> > > > > > cases overcoming its own technical challenges.
> > > > > >
> > > > > > And when in future, someone want to implement different kind of
> > > > > > bisections, they can propose the extensions.
> > > > >
> > > > > Not good at all, this means the interface is very narrow.
> > > > > Your "propose an extension" just doesn't work practically.
> > > > > It takes years for things to be widely deployed in the field, by the
> > > > > time they are there are more use-cases.
> > > >
> > > > We usually see it getting deployed in < 1 year time with new spec
> > > > advancement pace for many features.
> > > > Building something for unreasonable amount of time without use case results
> > > > in missing the immediate deployments that happens in 2024 to 2027 of 1.4 spec
> > > > time frame.
> > > >
> > > > > We need something universal and admin commands were supposed to be
> > > > > just this.
> > > > I don't see a universal solution for all problems for above #1 and #2.
> > > >
> > > > Solving above #2 will cover large part of deployments that users are doing.
> > > 
> > > OK. But additionally, if an interface can cover a couple of use-cases we can be
> > > reasonably sure it's going to cover more going forward.
> > May be.
> 
> Yes hard to be sure. But if it can't then that's a good sign it's
> problematic.
> 
> -- 
> MST



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:26                                                       ` Michael S. Tsirkin
  2023-10-19  9:33                                                         ` Michael S. Tsirkin
@ 2023-10-19  9:39                                                         ` Parav Pandit
  2023-10-19  9:49                                                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  9:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 19, 2023 2:57 PM
> 
> On Thu, Oct 19, 2023 at 09:20:22AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, October 19, 2023 2:41 PM
> > >
> > > On Thu, Oct 19, 2023 at 08:58:10AM +0000, Parav Pandit wrote:
> > > > It cannot be in the PCI 4K config space for sure.
> > > > It must reside in the virtio config space.
> > >
> > > Why?
> > Because pci spec has clearly called out to not place any device specific things
> > in there.
> >
> > Citation in pci spec
> > " It is strongly recommended that PCI Express devices place no
> > registers in Configuration Space other than those in headers or Capability
> > structures architected by applicable PCI specifications."
> 
> But of course, we'd place them inside a vendor specific capability.
> We just need a version of virtio_pci_cap64 that's suitable for express.
> 
It is one and the same thing: wrapping DMA commands in a capability is the same as open-coding them.
A read-only cap should say where the admin command section is located, which must be done when DRIVER_OK is done.

For sure, we won't use this for passthrough member devices.

So before discussing where to place them, it is fundamental to agree on its use case, which is #2.

> 
> > > My concern with virtio config space would be that it's not
> > > orthogonal to other things in the config space. E.g. you need to
> > > look at feature bits to discover presence. How does this work while
> > > you are documenting that device can undergo reset at any time?
> > This is why such solution cannot work for passthrough. It can only fulfil #2
> > based approach.
> >
> > > With a capability you can discover it without poking at features.
> > >
> > It is discouraged by pci spec.
> > Pci caps for mostly doing very small init time sort of config.
> > Not to run frequent commands.
> 
> Interesting. where in the spec exactly?
> 
Same section as above: section 7.2.2.2, implementation notes.

> > > > I am sure that this is used for passthrough mode of #1.
> > > > So, can you please confirm to write this up for mode #2 only?
> > >
> > > To me it sounds like a generally useful capability that could be used as basis
> > > e.g. for admin command transport.
> > >
> > Unfortunately, it cannot be pci capability.
> > It needs to stay in virtio area and only fulfill use case of #2.
> >
> > > > > > A variation of that for the member device, there is owner
> > > > > > device, hence admin command on the AQ can be used.
> > > > > >
> > > > > > If we can converge on common virtio interface between #1 and #2,
> > > > > > great.
> > > > > > If we cannot be due to technical issues, we shouldn't step on
> > > > > > each other's toes, instead build the two interfaces for two
> > > > > > different use cases overcoming its own technical challenges.
> > > > > >
> > > > > > And when in future, someone want to implement different kind
> > > > > > of bisections, they can propose the extensions.
> > > > >
> > > > > Not good at all, this means the interface is very narrow.
> > > > > Your "propose an extension" just doesn't work practically.
> > > > > It takes years for things to be widely deployed in the field, by
> > > > > the time they are there are more use-cases.
> > > >
> > > > We usually see it getting deployed in < 1 year time with new spec
> > > > advancement pace for many features.
> > > > Building something for unreasonable amount of time without use
> > > > case results in missing the immediate deployments that happens in
> > > > 2024 to 2027 of 1.4 spec time frame.
> > > >
> > > > > We need something universal and admin commands were supposed to
> > > > > be just this.
> > > > I don't see a universal solution for all problems for above #1 and #2.
> > > >
> > > > Solving above #2 will cover large part of deployments that users are doing.
> > >
> > > OK. But additionally, if an interface can cover a couple of
> > > use-cases we can be reasonably sure it's going to cover more going forward.
> > May be.
> 
> Yes hard to be sure. But if it can't then that's a good sign it's problematic.
I don't see it as problematic at all.
If we write the virtio specification for device migration in the 1.4 time frame, I am 100% sure that it will be deployed before the spec release.
Other uses should be able to extend it as they evolve and explain why the current one does not fit them.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:33                                                         ` Michael S. Tsirkin
@ 2023-10-19  9:41                                                           ` Parav Pandit
  2023-10-19  9:53                                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  9:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 19, 2023 3:03 PM
> > > > With a capability you can discover it without poking at features.
> > > >
> > > It is discouraged by pci spec.
> > > Pci caps for mostly doing very small init time sort of config.
> > > Not to run frequent commands.
> >
> > Interesting. where in the spec exactly?
> 
> And the reason I ask is because I'd like to understand the exact limitation.
> 
> In any case, we should still maybe look for ways to separate it from config.
> Maybe a capability points at a BAR and that is where we have this stuff?
> 
Yes, I replied to your previous email with a similar suggestion.
Let's first agree that it is drafted for non-passthrough mode.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:39                                                         ` Parav Pandit
@ 2023-10-19  9:49                                                           ` Michael S. Tsirkin
  2023-10-19  9:57                                                             ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  9:49 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 09:39:48AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 19, 2023 2:57 PM
> > 
> > On Thu, Oct 19, 2023 at 09:20:22AM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, October 19, 2023 2:41 PM
> > > >
> > > > On Thu, Oct 19, 2023 at 08:58:10AM +0000, Parav Pandit wrote:
> > > > > It cannot be in the PCI 4K config space for sure.
> > > > > It must reside in the virtio config space.
> > > >
> > > > Why?
> > > Because pci spec has clearly called out to not place any device specific things
> > > in there.
> > >
> > > Citation in pci spec
> > > " It is strongly recommended that PCI Express devices place no
> > > registers in Configuration Space other than those in headers or Capability
> > > structures architected by applicable PCI specifications."
> > 
> > But of course, we'd place them inside a vendor specific capability.
> > We just need a version of virtio_pci_cap64 that's suitable for express.
> > 
> It is one and the same thing, wrapping dma commands using capability, is same as open coding them.
> Read only cap should say where admin command section is located, which must be done when the DRIVER_OK is done.

This seems very different from your current commands.
There's less value in reusing commands if semantics are
subtly different ..

> For sure, we wont use this for passthrough member devices.
> 
> So before discussing where to place them, it is fundamental to agree its use case which is #2.

Again components need to be versatile, not just focused on one use-case.


> > 
> > > > My concern with virtio config space would be that it's not
> > > > orthogonal to other things in the config space. E.g. you need to
> > > > look at feature bits to discover presence. How does this work while
> > > > you are documenting that device can undergo reset at any time?
> > > This is why such solution cannot work for passthrough. It can only fulfil #2
> > > based approach.
> > >
> > > > With a capability you can discover it without poking at features.
> > > >
> > > It is discouraged by pci spec.
> > > Pci caps for mostly doing very small init time sort of config.
> > > Not to run frequent commands.
> > 
> > Interesting. where in the spec exactly?
> > 
> Same above section, section 7.2.2.2 implementation notes.

I see. So the valid use case is access before memory is enabled.
But in fact, e.g. FLR actually disables memory, does it not?
So if we want the same behaviour as you are proposing here, then
these need to work with memory disabled, yes?

BTW they also recommend against read side effects which virtio violates :(


> > > > > I am sure that this is used for passthrough mode of #1.
> > > > > So, can you please confirm to write this up for mode #2 only?
> > > >
> > > > To me it sounds like a generally useful capability that could be used as basis
> > > > e.g. for admin command transport.
> > > >
> > > Unfortunately, it cannot be pci capability.
> > > It needs to stay in virtio area and only fulfill use case of #2.
> > >
> > > > > > > A variation of that for the member device, there is owner
> > > > > > > device, hence admin command on the AQ can be used.
> > > > > > >
> > > > > > > If we can converge on common virtio interface between #1 and #2,
> > > > > > > great.
> > > > > > > If we cannot be due to technical issues, we shouldn't step on
> > > > > > > each other's toes, instead build the two interfaces for two
> > > > > > > different use cases overcoming its own technical challenges.
> > > > > > >
> > > > > > > And when in future, someone want to implement different kind
> > > > > > > of bisections, they can propose the extensions.
> > > > > >
> > > > > > Not good at all, this means the interface is very narrow.
> > > > > > Your "propose an extension" just doesn't work practically.
> > > > > > It takes years for things to be widely deployed in the field, by
> > > > > > the time they are there are more use-cases.
> > > > >
> > > > > We usually see it getting deployed in < 1 year time with new spec
> > > > > advancement pace for many features.
> > > > > Building something for unreasonable amount of time without use
> > > > > case results in missing the immediate deployments that happens in
> > > > > 2024 to 2027 of 1.4 spec time frame.
> > > > >
> > > > > > We need something universal and admin commands were supposed to
> > > > > > be just this.
> > > > > I don't see a universal solution for all problems for above #1 and #2.
> > > > >
> > > > > Solving above #2 will cover large part of deployments that users are doing.
> > > >
> > > > OK. But additionally, if an interface can cover a couple of
> > > > use-cases we can be reasonably sure it's going to cover more going forward.
> > > May be.
> > 
> > Yes hard to be sure. But if it can't then that's a good sign it's problematic.
> I don't see it problematic at all.
> If we write the virtio specification for device migration in 1.4-time frame, I am 100% sure that it will be deployed before spec release.
> Other uses should be able to extend it as they evolve and explain why the current one does not fit them.

My experience tells me this results in a huge mess of incompatible
interfaces; the further you go down this road, the harder it
becomes to add new things as they conflict with old ones.

-- 
MST



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:41                                                           ` Parav Pandit
@ 2023-10-19  9:53                                                             ` Michael S. Tsirkin
  2023-10-19  9:54                                                               ` Michael S. Tsirkin
  2023-10-19 10:00                                                               ` Parav Pandit
  0 siblings, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  9:53 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 09:41:46AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 19, 2023 3:03 PM
> > > > > With a capability you can discover it without poking at features.
> > > > >
> > > > It is discouraged by pci spec.
> > > > Pci caps for mostly doing very small init time sort of config.
> > > > Not to run frequent commands.
> > >
> > > Interesting. where in the spec exactly?
> > 
> > And the reason I ask is because I'd like to understand the exact limitation.
> > 
> > In any case, we should still maybe look for ways to separate it from config.
> > Maybe a capability points at a BAR and that is where we have this stuff?
> > 
> Yes, I replied in your previous email, a similar suggestion.
> Lets first agree that it is drafted for non-passthrough mode.

I could see passthrough using it too if it is there, why not.

-- 
MST



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:53                                                             ` Michael S. Tsirkin
@ 2023-10-19  9:54                                                               ` Michael S. Tsirkin
  2023-10-19 10:00                                                               ` Parav Pandit
  1 sibling, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19  9:54 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 05:53:29AM -0400, Michael S. Tsirkin wrote:
> On Thu, Oct 19, 2023 at 09:41:46AM +0000, Parav Pandit wrote:
> > 
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, October 19, 2023 3:03 PM
> > > > > > With a capability you can discover it without poking at features.
> > > > > >
> > > > > It is discouraged by pci spec.
> > > > > Pci caps for mostly doing very small init time sort of config.
> > > > > Not to run frequent commands.
> > > >
> > > > Interesting. where in the spec exactly?
> > > 
> > > And the reason I ask is because I'd like to understand the exact limitation.
> > > 
> > > In any case, we should still maybe look for ways to separate it from config.
> > > Maybe a capability points at a BAR and that is where we have this stuff?
> > > 
> > Yes, I replied in your previous email, a similar suggestion.
> > Lets first agree that it is drafted for non-passthrough mode.
> 
> I could see passthrough use it too if there, why not.

The big limitation of such a hack is dependence on PASID.
So it's more of a long term option when PASID becomes more
widespread.

-- 
MST


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:49                                                           ` Michael S. Tsirkin
@ 2023-10-19  9:57                                                             ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-19  9:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 19, 2023 3:20 PM
> 
> On Thu, Oct 19, 2023 at 09:39:48AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, October 19, 2023 2:57 PM
> > >
> > > On Thu, Oct 19, 2023 at 09:20:22AM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Thursday, October 19, 2023 2:41 PM
> > > > >
> > > > > On Thu, Oct 19, 2023 at 08:58:10AM +0000, Parav Pandit wrote:
> > > > > > It cannot be in the PCI 4K config space for sure.
> > > > > > It must reside in the virtio config space.
> > > > >
> > > > > Why?
> > > > Because pci spec has clearly called out to not place any device
> > > > specific things
> > > in there.
> > > >
> > > > Citation in pci spec
> > > > " It is strongly recommended that PCI Express devices place no
> > > > registers in Configuration Space other than those in headers or
> > > > Capability
> > > structures architected by applicable PCI specifications."
> > >
> > > But of course, we'd place them inside a vendor specific capability.
> > > We just need a version of virtio_pci_cap64 that's suitable for express.
> > >
> > It is one and the same thing, wrapping dma commands using capability, is
> same as open coding them.
> > Read only cap should say where admin command section is located, which
> must be done when the DRIVER_OK is done.
> 
> This seems very different from your current commands.
> There's less value in reusing commands if semantics are subtly different ..
> 
I mean that a read-only cap points to a location in a BAR.
The mediation software, which fully owns the member device, issues the side commands there after DRIVER_OK.
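
[Editor's sketch] The "read-only cap points to a location in a BAR" idea can be
illustrated with a small capability walk over a config-space snapshot. This is
only a sketch under stated assumptions: the field offsets follow the existing
struct virtio_pci_cap, but the cfg_type value HYP_CFG_TYPE_ADMIN_CMD and the
helper names (find_virtio_cap, build_example_cfg, struct bar_region) are
hypothetical and are not defined by the virtio spec or proposed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Standard PCI config-space offsets and IDs. */
#define PCI_STATUS           0x06  /* status register, low byte */
#define PCI_STATUS_CAP_LIST  0x10  /* capability list is present */
#define PCI_CAPABILITY_LIST  0x34  /* pointer to the first capability */
#define PCI_CAP_ID_VNDR      0x09  /* vendor-specific capability */

/* Field offsets inside the existing struct virtio_pci_cap. */
#define VPC_CAP_LEN   2
#define VPC_CFG_TYPE  3
#define VPC_BAR       4
#define VPC_OFFSET    8
#define VPC_LENGTH    12
#define VPC_SIZE      16

/* HYPOTHETICAL cfg_type: not defined by any virtio spec release. */
#define HYP_CFG_TYPE_ADMIN_CMD 0x40

struct bar_region {
    uint8_t  bar;     /* BAR index holding the region */
    uint32_t offset;  /* byte offset within that BAR */
    uint32_t length;  /* region length in bytes */
};

static uint32_t get_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Walk the capability list of a 256-byte config-space snapshot.
 * Returns 0 and fills *out on a match, -1 if no matching cap exists. */
static int find_virtio_cap(const uint8_t cfg[256], uint8_t cfg_type,
                           struct bar_region *out)
{
    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return -1;

    uint8_t pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
    /* The guard bounds the walk in case the list is malformed (a loop). */
    for (int guard = 0; pos >= 0x40 && guard < 48; guard++) {
        if (cfg[pos] == PCI_CAP_ID_VNDR && pos <= 256 - VPC_SIZE &&
            cfg[pos + VPC_CAP_LEN] >= VPC_SIZE &&
            cfg[pos + VPC_CFG_TYPE] == cfg_type) {
            out->bar = cfg[pos + VPC_BAR];
            out->offset = get_le32(&cfg[pos + VPC_OFFSET]);
            out->length = get_le32(&cfg[pos + VPC_LENGTH]);
            return 0;
        }
        pos = cfg[pos + 1] & 0xfc;  /* follow the next pointer */
    }
    return -1;
}

/* Build a made-up config space: an MSI-X cap followed by the
 * hypothetical admin-command cap pointing at BAR 4, offset 0x3000. */
static void build_example_cfg(uint8_t cfg[256])
{
    memset(cfg, 0, 256);
    cfg[PCI_STATUS] = PCI_STATUS_CAP_LIST;
    cfg[PCI_CAPABILITY_LIST] = 0x40;
    cfg[0x40] = 0x11;                       /* MSI-X capability ID */
    cfg[0x41] = 0x50;                       /* next capability */
    cfg[0x50] = PCI_CAP_ID_VNDR;
    cfg[0x51] = 0x00;                       /* end of list */
    cfg[0x50 + VPC_CAP_LEN] = VPC_SIZE;
    cfg[0x50 + VPC_CFG_TYPE] = HYP_CFG_TYPE_ADMIN_CMD;
    cfg[0x50 + VPC_BAR] = 4;
    put_le32(&cfg[0x50 + VPC_OFFSET], 0x3000);
    put_le32(&cfg[0x50 + VPC_LENGTH], 0x1000);
}
```

In this sketch the capability itself is only read at discovery time; any
command traffic would go to the BAR region it names, which matches the point
made above that config-space capabilities are meant for small init-time data.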

> > For sure, we wont use this for passthrough member devices.
> >
> > So before discussing where to place them, it is fundamental to agree its use
> case which is #2.
> 
> Again components need to be versatile, not just focused on one use-case.
> 
I don't see how one can make a tire that fits both a truck and a car. :)
We are trying, so let's see; maybe we can.

> 
> > >
> > > > > My concern with virtio config space would be that it's not
> > > > > orthogonal to other things in the config space. E.g. you need to
> > > > > look at feature bits to discover presence. How does this work
> > > > > while you are documenting that device can undergo reset at any time?
> > > > This is why such solution cannot work for passthrough. It can only
> > > > fulfil #2
> > > based approach.
> > > >
> > > > > With a capability you can discover it without poking at features.
> > > > >
> > > > It is discouraged by pci spec.
> > > > Pci caps for mostly doing very small init time sort of config.
> > > > Not to run frequent commands.
> > >
> > > Interesting. where in the spec exactly?
> > >
> > Same above section, section 7.2.2.2 implementation notes.
> 
> I see. So the valid use case is access before memory is enabled.
> But in fact, e.g. FLR actually disables memory does it not?
It does. This is why all these VF-resident registers only work for the non-passthrough case.

> So if we want same behaviour as you are proposing here then these need to
> work with memory disabled yes?
> 
> BTW they also recommend against read side effects which virtio violates :(
> 
Yes.
This is why we should stay away from these caps; they are not the vehicle for admin commands.

> 
> > > > > > I am sure that this is used for passthrough mode of #1.
> > > > > > So, can you please confirm to write this up for mode #2 only?
> > > > >
> > > > > To me it sounds like a generally useful capability that could be
> > > > > used as basis
> > > e.g.
> > > > > for admin command transport.
> > > > >
> > > > Unfortunately, it cannot be pci capability.
> > > > It needs to stay in virtio area and only fulfill use case of #2.
> > > >
> > > > > > > > A variation of that for the member device, there is owner
> > > > > > > > device, hence
> > > > > > > admin command on the AQ can be used.
> > > > > > > >
> > > > > > > > If we can converge on common virtio interface between #1
> > > > > > > > and #2,
> > > great.
> > > > > > > > If we cannot be due to technical issues, we shouldn't step
> > > > > > > > on each other's
> > > > > > > toes, instead build the two interfaces for two different use
> > > > > > > cases overcoming its own technical challenges.
> > > > > > > >
> > > > > > > > And when in future, someone want to implement different
> > > > > > > > kind of bisections,
> > > > > > > they can propose the extensions.
> > > > > > >
> > > > > > > Not good at all, this means the interface is very narrow.
> > > > > > > Your "propose an extension" just doesn't work practically.
> > > > > > > It takes years for things to be widely deployed in the
> > > > > > > field, by the time they are there are more use-cases.
> > > > > >
> > > > > > We usually see it getting deployed in < 1 year time with new
> > > > > > spec
> > > > > advancement pace for many features.
> > > > > > Building something for unreasonable amount of time without use
> > > > > > case results
> > > > > in missing the immediate deployments that happens in 2024 to
> > > > > 2027 of
> > > > > 1.4 spec time frame.
> > > > > >
> > > > > > > We need something universal and admin commands were supposed
> > > > > > > to be just this.
> > > > > > I don't see a universal solution for all problems for above #1 and #2.
> > > > > >
> > > > > > Solving above #2 will cover large part of deployments that users are
> doing.
> > > > >
> > > > > OK. But additionally, if an interface can cover a couple of
> > > > > use-cases we can be reasonably sure it's going to cover more going
> forward.
> > > > May be.
> > >
> > > Yes hard to be sure. But if it can't then that's a good sign it's problematic.
> > I don't see it problematic at all.
> > If we write the virtio specification for device migration in 1.4-time frame, I
> am 100% sure that it will be deployed before spec release.
> > Other uses should be able to extend it as they evolve and explain why the
> current one does not fit them.
> 
> My experience tells me this results in a huge mess of incompatible interfaces,
> the further you go down this road the harder it becomes to add new things as
> they conflict with old ones.
Too abstract for me.
I will focus on the technical points above, which are key to drafting a common interface.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  9:53                                                             ` Michael S. Tsirkin
  2023-10-19  9:54                                                               ` Michael S. Tsirkin
@ 2023-10-19 10:00                                                               ` Parav Pandit
  2023-10-19 10:01                                                                 ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-19 10:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 19, 2023 3:23 PM
> 
> On Thu, Oct 19, 2023 at 09:41:46AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, October 19, 2023 3:03 PM
> > > > > > With a capability you can discover it without poking at features.
> > > > > >
> > > > > It is discouraged by pci spec.
> > > > > Pci caps for mostly doing very small init time sort of config.
> > > > > Not to run frequent commands.
> > > >
> > > > Interesting. where in the spec exactly?
> > >
> > > And the reason I ask is because I'd like to understand the exact limitation.
> > >
> > > In any case, we should still maybe look for ways to separate it from config.
> > > Maybe a capability points at a BAR and that is where we have this stuff?
> > >
> > Yes, I replied in your previous email, a similar suggestion.
> > Lets first agree that it is drafted for non-passthrough mode.
> 
> I could see passthrough use it too if there, why not.
For three reasons:
1. It does not survive FLR.
2. DMA occurs on the RID, so without PASID a VM can directly attack the memory address programmed in it.
3. PASID will not be available on most common platforms anytime soon, within the virtio 1.4 time frame.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19 10:00                                                               ` Parav Pandit
@ 2023-10-19 10:01                                                                 ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-19 10:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Parav Pandit
> Sent: Thursday, October 19, 2023 3:30 PM
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, October 19, 2023 3:23 PM
> >
> > On Thu, Oct 19, 2023 at 09:41:46AM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, October 19, 2023 3:03 PM
> > > > > > > With a capability you can discover it without poking at features.
> > > > > > >
> > > > > > It is discouraged by pci spec.
> > > > > > Pci caps for mostly doing very small init time sort of config.
> > > > > > Not to run frequent commands.
> > > > >
> > > > > Interesting. where in the spec exactly?
> > > >
> > > > And the reason I ask is because I'd like to understand the exact limitation.
> > > >
> > > > In any case, we should still maybe look for ways to separate it from config.
> > > > Maybe a capability points at a BAR and that is where we have this stuff?
> > > >
> > > Yes, I replied in your previous email, a similar suggestion.
> > > Lets first agree that it is drafted for non-passthrough mode.
> >
> > I could see passthrough use it too if there, why not.
> For three reasons:
> 1. it does not survive FLR
> 2. DMA occurs on the RID, A vm can directly attack to this programmed
> memory address in it if done without PASID 3. PASID is not available anytime
> soon in virtio 1.4-time frame on most common platforms.
And fourth, these new registers do not scale at the VF level.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19  9:18                                                       ` Zhu, Lingshan
@ 2023-10-19 10:33                                                         ` Parav Pandit
  2023-10-19 11:19                                                           ` Michael S. Tsirkin
  2023-10-20  9:31                                                           ` Zhu, Lingshan
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-19 10:33 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, October 19, 2023 2:48 PM
> 
> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> >>> Oh, really? Quite interesting, do you want to move all config space
> >>> fields in VF to admin vq? Have a plan?
> >> Not in my plan for spec 1.4 time frame.
> >> I do not want to divert the discussion, would like to focus on device
> migration phases.
> >> Lets please discuss in some other dedicated thread.
> > Possibly, if there's a way to send admin commands to vf itself then
> > Lingshan will be happy?
> still need to prove why admin commands are better than registers.

Virtio spec development is not a proof-based approach. Please stop asking for it.

I tried my best to give a technical answer in [1].
I explained that registers simply do not work for passthrough mode
(if that is what you mean when you ask to prove that it is better).
They can work for non-passthrough mediated mode.

A member device may do admin commands using registers. Michael and I are presently discussing this in the same thread.

Since there are multiple things to be done for device migration, a dedicated register set for each functionality does not scale well and is hard to maintain and extend.
A register holding the command content makes sense.

Now, with that, if this can be useful only for non-passthrough, I made a humble request to transport them using the AQ; this way, you get all the benefits of the AQ.
And I am trying to understand why the AQ is not possible or is inferior.

If you have commands like suspend/resume device, register or queue transport simply doesn’t work, because it's wrong to bifurcate the device with such a weird API.
If you want to bifurcate for mediation software, it probably makes sense to operate at each VQ level and at config space level. Those are very different commands from passthrough.
I think vdpa has demonstrated very well how to do specific work for a specific device type. So some of that work can be done using the AQ.

[1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
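
[Editor's sketch] The contrast drawn above, between one dedicated register set
per function and a single place that carries the command content, can be
illustrated by encoding the device-readable header of an admin command. This is
a minimal sketch, assuming the header layout of struct virtio_admin_cmd as
published in the virtio spec (opcode, group_type, 12 reserved bytes,
group_member_id, all little-endian). The three HYP_CMD_* opcode values and the
function name encode_admin_cmd are hypothetical and are not taken from this
patch series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_ADMIN_GROUP_TYPE_SRIOV 0x1

/* HYPOTHETICAL opcodes, for illustration only. */
#define HYP_CMD_DEV_SUSPEND   0x0100
#define HYP_CMD_DEV_RESUME    0x0101
#define HYP_CMD_DEV_CTX_READ  0x0102

/* opcode(2) + group_type(2) + reserved1(12) + group_member_id(8) */
#define ADMIN_CMD_HDR_SIZE 24

static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

static void put_le64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Encode the device-readable part of an admin command for an SR-IOV
 * group member. Returns the number of bytes written, or 0 if the
 * buffer is too small. New functionality means a new opcode value,
 * not a new register. */
static size_t encode_admin_cmd(uint8_t *buf, size_t buf_len,
                               uint16_t opcode, uint64_t member_id,
                               const void *data, size_t data_len)
{
    if (buf_len < ADMIN_CMD_HDR_SIZE ||
        data_len > buf_len - ADMIN_CMD_HDR_SIZE)
        return 0;

    memset(buf, 0, ADMIN_CMD_HDR_SIZE);      /* clears reserved1 too */
    put_le16(buf + 0, opcode);
    put_le16(buf + 2, VIRTIO_ADMIN_GROUP_TYPE_SRIOV);
    put_le64(buf + 16, member_id);           /* VF number in the group */
    if (data_len)
        memcpy(buf + ADMIN_CMD_HDR_SIZE, data, data_len);
    return ADMIN_CMD_HDR_SIZE + data_len;
}
```

The same encoded buffer could in principle be carried by an admin virtqueue or
by a register-based mailbox; the sketch says nothing about which transport is
appropriate for passthrough, which is the open question in this thread.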

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19 10:33                                                         ` Parav Pandit
@ 2023-10-19 11:19                                                           ` Michael S. Tsirkin
  2023-10-19 12:02                                                             ` Parav Pandit
  2023-10-20  9:31                                                           ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-19 11:19 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 10:33:15AM +0000, Parav Pandit wrote:
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Thursday, October 19, 2023 2:48 PM
> > 
> > On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> > > On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> > >>> Oh, really? Quite interesting, do you want to move all config space
> > >>> fields in VF to admin vq? Have a plan?
> > >> Not in my plan for spec 1.4 time frame.
> > >> I do not want to divert the discussion, would like to focus on device
> > migration phases.
> > >> Lets please discuss in some other dedicated thread.
> > > Possibly, if there's a way to send admin commands to vf itself then
> > > Lingshan will be happy?
> > still need to prove why admin commands are better than registers.
> 
> Virtio spec development is not proof based approach. Please stop asking for it.
> 
> I tried my best to have technical answer in [1].
> I explained that registers simply do not work for passthrough mode
> (if this is what you are asking when you are asking prove its better).
> They can work for non_passthrough mediated mode.
> 
> A member device may do admin commands using registers. Michael and I are discussing presently in the same thread.
> 
> Since there are multiple things to be done for device migration, dedicated register set for each functionality do not scale well, hard to maintain and extend.
> A register holding a command content make sense.
> 
> Now, with that, if this can be useful only for non_passthrough, I made humble request to transport them using AQ, this way, you get all benefits of AQ.
> And trying to understand, why AQ cannot possible or inferior?

I think the real limitation is with the SRIOV group type. If using PASID
instead of source id for isolation then we'd need a new group type.
And maybe a simpler interface than virtqueue since it's all
slow path anyway.


> If you have commands like suspend/resume device, register or queue transport simply don’t work, because it's wrong to bifurcate the device with such weird API.
> If you want to bifurcate for mediation software, it probably makes sense to operate at each VQ level, config space level. Such are very different commands than passthrough.
> I think vdpa has demonstrated that very well on how to do specific work for specific device type. So some of those work can be done using AQ.
> 
> [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19 11:19                                                           ` Michael S. Tsirkin
@ 2023-10-19 12:02                                                             ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-19 12:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, October 19, 2023 4:50 PM

> > Now, with that, if this can be useful only for non_passthrough, I made humble
> request to transport them using AQ, this way, you get all benefits of AQ.
> > And trying to understand, why AQ cannot possible or inferior?
> 
> I think the real limitation is with the SRIOV group type. If using PASID instead of
> source id for isolation then we'd need a new group type.
RID+PASID is the isolation object.
I haven’t seen PASID used alone.

> And maybe a simpler interface than virtqueue since it's all slow path anyway.
> 
Once it works functionally, one will need all the features of a vq, including interrupt notification.

> 
> > If you have commands like suspend/resume device, register or queue
> transport simply don’t work, because it's wrong to bifurcate the device with
> such weird API.
> > If you want to bifurcate for mediation software, it probably makes sense to
> operate at each VQ level, config space level. Such are very different commands
> than passthrough.
> > I think vdpa has demonstrated that very well on how to do specific work for
> specific device type. So some of those work can be done using AQ.
> >
> > [1]
> > https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd10336
> > 2dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-19 10:33                                                         ` Parav Pandit
  2023-10-19 11:19                                                           ` Michael S. Tsirkin
@ 2023-10-20  9:31                                                           ` Zhu, Lingshan
  2023-10-20  9:41                                                             ` Michael S. Tsirkin
  2023-10-20 12:54                                                             ` Parav Pandit
  1 sibling, 2 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-20  9:31 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/19/2023 6:33 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, October 19, 2023 2:48 PM
>>
>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>>>> Oh, really? Quite interesting, do you want to move all config space
>>>>> fields in VF to admin vq? Have a plan?
>>>> Not in my plan for spec 1.4 time frame.
>>>> I do not want to divert the discussion, would like to focus on device
>> migration phases.
>>>> Lets please discuss in some other dedicated thread.
>>> Possibly, if there's a way to send admin commands to vf itself then
>>> Lingshan will be happy?
>> still need to prove why admin commands are better than registers.
> Virtio spec development is not proof based approach. Please stop asking for it.
>
> I tried my best to have technical answer in [1].
> I explained that registers simply do not work for passthrough mode
> (if this is what you are asking when you are asking prove its better).
> They can work for non_passthrough mediated mode.
>
> A member device may do admin commands using registers. Michael and I are discussing presently in the same thread.
>
> Since there are multiple things to be done for device migration, dedicated register set for each functionality do not scale well, hard to maintain and extend.
> A register holding a command content make sense.
>
> Now, with that, if this can be useful only for non_passthrough, I made humble request to transport them using AQ, this way, you get all benefits of AQ.
> And trying to understand, why AQ cannot possible or inferior?
>
> If you have commands like suspend/resume device, register or queue transport simply don’t work, because it's wrong to bifurcate the device with such weird API.
> If you want to bifurcate for mediation software, it probably makes sense to operate at each VQ level, config space level. Such are very different commands than passthrough.
> I think vdpa has demonstrated that very well on how to do specific work for specific device type. So some of those work can be done using AQ.
>
> [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
We have been through your statement many times.
This is not about how many times you repeat it; if you think this is
true, you need to prove it with solid evidence.


For pass-through, I still recommend that you refer to the current
virtio-pci implementation; it works for pass-through, right?
For scale, I have already told you many times that they are per-device
facilities. How can a per-device facility not scale?
vDPA works fine on config space.

So, if you still insist that admin vq is better than config space, as you
concluded in the other thread, you imply that the config space
interfaces should be refactored to admin vq.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-20  9:31                                                           ` Zhu, Lingshan
@ 2023-10-20  9:41                                                             ` Michael S. Tsirkin
  2023-10-20 11:11                                                               ` Zhu, Lingshan
  2023-10-23  3:53                                                               ` Jason Wang
  2023-10-20 12:54                                                             ` Parav Pandit
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-20  9:41 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Oct 20, 2023 at 05:31:01PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 10/19/2023 6:33 PM, Parav Pandit wrote:
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Thursday, October 19, 2023 2:48 PM
> > > 
> > > On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> > > > > > Oh, really? Quite interesting, do you want to move all config space
> > > > > > fields in VF to admin vq? Have a plan?
> > > > > Not in my plan for spec 1.4 time frame.
> > > > > I do not want to divert the discussion, would like to focus on device
> > > migration phases.
> > > > > Lets please discuss in some other dedicated thread.
> > > > Possibly, if there's a way to send admin commands to vf itself then
> > > > Lingshan will be happy?
> > > still need to prove why admin commands are better than registers.
> > Virtio spec development is not proof based approach. Please stop asking for it.
> > 
> > I tried my best to have technical answer in [1].
> > I explained that registers simply do not work for passthrough mode
> > (if this is what you are asking when you are asking prove its better).
> > They can work for non_passthrough mediated mode.
> > 
> > A member device may do admin commands using registers. Michael and I are discussing presently in the same thread.
> > 
> > Since there are multiple things to be done for device migration, dedicated register set for each functionality do not scale well, hard to maintain and extend.
> > A register holding a command content make sense.
> > 
> > Now, with that, if this can be useful only for non_passthrough, I made humble request to transport them using AQ, this way, you get all benefits of AQ.
> > And trying to understand, why AQ cannot possible or inferior?
> > 
> > If you have commands like suspend/resume device, register or queue transport simply don’t work, because it's wrong to bifurcate the device with such weird API.
> > If you want to bifurcate for mediation software, it probably makes sense to operate at each VQ level, config space level. Such are very different commands than passthrough.
> > I think vdpa has demonstrated that very well on how to do specific work for specific device type. So some of those work can be done using AQ.
> > 
> > [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> We have been through your statement for many times.
> This is not about how many times you repeated, if you think this is true,
> you need to prove that with solid evidence.
> 
> 
> For pass-through, I still recommend you to take a reference of current
> virito-pci implementation, it works for pass-through, right?

Current migration implementation in e.g. QEMU? It does but it
traps data path accesses. That, I think we can agree,
should not be the only option to migrate.

> For scale, I already told you for many times that they are per-device
> facilities. How can a per-device facility not scale?
> vDPA works fine on config space.
> 
> So, if you still insist admin vq is better than config space like in other
> thread you have concluded, you may imply that config space interfaces should
> be re-factored to admin vq.

There are good arguments that yes, virtio needs a transport for config space
that is DMA based as opposed to memory mapped based.  This is one of the
things all vendors seem to prefer in IDPF so virtio should have the option.


-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-20  9:41                                                             ` Michael S. Tsirkin
@ 2023-10-20 11:11                                                               ` Zhu, Lingshan
  2023-10-20 12:47                                                                 ` Parav Pandit
  2023-10-21 15:34                                                                 ` Michael S. Tsirkin
  2023-10-23  3:53                                                               ` Jason Wang
  1 sibling, 2 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-20 11:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/20/2023 5:41 PM, Michael S. Tsirkin wrote:
> On Fri, Oct 20, 2023 at 05:31:01PM +0800, Zhu, Lingshan wrote:
>>
>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Thursday, October 19, 2023 2:48 PM
>>>>
>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>>>>>> Oh, really? Quite interesting, do you want to move all config space
>>>>>>> fields in VF to admin vq? Have a plan?
>>>>>> Not in my plan for spec 1.4 time frame.
>>>>>> I do not want to divert the discussion, would like to focus on device
>>>> migration phases.
>>>>>> Lets please discuss in some other dedicated thread.
>>>>> Possibly, if there's a way to send admin commands to vf itself then
>>>>> Lingshan will be happy?
>>>> still need to prove why admin commands are better than registers.
>>> Virtio spec development is not proof based approach. Please stop asking for it.
>>>
>>> I tried my best to have technical answer in [1].
>>> I explained that registers simply do not work for passthrough mode
>>> (if this is what you are asking when you are asking prove its better).
>>> They can work for non_passthrough mediated mode.
>>>
>>> A member device may do admin commands using registers. Michael and I are discussing presently in the same thread.
>>>
>>> Since there are multiple things to be done for device migration, dedicated register set for each functionality do not scale well, hard to maintain and extend.
>>> A register holding a command content make sense.
>>>
>>> Now, with that, if this can be useful only for non_passthrough, I made humble request to transport them using AQ, this way, you get all benefits of AQ.
>>> And trying to understand, why AQ cannot possible or inferior?
>>>
>>> If you have commands like suspend/resume device, register or queue transport simply don’t work, because it's wrong to bifurcate the device with such weird API.
>>> If you want to biferacate for mediation software, it probably makes sense to operate at each VQ level, config space level. Such are very different commands than passthrough.
>>> I think vdpa has demonstrated that very well on how to do specific work for specific device type. So some of those work can be done using AQ.
>>>
>>> [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
>> We have been through your statement for many times.
>> This is not about how many times you repeated, if you think this is true,
>> you need to prove that with solid evidence.
>>
>>
>> For pass-through, I still recommend you to take a reference of current
>> virito-pci implementation, it works for pass-through, right?
> Current migration implementation in e.g. QEMU? It does but it
> traps data path accesses. That, I think we can agree,
> should not be the only option to migrate.
OK, I am glad we agree that config space works for pass-through;
I hope we don't need to discuss this anymore.
>
>> For scale, I already told you for many times that they are per-device
>> facilities. How can a per-device facility not scale?
>> vDPA works fine on config space.
>>
>> So, if you still insist admin vq is better than config space like in other
>> thread you have concluded, you may imply that config space interfaces should
>> be re-factored to admin vq.
> There are good arguments that yes, virtio needs a transport for config space
> that is DMA based as opposed to memory mapped based.  This is one of the
> things all vendors seem to prefer in IDPF so virtio should have the option.
Do you really want to refactor the virtio-pci common config fields into
the PF's admin vq?
E.g., do you really want to move queue_enable from the virtio-pci common
config fields to the PF's admin vq?

Config space is the control path and DMA is the data path; let's not mix
them. We never expect to use config space to transfer data.

So we need DMA to transfer data. For example, I take advantage of device
DMA to log dirty pages; this also applies to in-flight descriptors.

And we are implementing virtio live migration, not only for PCI.

So both Jason and I keep repeating: we are implementing basic
facilities, and the implementation is transport specific.

We have proposed building the admin vq on top of our register solution;
this could even help to resolve the nested issue.

But I see the proposal has been rejected.

I still believe the goal is to build the best spec, not one that "just
works" with limitations.



>
>



* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-20 11:11                                                               ` Zhu, Lingshan
@ 2023-10-20 12:47                                                                 ` Parav Pandit
  2023-10-23  9:48                                                                   ` Zhu, Lingshan
  2023-10-21 15:34                                                                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-20 12:47 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, October 20, 2023 4:42 PM
> 
> On 10/20/2023 5:41 PM, Michael S. Tsirkin wrote:
> > On Fri, Oct 20, 2023 at 05:31:01PM +0800, Zhu, Lingshan wrote:
> >>
> >> On 10/19/2023 6:33 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Thursday, October 19, 2023 2:48 PM
> >>>>
> >>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> >>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> >>>>>>> Oh, really? Quite interesting, do you want to move all config
> >>>>>>> space fields in VF to admin vq? Have a plan?
> >>>>>> Not in my plan for spec 1.4 time frame.
> >>>>>> I do not want to divert the discussion, would like to focus on
> >>>>>> device
> >>>> migration phases.
> >>>>>> Lets please discuss in some other dedicated thread.
> >>>>> Possibly, if there's a way to send admin commands to vf itself
> >>>>> then Lingshan will be happy?
> >>>> still need to prove why admin commands are better than registers.
> >>> Virtio spec development is not proof based approach. Please stop asking for
> it.
> >>>
> >>> I tried my best to have technical answer in [1].
> >>> I explained that registers simply do not work for passthrough mode
> >>> (if this is what you are asking when you are asking prove its better).
> >>> They can work for non_passthrough mediated mode.
> >>>
> >>> A member device may do admin commands using registers. Michael and I
> are discussing presently in the same thread.
> >>>
> >>> Since there are multiple things to be done for device migration, dedicated
> register set for each functionality do not scale well, hard to maintain and
> extend.
> >>> A register holding a command content make sense.
> >>>
> >>> Now, with that, if this can be useful only for non_passthrough, I made
> humble request to transport them using AQ, this way, you get all benefits of AQ.
> >>> And trying to understand, why AQ cannot possible or inferior?
> >>>
> >>> If you have commands like suspend/resume device, register or queue
> transport simply don’t work, because it's wrong to bifurcate the device with
> such weird API.
> >>> If you want to biferacate for mediation software, it probably makes sense to
> operate at each VQ level, config space level. Such are very different commands
> than passthrough.
> >>> I think vdpa has demonstrated that very well on how to do specific work for
> specific device type. So some of those work can be done using AQ.
> >>>
> >>> [1]
> >>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103
> >>> 362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> >> We have been through your statement for many times.
> >> This is not about how many times you repeated, if you think this is
> >> true, you need to prove that with solid evidence.
> >>
> >>
> >> For pass-through, I still recommend you to take a reference of
> >> current virito-pci implementation, it works for pass-through, right?
> > Current migration implementation in e.g. QEMU? It does but it traps
> > data path accesses. That, I think we can agree, should not be the only
> > option to migrate.
> OK, I am glad we agree that config space work for pass-through, hope we don't
> need to discuss this anymore.
> >
> >> For scale, I already told you for many times that they are per-device
> >> facilities. How can a per-device facility not scale?
> >> vDPA works fine on config space.
> >>
> >> So, if you still insist admin vq is better than config space like in
> >> other thread you have concluded, you may imply that config space
> >> interfaces should be re-factored to admin vq.
> > There are good arguments that yes, virtio needs a transport for config
> > space that is DMA based as opposed to memory mapped based.  This is
> > one of the things all vendors seem to prefer in IDPF so virtio should have the
> option.
> Do you really want to refactor virtio-pci common config fields to PF's admin vq?
> E.g, do you really want to move queue_enable in virtio-pci common config
> fields to PF's admin vq?
> 
No. Please read the response carefully.
I said 'For non-backward compatible SIOV device of the future, yes, virtio-pci common config (non init registers) should be moved to a vq, located on the member device directly.'
Notice the 'member device directly'.
Not the PF admin vq.

> Config space is control path, DMA is data-path, let's better not mix them, we
> never expect to use config space to transfer data.
> 
And that control path is only for init-time configuration, as the virtio spec correctly states:

"Device configuration space should only be used for initialization-time parameters."

> So we need DMA to transfer data, for example I take advantages of device DMA
> to logging dirty pages, This also applies to in-flight descriptors.
> 
Can you please explain why a virtqueue cannot be used for DMA bulk data transfer, as described in the virtio spec?

"The mechanism for bulk data transport on virtio devices is pretentiously called a virtqueue"


* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-20  9:31                                                           ` Zhu, Lingshan
  2023-10-20  9:41                                                             ` Michael S. Tsirkin
@ 2023-10-20 12:54                                                             ` Parav Pandit
  2023-10-23 10:09                                                               ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-20 12:54 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, October 20, 2023 3:01 PM
> 
> On 10/19/2023 6:33 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Thursday, October 19, 2023 2:48 PM
> >>
> >> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> >>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> >>>>> Oh, really? Quite interesting, do you want to move all config
> >>>>> space fields in VF to admin vq? Have a plan?
> >>>> Not in my plan for spec 1.4 time frame.
> >>>> I do not want to divert the discussion, would like to focus on
> >>>> device
> >> migration phases.
> >>>> Lets please discuss in some other dedicated thread.
> >>> Possibly, if there's a way to send admin commands to vf itself then
> >>> Lingshan will be happy?
> >> still need to prove why admin commands are better than registers.
> > Virtio spec development is not proof based approach. Please stop asking for it.
> >
> > I tried my best to have technical answer in [1].
> > I explained that registers simply do not work for passthrough mode (if
> > this is what you are asking when you are asking prove its better).
> > They can work for non_passthrough mediated mode.
> >
> > A member device may do admin commands using registers. Michael and I are
> discussing presently in the same thread.
> >
> > Since there are multiple things to be done for device migration, dedicated
> register set for each functionality do not scale well, hard to maintain and
> extend.
> > A register holding a command content make sense.
> >
> > Now, with that, if this can be useful only for non_passthrough, I made humble
> request to transport them using AQ, this way, you get all benefits of AQ.
> > And trying to understand, why AQ cannot possible or inferior?
> >
> > If you have commands like suspend/resume device, register or queue
> transport simply don’t work, because it's wrong to bifurcate the device with
> such weird API.
> > If you want to biferacate for mediation software, it probably makes sense to
> operate at each VQ level, config space level. Such are very different commands
> than passthrough.
> > I think vdpa has demonstrated that very well on how to do specific work for
> specific device type. So some of those work can be done using AQ.
> >
> > [1]
> > https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd10336
> > 2dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> We have been through your statement for many times.
> This is not about how many times you repeated, if you think this is true, you
> need to prove that with solid evidence.
> 
I will not respond to this comment anymore.

> 
> For pass-through, I still recommend you to take a reference of current
> virito-pci implementation, it works for pass-through, right?

What do you mean by current virtio-pci implementation?

> For scale, I already told you for many times that they are per-device
> facilities. How can a per-device facility not scale?
Each VF device must implement a new set of on-chip memory-based registers, which demands more power and die area and does not scale efficiently to thousands of VFs.

> vDPA works fine on config space.
> 
> So, if you still insist admin vq is better than config space like in
> other thread you have concluded, you may imply that config space
> interfaces should be re-factored to admin vq.
Whatever was done in the past is done; there is no way to change history.
New non-init-time registers should not be placed in the device-specific config space, as the virtio spec has a clear guideline on this, for good reason.
Reading the device context, reading dirty page addresses, and changing VF device modes are clearly not init-time settings.
Hence, they do not belong in registers.


* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-20 11:11                                                               ` Zhu, Lingshan
  2023-10-20 12:47                                                                 ` Parav Pandit
@ 2023-10-21 15:34                                                                 ` Michael S. Tsirkin
  2023-10-23 10:03                                                                   ` Zhu, Lingshan
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-21 15:34 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Oct 20, 2023 at 07:11:49PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 10/20/2023 5:41 PM, Michael S. Tsirkin wrote:
> > On Fri, Oct 20, 2023 at 05:31:01PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 10/19/2023 6:33 PM, Parav Pandit wrote:
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Thursday, October 19, 2023 2:48 PM
> > > > > 
> > > > > On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> > > > > > > > Oh, really? Quite interesting, do you want to move all config space
> > > > > > > > fields in VF to admin vq? Have a plan?
> > > > > > > Not in my plan for spec 1.4 time frame.
> > > > > > > I do not want to divert the discussion, would like to focus on device
> > > > > migration phases.
> > > > > > > Lets please discuss in some other dedicated thread.
> > > > > > Possibly, if there's a way to send admin commands to vf itself then
> > > > > > Lingshan will be happy?
> > > > > still need to prove why admin commands are better than registers.
> > > > Virtio spec development is not proof based approach. Please stop asking for it.
> > > > 
> > > > I tried my best to have technical answer in [1].
> > > > I explained that registers simply do not work for passthrough mode
> > > > (if this is what you are asking when you are asking prove its better).
> > > > They can work for non_passthrough mediated mode.
> > > > 
> > > > A member device may do admin commands using registers. Michael and I are discussing presently in the same thread.
> > > > 
> > > > Since there are multiple things to be done for device migration, dedicated register set for each functionality do not scale well, hard to maintain and extend.
> > > > A register holding a command content make sense.
> > > > 
> > > > Now, with that, if this can be useful only for non_passthrough, I made humble request to transport them using AQ, this way, you get all benefits of AQ.
> > > > And trying to understand, why AQ cannot possible or inferior?
> > > > 
> > > > If you have commands like suspend/resume device, register or queue transport simply don’t work, because it's wrong to bifurcate the device with such weird API.
> > > > If you want to biferacate for mediation software, it probably makes sense to operate at each VQ level, config space level. Such are very different commands than passthrough.
> > > > I think vdpa has demonstrated that very well on how to do specific work for specific device type. So some of those work can be done using AQ.
> > > > 
> > > > [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> > > We have been through your statement for many times.
> > > This is not about how many times you repeated, if you think this is true,
> > > you need to prove that with solid evidence.
> > > 
> > > 
> > > For pass-through, I still recommend you to take a reference of current
> > > virito-pci implementation, it works for pass-through, right?
> > Current migration implementation in e.g. QEMU? It does but it
> > traps data path accesses. That, I think we can agree,
> > should not be the only option to migrate.
> OK, I am glad we agree that config space work for pass-through,
> hope we don't need to discuss this anymore.
> > 
> > > For scale, I already told you for many times that they are per-device
> > > facilities. How can a per-device facility not scale?
> > > vDPA works fine on config space.
> > > 
> > > So, if you still insist admin vq is better than config space like in other
> > > thread you have concluded, you may imply that config space interfaces should
> > > be re-factored to admin vq.
> > There are good arguments that yes, virtio needs a transport for config space
> > that is DMA based as opposed to memory mapped based.  This is one of the
> > things all vendors seem to prefer in IDPF so virtio should have the option.
> Do you really want to refactor virtio-pci common config fields to PF's admin
> vq?
> E.g, do you really want to move queue_enable in virtio-pci common config
> fields to PF's admin vq?

No, I think we need a transport with as small a number of memory-mapped
registers as possible that passes admin commands through the VF itself.
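
As a rough illustration of what "as few memory-mapped registers as
possible" could look like, the C sketch below assumes a hypothetical
three-field register window on the VF: the driver places the command in
DMA-able memory, writes its address and length, and rings a doorbell.
The names vf_admin_cmd_regs and vf_admin_submit, and the layout itself,
are invented for this example and do not come from the virtio spec or
from this series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical register window: the only memory-mapped state the VF
 * would expose for admin commands. Not a virtio spec structure.
 */
struct vf_admin_cmd_regs {
    uint64_t cmd_addr;  /* DMA address of the command buffer */
    uint32_t cmd_len;   /* length of the command buffer in bytes */
    uint32_t doorbell;  /* written last to tell the device to fetch it */
};

/* The window stays this size however many command types exist. */
static_assert(sizeof(struct vf_admin_cmd_regs) == 16,
              "register window should stay small");

/*
 * Publish one command. The command body is fetched by the device via
 * DMA; only its address and length go through registers. A real driver
 * would also need a write barrier before ringing the doorbell.
 */
static void vf_admin_submit(volatile struct vf_admin_cmd_regs *regs,
                            uint64_t cmd_dma_addr, uint32_t cmd_len)
{
    regs->cmd_addr = cmd_dma_addr;
    regs->cmd_len = cmd_len;
    regs->doorbell = 1;
}
```

In this sketch, adding a new command only changes the contents of the
buffer in memory; the register footprint per VF stays at 16 bytes.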

> Config space is control path, DMA is data-path, let's better not mix them,
> we never expect to use config space to transfer data.
> 
> So we need DMA to transfer data, for example I take advantages of device DMA
> to logging dirty pages, This also applies to in-flight descriptors.

As long as you do, I personally see little benefit in retrieving parts of
the state with memory-mapped accesses.

> And we are implementing virito live migration, not only for PCI.
> 
> So both me and Jason keep repeating: We are implementing basic facilities,
> and the implementation is transport specific.

But the register-based facilities you proposed are extremely limited and
seem to only work for migration. For example, they seem mostly useless for
debugging, because retrieving state is rather complex and would
interfere with the normal operation of the device.


> We have proposed to build admin vq based on our register solution, this can
> somehow even help tp resolve the nested issue.
> 
> But I see the proposed has been rejected.
> 
> I still believe the goal is to build a best spec, not "just can work" with
> limitations.
> 
> 
> 
> > 
> > 



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-18 10:22                                               ` Parav Pandit
  2023-10-18 10:47                                                 ` Michael S. Tsirkin
@ 2023-10-23  3:44                                                 ` Jason Wang
  2023-10-23  4:42                                                   ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-23  3:44 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Oct 18, 2023 at 6:23 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Wednesday, October 18, 2023 3:26 PM
>
> > For completeness, and to shorten the thread, can you please list known
> > issues/use cases that are addressed by the status bit interface and how you plan
> > for them to be addressed?
>
> I will avoid listing known issues for a moment for status bit in this email.
>
> Status bit interface helps in following good ways.
> 1. suspend/resume the device fully by the guest by negotiating the new feature.
> This can be useful in the guest-controlled PM flows of suspend/resume.
> I still think for this, only feature bit is necessary, and device_status modification is not needed.

Which feature bit did you mean here?

> D0->D3 and D3->D0 transition of the pci can suspend and resume the device which can preserve the last device_status value before entering D3.

It's not only about the device status. I will not repeat the question
I asked in another thread.

What's more, if you really want to suspend/freeze at the PCI level and
deal with PCI-specific issues like P2P, you should really try to
leverage or invent a PCI mechanism instead of trying to carry such
semantics via virtio-specific mechanisms like the adminq. Solving
transport-specific problems at the virtio level is a layer violation.

> (Like preserving all rest of the fields of common and other device config).
> This is orthogonal and needed regardless of device migration.
>
> 2. If one does not want to passthrough a member device, but build a mediation-based device on top of existing virtio device,
> It can be useful with mediating software.
> Here the mediating software has ample duplicated knowledge of what the member device already has.

This is what hypervisors already do, not only for virtio but for the
CPU and MMU as well.

> This can fulfil the nested requirement differently provided a platform support it.
> (PASID limitation will be practical blocker here).

I don't think PASID is a blocker. It is only a blocker if you want to
do passthrough.

>
> How to I plan to address above two?
> a. #1 to be addressed by having the _F_PM bit, when the bit is negotiated PCI PM drives the state.

We can't duplicate every transport specific feature in virtio. This is
a layer violation again. We should reuse the PCI facility here.

> This will work orthogonal to VMM side migration and will co-exist with VMM based device migration.
>
> b. nested use case:
> L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> L1 guest to enable SR-IOV and mapping the VF to L2 guest.

Let me ask again here: how can you migrate L2 using the L1 "emulated"
PF? Emulation?

Thanks




> Consulting industry ecosystem to support nested outside of virtio.



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-19  5:31                                         ` Parav Pandit
  2023-10-19  6:35                                           ` Michael S. Tsirkin
@ 2023-10-23  3:45                                           ` Jason Wang
  2023-10-23  4:42                                             ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-23  3:45 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Oct 19, 2023 at 1:32 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Jason Wang
> > Sent: Thursday, October 19, 2023 10:15 AM
> >
> > > > Again, if you don't want to talk about transport virtqueue, that's
> > > > fine. But let's leave the scalability issue aside as well.
> > > >
> > > Registers are related for functionality and scale.
> > >
> > > Lets first agree on use case before the design, that I asked above.
> > >
> > > I will wait to respond to any other emails until we agree on use case
> > requirements.
> >
> > There are more than just me who want you to define "passthrough" first where
> > you refuse to respond.
> >
> Totally disagree.
> In the previous email itself, I wrote what passthrough is.
> So let's try yet one more time.
> Either you can re-read last email or for better read below and see if it is understood or not.
>
> > How could we make any agreement without an accurate the definition of
> > "passthrough" who is a key to understand each other?
>
> I replied few times in past emails but since those email threads are so long, it is easy to miss out.
>
> Passthrough definition:
> a. virtio member device mapped to the guest vm

I really think we need to be accurate here. For example, what does
"map" mean here?

> b. only pci config space and msix of a member device is intercepted by hypervisor.

What are the criteria for choosing whether a cap/bar is trapped or not?
For example, there are surely a lot of other things that need to be
virtualized besides MSI-X.

> c. virtio config space, virtio cvqs, data vqs of a member device is directly accessed by the guest vm without intercepted by the hypervisor.
>
> (Why b?, no grand reason, it is how the hypervisors are working where to integrate the virtio member device to).

What you state here is more about the method to support a use case, not
the use case itself. My understanding is that "live migration" is a valid
use case for sure. And "passthrough" is probably one way of achieving
live migration, but this is what you need to define and justify in this
series.

What's more, I'm not convinced by a large series that is designed only
for one specific method of building a hypervisor. If it can serve
multiple different methods of building hypervisors, I believe it would
be more beneficial for virtio.

Thanks

>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-20  9:41                                                             ` Michael S. Tsirkin
  2023-10-20 11:11                                                               ` Zhu, Lingshan
@ 2023-10-23  3:53                                                               ` Jason Wang
  2023-10-23 11:33                                                                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-23  3:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Oct 20, 2023 at 5:41 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Oct 20, 2023 at 05:31:01PM +0800, Zhu, Lingshan wrote:
> >
> >
> > On 10/19/2023 6:33 PM, Parav Pandit wrote:
> > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > Sent: Thursday, October 19, 2023 2:48 PM
> > > >
> > > > On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> > > > > On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> > > > > > > Oh, really? Quite interesting, do you want to move all config space
> > > > > > > fields in VF to admin vq? Have a plan?
> > > > > > Not in my plan for spec 1.4 time frame.
> > > > > > I do not want to divert the discussion, would like to focus on device
> > > > migration phases.
> > > > > > Lets please discuss in some other dedicated thread.
> > > > > Possibly, if there's a way to send admin commands to vf itself then
> > > > > Lingshan will be happy?
> > > > still need to prove why admin commands are better than registers.
> > > Virtio spec development is not proof based approach. Please stop asking for it.
> > >
> > > I tried my best to have technical answer in [1].
> > > I explained that registers simply do not work for passthrough mode
> > > (if this is what you are asking when you are asking prove its better).
> > > They can work for non_passthrough mediated mode.
> > >
> > > A member device may do admin commands using registers. Michael and I are discussing presently in the same thread.
> > >
> > > Since there are multiple things to be done for device migration, dedicated register set for each functionality do not scale well, hard to maintain and extend.
> > > A register holding a command content make sense.
> > >
> > > Now, with that, if this can be useful only for non_passthrough, I made humble request to transport them using AQ, this way, you get all benefits of AQ.
> > > And trying to understand, why AQ cannot possible or inferior?
> > >
> > > If you have commands like suspend/resume device, register or queue transport simply don’t work, because it's wrong to bifurcate the device with such weird API.
> > > If you want to biferacate for mediation software, it probably makes sense to operate at each VQ level, config space level. Such are very different commands than passthrough.
> > > I think vdpa has demonstrated that very well on how to do specific work for specific device type. So some of those work can be done using AQ.
> > >
> > > [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> > We have been through your statement for many times.
> > This is not about how many times you repeated, if you think this is true,
> > you need to prove that with solid evidence.
> >
> >
> > For pass-through, I still recommend you to take a reference of current
> > virito-pci implementation, it works for pass-through, right?
>
> Current migration implementation in e.g. QEMU? It does but it
> traps data path accesses. That, I think we can agree,
> should not be the only option to migrate.
>
> > For scale, I already told you for many times that they are per-device
> > facilities. How can a per-device facility not scale?
> > vDPA works fine on config space.
> >
> > So, if you still insist admin vq is better than config space like in other
> > thread you have concluded, you may imply that config space interfaces should
> > be re-factored to admin vq.
>
> There are good arguments that yes, virtio needs a transport for config space
> that is DMA based as opposed to memory mapped based.  This is one of the
> things all vendors seem to prefer in IDPF so virtio should have the option.

Then is it the transport vq or the transport-over-adminq proposal?

Thanks

>
>
> --
> MST
>
>



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-23  3:45                                           ` Jason Wang
@ 2023-10-23  4:42                                             ` Parav Pandit
  2023-10-24  4:46                                               ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-23  4:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, October 23, 2023 9:15 AM
> 
> On Thu, Oct 19, 2023 at 1:32 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: virtio-comment@lists.oasis-open.org
> > > <virtio-comment@lists.oasis- open.org> On Behalf Of Jason Wang
> > > Sent: Thursday, October 19, 2023 10:15 AM
> > >
> > > > > Again, if you don't want to talk about transport virtqueue,
> > > > > that's fine. But let's leave the scalability issue aside as well.
> > > > >
> > > > Registers are related for functionality and scale.
> > > >
> > > > Lets first agree on use case before the design, that I asked above.
> > > >
> > > > I will wait to respond to any other emails until we agree on use
> > > > case
> > > requirements.
> > >
> > > There are more than just me who want you to define "passthrough"
> > > first where you refuse to respond.
> > >
> > Totally disagree.
> > In the previous email itself, I wrote what passthrough is.
> > So let's try yet one more time.
> > Either you can re-read last email or for better read below and see if it is
> understood or not.
> >
> > > How could we make any agreement without an accurate the definition
> > > of "passthrough" who is a key to understand each other?
> >
> > I replied few times in past emails but since those email threads are so long, it is
> easy to miss out.
> >
> > Passthrough definition:
> > a. virtio member device mapped to the guest vm
> 
> I really think we need to be accurate here. For example, what does "map" mean
> here?
>
"Not trapped by the hypervisor" is better wording than "mapped".
 
> > b. only pci config space and msix of a member device is intercepted by
> hypervisor.
> 
> What's the criteria for choosing a cap/bar to be trapped or not? For example,
> there're a lot of other things that need to be virtualized besides MSI-X for sure.
> 
For passthrough, which are those?

> > c. virtio config space, virtio cvqs, data vqs of a member device is directly
> accessed by the guest vm without intercepted by the hypervisor.
> >
> > (Why b?, no grand reason, it is how the hypervisors are working where to
> integrate the virtio member device to).
> 
> What you state here is more about the method to support a use case not the
> use case itself. My understanding is "live migration" is a valid use case for sure.
> And "passthough" is probably one way for achieving live migration but this is
> what you need to define and justify in this series.
> 
I largely captured that in v2, in the device context and in the theory of operation.
I will keep it short, as written in v2. If something is unclear, I am happy to extend it (without writing a book).
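
The passthrough definition (a)-(c) quoted above can be read as a simple trap policy. The following is a minimal Python sketch of that reading only; the `Access` categories and the `is_trapped` helper are hypothetical names chosen for illustration and are not defined by this series or by any hypervisor.

```python
from enum import Enum


class Access(Enum):
    """Kinds of member-device accesses named in the passthrough definition."""
    PCI_CONFIG_SPACE = "pci config space"
    MSIX = "msix"
    VIRTIO_CONFIG_SPACE = "virtio config space"
    VIRTIO_CVQ = "virtio cvq"
    VIRTIO_DATA_VQ = "virtio data vq"


# Item (b): only PCI config space and MSI-X are intercepted by the hypervisor.
TRAPPED = frozenset({Access.PCI_CONFIG_SPACE, Access.MSIX})


def is_trapped(access: Access) -> bool:
    """Return True if the hypervisor intercepts this kind of access."""
    return access in TRAPPED


def untrapped() -> list:
    """Item (c): everything the guest reaches without hypervisor mediation."""
    return [a for a in Access if not is_trapped(a)]
```

Under this reading, any migration facility has to work without observing the three untrapped categories, which is the constraint the thread is arguing about.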

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23  3:44                                                 ` Jason Wang
@ 2023-10-23  4:42                                                   ` Parav Pandit
  2023-10-24  4:56                                                     ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-23  4:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, October 23, 2023 9:15 AM
> 
> On Wed, Oct 18, 2023 at 6:23 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Wednesday, October 18, 2023 3:26 PM
> >
> > > For completeness, and to shorten the thread, can you please list
> > > known issues/use cases that are addressed by the status bit
> > > interface and how you plan for them to be addressed?
> >
> > I will avoid listing known issues for a moment for status bit in this email.
> >
> > Status bit interface helps in following good ways.
> > 1. suspend/resume the device fully by the guest by negotiating the new
> feature.
> > This can be useful in the guest-controlled PM flows of suspend/resume.
> > I still think for this, only feature bit is necessary, and device_status
> modification is not needed.
> 
> Which feature bit did you mean here?
> 
A new feature bit to indicate to the guest that the device supports suspend and resume; hence there is no need to reset the device and destroy resources, as is done today.

> > D0->D3 and D3->D0 transition of the pci can suspend and resume the device
> which can preserve the last device_status value before entering D3.
> 
> It's not only about the device status. I would not repeat the question I've asked
> in another thread.
> 
> What's more, if you really want to suspend/freeze at PCI level and deal with PCI
> specific issues like P2P.  You should really try to leverage or invent a PCI
> mechanism instead of trying to carry such semantics via a virtio specific stuff
> like adminq. Solving transport specific problems at the virtio level is a layer
> violation.
>
The PCI spec has already defined what it needs to. The PCI-SIG has already concluded that the SR-PCIM interface is outside the PCI spec.
And no, there is no layer violation.

Any non-PCI member device can always implement the necessary STOP mode as a no-op.

And all of that talk makes sense once someone creates an MMIO-based member device; until that point these are just objections...
 
> > (Like preserving all rest of the fields of common and other device config).
> > This is orthogonal and needed regardless of device migration.
> >
> > 2. If one does not want to passthrough a member device, but build a
> > mediation-based device on top of existing virtio device, It can be useful with
> mediating software.
> > Here the mediating software has ample duplicated knowledge of what the
> member device already has.
> 
> It is the way the hypervisors are doing for not only virtio but also for CPU and
> MMU as well.
> 
Not really; vcpus, VMCS and more are part of the hardware support.
Two-level nested page tables are hardware support.
Anything beyond two-level nesting likely involves the hypervisor.

> > This can fulfil the nested requirement differently provided a platform support
> it.
> > (PASID limitation will be practical blocker here).
> 
> I don't think PASID is a blocker. It is only a blocker if you want to do passthrough.
> 
Even without passthrough, one needs to steer the hypervisor DMA to non-guest memory.
And the guest driver must not be able to attack (read/write) that memory.
I don't see how one can do this without PASID, as all DMAs are tagged using only the RID.

> >
> > How to I plan to address above two?
> > a. #1 to be addressed by having the _F_PM bit, when the bit is negotiated PCI
> PM drives the state.
> 
> We can't duplicate every transport specific feature in virtio. This is a layer
> violation again. We should reuse the PCI facility here.
> 
It is reused, with the feature bit indicating that the device supports suspend/resume.
If the PCI PM bits had been used this way from day one, the feature bit would not be required.
But that was not the case.
So the guest driver does not know whether using the PCI PM bits is enough, i.e. whether suspend/resume by the guest will work or not.
Hence the feature bit.
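
A minimal sketch of the reasoning above, in Python: the guest driver picks its suspend path from the negotiated feature set. The bit position used for `_F_PM` is an assumption (the thread assigns no number to it), and `guest_suspend_action` is a hypothetical helper, not a proposed interface.

```python
# Assumption: placeholder bit position; the thread does not assign one to _F_PM.
HYPOTHETICAL_F_PM = 1 << 62


def negotiate(device_features: int, driver_features: int) -> int:
    """Negotiated features are those offered by the device and accepted by the driver."""
    return device_features & driver_features


def guest_suspend_action(negotiated: int) -> str:
    """Choose how the guest driver handles a suspend request."""
    if negotiated & HYPOTHETICAL_F_PM:
        # The device promises that a PCI PM D0->D3->D0 cycle preserves its state.
        return "pci-pm"
    # Without the bit the driver cannot know that, so it resets the device
    # and re-creates its resources on resume, as is done today.
    return "reset"
```

A device that does not offer the bit, or a driver that does not accept it, falls back to the existing reset-based path, so old and new guests can coexist.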

> > This will work orthogonal to VMM side migration and will co-exist with VMM
> based device migration.
> >
> > b. nested use case:
> > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> 
> Let me ask it again here, how can you migrate L2 using L1 "emulated"
> PF? Emulation?
>
Emulation is one way, as most nested platform components do.
Maybe the L1 device, which is VF + SR-IOV capability, is an emulated PF. This PF can run the exact same commands as the L0-level PF.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-20 12:47                                                                 ` Parav Pandit
@ 2023-10-23  9:48                                                                   ` Zhu, Lingshan
  2023-10-23 10:01                                                                     ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-23  9:48 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/20/2023 8:47 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, October 20, 2023 4:42 PM
>>
>> On 10/20/2023 5:41 PM, Michael S. Tsirkin wrote:
>>> On Fri, Oct 20, 2023 at 05:31:01PM +0800, Zhu, Lingshan wrote:
>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
>>>>>>
>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>>>>>>>> Oh, really? Quite interesting, do you want to move all config
>>>>>>>>> space fields in VF to admin vq? Have a plan?
>>>>>>>> Not in my plan for spec 1.4 time frame.
>>>>>>>> I do not want to divert the discussion, would like to focus on
>>>>>>>> device
>>>>>> migration phases.
>>>>>>>> Lets please discuss in some other dedicated thread.
>>>>>>> Possibly, if there's a way to send admin commands to vf itself
>>>>>>> then Lingshan will be happy?
>>>>>> still need to prove why admin commands are better than registers.
>>>>> Virtio spec development is not proof based approach. Please stop asking for
>> it.
>>>>> I tried my best to have technical answer in [1].
>>>>> I explained that registers simply do not work for passthrough mode
>>>>> (if this is what you are asking when you are asking prove its better).
>>>>> They can work for non_passthrough mediated mode.
>>>>>
>>>>> A member device may do admin commands using registers. Michael and I
>> are discussing presently in the same thread.
>>>>> Since there are multiple things to be done for device migration, dedicated
>> register set for each functionality do not scale well, hard to maintain and
>> extend.
>>>>> A register holding a command content make sense.
>>>>>
>>>>> Now, with that, if this can be useful only for non_passthrough, I made
>> humble request to transport them using AQ, this way, you get all benefits of AQ.
>>>>> And trying to understand, why AQ cannot possible or inferior?
>>>>>
>>>>> If you have commands like suspend/resume device, register or queue
>> transport simply don’t work, because it's wrong to bifurcate the device with
>> such weird API.
>>>>> If you want to biferacate for mediation software, it probably makes sense to
>> operate at each VQ level, config space level. Such are very different commands
>> than passthrough.
>>>>> I think vdpa has demonstrated that very well on how to do specific work for
>> specific device type. So some of those work can be done using AQ.
>>>>> [1]
>>>>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103
>>>>> 362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
>>>> We have been through your statement for many times.
>>>> This is not about how many times you repeated, if you think this is
>>>> true, you need to prove that with solid evidence.
>>>>
>>>>
>>>> For pass-through, I still recommend you to take a reference of
>>>> current virito-pci implementation, it works for pass-through, right?
>>> Current migration implementation in e.g. QEMU? It does but it traps
>>> data path accesses. That, I think we can agree, should not be the only
>>> option to migrate.
>> OK, I am glad we agree that config space work for pass-through, hope we don't
>> need to discuss this anymore.
>>>> For scale, I already told you for many times that they are per-device
>>>> facilities. How can a per-device facility not scale?
>>>> vDPA works fine on config space.
>>>>
>>>> So, if you still insist admin vq is better than config space like in
>>>> other thread you have concluded, you may imply that config space
>>>> interfaces should be re-factored to admin vq.
>>> There are good arguments that yes, virtio needs a transport for config
>>> space that is DMA based as opposed to memory mapped based.  This is
>>> one of the things all vendors seem to prefer in IDPF so virtio should have the
>> option.
>> Do you really want to refactor virtio-pci common config fields to PF's admin vq?
>> E.g, do you really want to move queue_enable in virtio-pci common config
>> fields to PF's admin vq?
>>
> No. Please read the response carefully.
> I said 'For non-backward compatible SIOV device of the future, yes, virtio-pci common config (non init registers) should be moved to a vq, located on the member device directly.'
> Notice the 'member device directly'.
> Not the PF admin vq.
I think this is a question to Michael and he answered.

We are talking about PCI, not SIOV; for SIOV we need a transport vq.

Here again, we are introducing basic facilities for live migration, and 
the implementation is transport-specific.
>
>> Config space is control path, DMA is data-path, let's better not mix them, we
>> never expect to use config space to transfer data.
>>
> And that control path is only for the init time configuration as correctly listed in the virtio spec as,
>
> " Device configuration space should only be used for initialization-time parameters.".
Don't you know that a new field, reset_vq, was introduced to the virtio common cfg?
That is not only for initialization, right?

And your citation is from "Appendix B. Creating New Device Types". Are we
creating a new device type?
>
>> So we need DMA to transfer data, for example I take advantages of device DMA
>> to logging dirty pages, This also applies to in-flight descriptors.
>>
> Can you please explain via virtqueue cannot be used for DMA bulk data transfer as listed in virtio spec.
>
> " The mechanism for bulk data transport on virtio devices is pretentiously called a virtqueue"
What is your point? A vq can do DMA, so what?



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23  9:48                                                                   ` Zhu, Lingshan
@ 2023-10-23 10:01                                                                     ` Parav Pandit
  2023-10-23 10:14                                                                       ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-23 10:01 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, October 23, 2023 3:18 PM


> > No. Please read the response carefully.
> > I said 'For non-backward compatible SIOV device of the future, yes, virtio-pci
> common config (non init registers) should be moved to a vq, located on the
> member device directly.'
> > Notice the 'member device directly'.
> > Not the PF admin vq.
> I think this is a question to Michael and he answered.
> 
> We are talking about PCI, not SIOV, for SIOV we need transport vq.
>
For future devices and future functionality, the hypervisor must not get involved in looking at the device configuration.
Hence, as long as the transport vq is located on the SIOV device itself for non-backward-compatible items, it is fine for transporting SIOV configuration.
For backward compatibility, one will be able to use the aq of the owner device. There is no need to create a new transport VQ.
To create another transport vq, one needs to clarify which limitations of the aq the transport vq can overcome, and why the aq cannot be extended to overcome them.
 
> Here again, we are introducing basic facilities for live migration, and the
> implementation is transport-specific.
Not a relevant comment.

> >
> >> Config space is control path, DMA is data-path, let's better not mix
> >> them, we never expect to use config space to transfer data.
> >>
> > And that control path is only for the init time configuration as
> > correctly listed in the virtio spec as,
> >
> > " Device configuration space should only be used for initialization-time
> parameters.".
> don't you know new field reset_vq is introduced to virtio common cfg?
> This is not only for initialization, right?
Right. It was unfortunate, and it was also a last-moment entry, in which we had fixed the reset register polarity.
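
For reference, a minimal Python sketch of where the per-queue fields under discussion sit in the PCI common configuration structure. It assumes the `struct virtio_pci_common_cfg` layout of virtio 1.2, where the field the thread calls reset_vq appears as `queue_reset`; the `offsets` helper is a hypothetical name, and the layout should be checked against the spec text.

```python
import struct

# (field name, struct code) in declaration order; all fields are little-endian.
FIELDS = [
    ("device_feature_select", "I"),
    ("device_feature", "I"),
    ("driver_feature_select", "I"),
    ("driver_feature", "I"),
    ("config_msix_vector", "H"),
    ("num_queues", "H"),
    ("device_status", "B"),
    ("config_generation", "B"),
    ("queue_select", "H"),
    ("queue_size", "H"),
    ("queue_msix_vector", "H"),
    ("queue_enable", "H"),
    ("queue_notify_off", "H"),
    ("queue_desc", "Q"),
    ("queue_driver", "Q"),
    ("queue_device", "Q"),
    ("queue_notify_data", "H"),
    ("queue_reset", "H"),
]


def offsets(fields):
    """Return ({name: byte offset}, total size) for an unpadded little-endian layout."""
    result = {}
    fmt = "<"  # "<" means little-endian with no implicit padding
    for name, code in fields:
        result[name] = struct.calcsize(fmt)
        fmt += code
    return result, struct.calcsize(fmt)


OFFSETS, SIZE = offsets(FIELDS)
print(OFFSETS["queue_enable"], OFFSETS["queue_reset"], SIZE)  # 28 58 60
```

Both `queue_enable` and `queue_reset` are memory-mapped fields that act on the queue chosen through `queue_select`, which is the register-based style being contrasted with queue-based commands.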

> 
> and your citation is from Appendix B. Creating New Device Types, are we
> creating a new device type?
That is guidance for new device creation on "how to use config space?"
It applies equally to existing devices, so that their config space does not grow.
The section is equally helpful for creators of new devices and for those extending devices, like you and me, to understand what not to put in config space.

> >
> >> So we need DMA to transfer data, for example I take advantages of
> >> device DMA to logging dirty pages, This also applies to in-flight descriptors.
> >>
> > Can you please explain via virtqueue cannot be used for DMA bulk data
> transfer as listed in virtio spec.
> >
> > " The mechanism for bulk data transport on virtio devices is pretentiously
> called a virtqueue"
> what is your point? vq can do DMA, so what?
I am asking:
if there is an AQ on the member device, can you use it? If not, what are the technical reasons not to use it?
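
To make "using the AQ" concrete, here is a minimal Python sketch of packing the device-readable header of an administration command. It assumes the `struct virtio_admin_cmd` layout of virtio 1.3 (le16 opcode, le16 group_type, 12 reserved bytes, le64 group_member_id) and group type 1 for SR-IOV; the opcode value is a placeholder, because the opcodes of this series are not shown here.

```python
import struct

ADMIN_CMD_HDR = "<HH12xQ"      # opcode, group_type, reserved1[12], group_member_id
GROUP_TYPE_SRIOV = 0x1         # SR-IOV group type, per the assumed virtio 1.3 layout
HYPOTHETICAL_OPCODE = 0xFFFF   # placeholder, NOT an opcode assigned by the series


def pack_admin_cmd(opcode: int, group_type: int, member_id: int, data: bytes = b"") -> bytes:
    """Build the device-readable part: fixed header followed by command-specific data."""
    return struct.pack(ADMIN_CMD_HDR, opcode, group_type, member_id) + data


def unpack_admin_cmd(buf: bytes):
    """Split a device-readable buffer back into (opcode, group_type, member_id, data)."""
    size = struct.calcsize(ADMIN_CMD_HDR)
    opcode, group_type, member_id = struct.unpack(ADMIN_CMD_HDR, buf[:size])
    return opcode, group_type, member_id, buf[size:]


cmd = pack_admin_cmd(HYPOTHETICAL_OPCODE, GROUP_TYPE_SRIOV, member_id=1)
```

The same header can carry any command, which is the extensibility argument made for the AQ; whether the queue lives on the owner or on the member device is the open question in this exchange.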

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-21 15:34                                                                 ` Michael S. Tsirkin
@ 2023-10-23 10:03                                                                   ` Zhu, Lingshan
  2023-10-23 11:32                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-23 10:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/21/2023 11:34 PM, Michael S. Tsirkin wrote:
> On Fri, Oct 20, 2023 at 07:11:49PM +0800, Zhu, Lingshan wrote:
>>
>> On 10/20/2023 5:41 PM, Michael S. Tsirkin wrote:
>>> On Fri, Oct 20, 2023 at 05:31:01PM +0800, Zhu, Lingshan wrote:
>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
>>>>>>
>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>>>>>>>> Oh, really? Quite interesting, do you want to move all config space
>>>>>>>>> fields in VF to admin vq? Have a plan?
>>>>>>>> Not in my plan for spec 1.4 time frame.
>>>>>>>> I do not want to divert the discussion, would like to focus on device
>>>>>> migration phases.
>>>>>>>> Lets please discuss in some other dedicated thread.
>>>>>>> Possibly, if there's a way to send admin commands to vf itself then
>>>>>>> Lingshan will be happy?
>>>>>> still need to prove why admin commands are better than registers.
>>>>> Virtio spec development is not proof based approach. Please stop asking for it.
>>>>>
>>>>> I tried my best to have technical answer in [1].
>>>>> I explained that registers simply do not work for passthrough mode
>>>>> (if this is what you are asking when you are asking prove its better).
>>>>> They can work for non_passthrough mediated mode.
>>>>>
>>>>> A member device may do admin commands using registers. Michael and I are discussing presently in the same thread.
>>>>>
>>>>> Since there are multiple things to be done for device migration, dedicated register set for each functionality do not scale well, hard to maintain and extend.
>>>>> A register holding a command content make sense.
>>>>>
>>>>> Now, with that, if this can be useful only for non_passthrough, I made humble request to transport them using AQ, this way, you get all benefits of AQ.
>>>>> And trying to understand, why AQ cannot possible or inferior?
>>>>>
>>>>> If you have commands like suspend/resume device, register or queue transport simply don’t work, because it's wrong to bifurcate the device with such weird API.
>>>>> If you want to biferacate for mediation software, it probably makes sense to operate at each VQ level, config space level. Such are very different commands than passthrough.
>>>>> I think vdpa has demonstrated that very well on how to do specific work for specific device type. So some of those work can be done using AQ.
>>>>>
>>>>> [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
>>>> We have been through your statement for many times.
>>>> This is not about how many times you repeated, if you think this is true,
>>>> you need to prove that with solid evidence.
>>>>
>>>>
>>>> For pass-through, I still recommend you to take a reference of current
>>>> virito-pci implementation, it works for pass-through, right?
>>> Current migration implementation in e.g. QEMU? It does but it
>>> traps data path accesses. That, I think we can agree,
>>> should not be the only option to migrate.
>> OK, I am glad we agree that config space work for pass-through,
>> hope we don't need to discuss this anymore.
>>>> For scale, I already told you for many times that they are per-device
>>>> facilities. How can a per-device facility not scale?
>>>> vDPA works fine on config space.
>>>>
>>>> So, if you still insist admin vq is better than config space like in other
>>>> thread you have concluded, you may imply that config space interfaces should
>>>> be re-factored to admin vq.
>>> There are good arguments that yes, virtio needs a transport for config space
>>> that is DMA based as opposed to memory mapped based.  This is one of the
>>> things all vendors seem to prefer in IDPF so virtio should have the option.
>> Do you really want to refactor virtio-pci common config fields to PF's admin
>> vq?
>> E.g, do you really want to move queue_enable in virtio-pci common config
>> fields to PF's admin vq?
> No, I think we need a transport with as small # of memory mapped registers as
> possible that passes admin commands through the VF itself.
Then do you believe admin commands are better than registers in the control
path? This is identical to the question above:
do you want to replace the current virtio common cfg with admin commands? Do
you want to use admin commands to process queue_enable
rather than the queue_enable register?

Config space, MMIO and registers have worked for years; what is wrong with them?
>
>> Config space is control path, DMA is data-path, let's better not mix them,
>> we never expect to use config space to transfer data.
>>
>> So we need DMA to transfer data, for example I take advantages of device DMA
>> to logging dirty pages, This also applies to in-flight descriptors.
> As long as you do, I personally see little benefit to retrieve parts of
> state with memory mapped accesses.
Registers are only for control, and I personally believe a single register is
much better than processing admin commands: more lightweight, more reliable,
and proven over years.

Config space interfaces are fundamental for virtio-pci.
>
>> And we are implementing virtio live migration, not only for PCI.
>>
>> So both me and Jason keep repeating: We are implementing basic facilities,
>> and the implementation is transport specific.
> But the register based facilities you proposed are extremely limited and
> seem to only work for migration. For example, it seems mostly useless for
> debugging because retrieving state is rather complex and would
> interfere with normal working of the device.
If you want to prove that register-based control interfaces are extremely
limited compared with admin vq or admin cmds,
you are also proving that config space registers are extremely limited
compared with admin vq.

So the question still stands: do you want to replace the current virtio-pci
common cfg with admin vq or admin cmds?

And debug what? If you want to introduce more functionality, we should
discuss it case by case.

If it is debugging vq state, that is as easy as reading queue_size. I don't
see the limitation, as queue_size has worked for years.

I still believe our goal is to do our best, with our capabilities, to
build the best virtio spec we can. No other goals.

Thanks
Zhu Lingshan
>
>
>> We have proposed to build admin vq based on our register solution, this can
>> somehow even help to resolve the nested issue.
>>
>> But I see the proposed has been rejected.
>>
>> I still believe the goal is to build a best spec, not "just can work" with
>> limitations.
>>
>>
>>
>>>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-20 12:54                                                             ` Parav Pandit
@ 2023-10-23 10:09                                                               ` Zhu, Lingshan
  2023-10-23 10:14                                                                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-23 10:09 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/20/2023 8:54 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, October 20, 2023 3:01 PM
>>
>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Thursday, October 19, 2023 2:48 PM
>>>>
>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>>>>>> Oh, really? Quite interesting, do you want to move all config
>>>>>>> space fields in VF to admin vq? Have a plan?
>>>>>> Not in my plan for spec 1.4 time frame.
>>>>>> I do not want to divert the discussion, would like to focus on device
>>>>>> migration phases.
>>>>>> Lets please discuss in some other dedicated thread.
>>>>> Possibly, if there's a way to send admin commands to vf itself then
>>>>> Lingshan will be happy?
>>>> still need to prove why admin commands are better than registers.
>>> Virtio spec development is not proof based approach. Please stop asking for it.
>>>
>>> I tried my best to have technical answer in [1].
>>> I explained that registers simply do not work for passthrough mode (if
>>> this is what you are asking when you are asking prove its better).
>>> They can work for non_passthrough mediated mode.
>>>
>>> A member device may do admin commands using registers. Michael and I are
>>> discussing presently in the same thread.
>>> Since there are multiple things to be done for device migration, dedicated
>>> register set for each functionality do not scale well, hard to maintain and
>>> extend.
>>> A register holding a command content make sense.
>>>
>>> Now, with that, if this can be useful only for non_passthrough, I made humble
>>> request to transport them using AQ, this way, you get all benefits of AQ.
>>> And trying to understand, why AQ cannot possible or inferior?
>>>
>>> If you have commands like suspend/resume device, register or queue
>>> transport simply don’t work, because it's wrong to bifurcate the device with
>>> such weird API.
>>> If you want to bifurcate for mediation software, it probably makes sense to
>>> operate at each VQ level, config space level. Such are very different commands
>>> than passthrough.
>>> I think vdpa has demonstrated that very well on how to do specific work for
>>> specific device type. So some of those work can be done using AQ.
>>> [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
>> We have been through your statement for many times.
>> This is not about how many times you repeated, if you think this is true, you
>> need to prove that with solid evidence.
>>
> I will not respond to this comment anymore.
Ok if you choose not to respond.
>
>> For pass-through, I still recommend you to take a reference of current
>> virtio-pci implementation, it works for pass-through, right?
> What do you mean by current virtio-pci implementation?
The current virtio-pci works for pass-through.
>
>> For scale, I already told you for many times that they are per-device
>> facilities. How can a per-device facility not scale?
> Each VF device must implement new set of on-chip memory-based registers which demands more power, die area which does not scale efficiently to thousands of VFs.
Those can be FPGA gates or an SoC implementing new features. Do you think
that is a waste?
>
>> vDPA works fine on config space.
>>
>> So, if you still insist admin vq is better than config space like in
>> other thread you have concluded, you may imply that config space
>> interfaces should be re-factored to admin vq.
> Whatever is done in past is done, there is no way to change history.
> An new non init time registers should not be placed in device specific config space as virtio spec has clear guideline on it for good.
> Device context reading, dirty page address reading, changing vf device modes, all of these are clearly not a init time settings.
> Hence, they do not belong to the registers.
What about reset vq? And you took that from Appendix B, "Creating New Device
Types". Are we implementing a new type of device?




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23 10:01                                                                     ` Parav Pandit
@ 2023-10-23 10:14                                                                       ` Zhu, Lingshan
  2023-10-23 10:26                                                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-23 10:14 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/23/2023 6:01 PM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, October 23, 2023 3:18 PM
>
>>> No. Please read the response carefully.
>>> I said 'For non-backward compatible SIOV device of the future, yes, virtio-pci
>>> common config (non init registers) should be moved to a vq, located on the
>>> member device directly.'
>>> Notice the 'member device directly'.
>>> Not the PF admin vq.
>> I think this is a question to Michael and he answered.
>>
>> We are talking about PCI, not SIOV, for SIOV we need transport vq.
>>
> Hypervisor for future device and future functionality must not get involved in looking the device configuration.
> Hence, as long as transport vq is located on the SIOV device itself for non-backward items, it is fine to transport SIOV configuration.
> For backward compatibility purpose, one will be able to use the aq of the owner device. No need to create a new transport VQ.
> To create a another transport vq, need to clarify the limitations of aq that transport vq can overcome, and why aq cannot be extended to overcome it.
SIOV and transport vq are not related to this topic; don't mix them.

And admin vq is not a must for live migration.

And we are not introducing a new device type here.

For future devices and future functionality, let's discuss them when they
are implemented, in their own series.
>   
>> Here again, we are introducing basic facilities for live migration, and the
>> implementation is transport-specific.
> Not relevant comment.
>
>>>> Config space is control path, DMA is data-path, let's better not mix
>>>> them, we never expect to use config space to transfer data.
>>>>
>>> And that control path is only for the init time configuration as
>>> correctly listed in the virtio spec as,
>>>
>>> "Device configuration space should only be used for initialization-time
>>> parameters."
>> don't you know new field reset_vq is introduced to virtio common cfg?
>> This is not only for initialization, right?
> Right. It was unfortunate and also it was last moment entry that we had fixed in reset register polarity.
That is Appendix B, "Creating New Device Types", and we are not introducing
a new device type.
>
>> and your citation is from Appendix B. Creating New Device Types, are we
>> creating a new device type?
> That is guidance for the new device creation on "how to use config space?"
> It equally applied to existing devices too to not grow.
> The section is equally helpful for new creators and for extending devices like you and me to understand what not to put in config space.
This does not make any sense. If you stick to the wording, then let me
repeat again: "Appendix B. Creating New Device Types"!
>
>>>> So we need DMA to transfer data, for example I take advantages of
>>>> device DMA to logging dirty pages, This also applies to in-flight descriptors.
>>>>
>>> Can you please explain why virtqueue cannot be used for DMA bulk data
>>> transfer as listed in virtio spec.
>>> "The mechanism for bulk data transport on virtio devices is pretentiously
>>> called a virtqueue"
>> what is your point? vq can do DMA, so what?
> I am asking,
> If there is AQ on the member device, can you use it? If not, what is the technical reason(s) to not use it.
As repeated many times: QoS, nested, and so on.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23 10:09                                                               ` Zhu, Lingshan
@ 2023-10-23 10:14                                                                 ` Parav Pandit
  2023-10-24 10:30                                                                   ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-23 10:14 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, October 23, 2023 3:39 PM
> 
> On 10/20/2023 8:54 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Friday, October 20, 2023 3:01 PM
> >>
> >> On 10/19/2023 6:33 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Thursday, October 19, 2023 2:48 PM
> >>>>
> >>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> >>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> >>>>>>> Oh, really? Quite interesting, do you want to move all config
> >>>>>>> space fields in VF to admin vq? Have a plan?
> >>>>>> Not in my plan for spec 1.4 time frame.
> >>>>>> I do not want to divert the discussion, would like to focus on
> >>>>>> device migration phases.
> >>>>>> Lets please discuss in some other dedicated thread.
> >>>>> Possibly, if there's a way to send admin commands to vf itself
> >>>>> then Lingshan will be happy?
> >>>> still need to prove why admin commands are better than registers.
> >>> Virtio spec development is not proof based approach. Please stop asking
> >>> for it.
> >>>
> >>> I tried my best to have technical answer in [1].
> >>> I explained that registers simply do not work for passthrough mode
> >>> (if this is what you are asking when you are asking prove its better).
> >>> They can work for non_passthrough mediated mode.
> >>>
> >>> A member device may do admin commands using registers. Michael and I
> >>> are discussing presently in the same thread.
> >>> Since there are multiple things to be done for device migration,
> >>> dedicated register set for each functionality do not scale well, hard to
> >>> maintain and extend.
> >>> A register holding a command content make sense.
> >>>
> >>> Now, with that, if this can be useful only for non_passthrough, I
> >>> made humble request to transport them using AQ, this way, you get all
> >>> benefits of AQ.
> >>> And trying to understand, why AQ cannot possible or inferior?
> >>>
> >>> If you have commands like suspend/resume device, register or queue
> >>> transport simply don’t work, because it's wrong to bifurcate the
> >>> device with such weird API.
> >>> If you want to bifurcate for mediation software, it probably makes
> >>> sense to operate at each VQ level, config space level. Such are very
> >>> different commands than passthrough.
> >>> I think vdpa has demonstrated that very well on how to do specific
> >>> work for specific device type. So some of those work can be done using AQ.
> >>> [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> >> We have been through your statement for many times.
> >> This is not about how many times you repeated, if you think this is
> >> true, you need to prove that with solid evidence.
> >>
> > I will not respond to this comment anymore.
> Ok if you choose not to respond.
> >
> >> For pass-through, I still recommend you to take a reference of
> >> current virtio-pci implementation, it works for pass-through, right?
> > What do you mean by current virtio-pci implementation?
> The current virtio-pci works for pass-through.
I still don’t understand what "current virtio-pci" means.
Do you mean the QEMU implementation of emulated virtio-pci, or do you mean the virtio-pci specification for passthrough?
What do you want me to refer to for passthrough? Please clarify.

> >
> >> For scale, I already told you for many times that they are per-device
> >> facilities. How can a per-device facility not scale?
> > Each VF device must implement new set of on-chip memory-based registers
> > which demands more power, die area which does not scale efficiently to
> > thousands of VFs.
> that can be fpga gates or SOC implementing new features, you think that is a
> waste?
It is a waste in hardware if a better approach exists that avoids burning them as gates and saves resources for rarely used items.


> >
> >> vDPA works fine on config space.
> >>
> >> So, if you still insist admin vq is better than config space like in
> >> other thread you have concluded, you may imply that config space
> >> interfaces should be re-factored to admin vq.
> > Whatever is done in past is done, there is no way to change history.
> > An new non init time registers should not be placed in device specific config
> > space as virtio spec has clear guideline on it for good.
> > Device context reading, dirty page address reading, changing vf device modes,
> > all of these are clearly not a init time settings.
> > Hence, they do not belong to the registers.
> reset vq? and you get it from Appendix B. Creating New Device Types, are we
> implementing a new type of device???
I don’t understand your question.
I already explained the history of reset_vq.
Follow good examples; reset_vq is clearly not one.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23 10:14                                                                       ` Zhu, Lingshan
@ 2023-10-23 10:26                                                                         ` Parav Pandit
  2023-10-24 10:10                                                                           ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-23 10:26 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, October 23, 2023 3:44 PM
> 
> On 10/23/2023 6:01 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, October 23, 2023 3:18 PM
> >
> >>> No. Please read the response carefully.
> >>> I said 'For non-backward compatible SIOV device of the future, yes,
> >>> virtio-pci common config (non init registers) should be moved to a vq,
> >>> located on the member device directly.'
> >>> Notice the 'member device directly'.
> >>> Not the PF admin vq.
> >> I think this is a question to Michael and he answered.
> >>
> >> We are talking about PCI, not SIOV, for SIOV we need transport vq.
> >>
> > Hypervisor for future device and future functionality must not get involved in
> > looking the device configuration.
> > Hence, as long as transport vq is located on the SIOV device itself for
> > non-backward items, it is fine to transport SIOV configuration.
> > For backward compatibility purpose, one will be able to use the aq of the
> > owner device. No need to create a new transport VQ.
> > To create a another transport vq, need to clarify the limitations of aq that
> > transport vq can overcome, and why aq cannot be extended to overcome it.
> SIOV and transport vq is not related to this topic, don't mix them.
> 
> and admin vq is not a must for live migration.

Ok. You raised the point of transport vq...

>
These are all repeated points, leading nowhere for either of us.
 
> and we are not introducing a new device type here.
> 
It does not matter.

> For future device and future functionalities, let's discuss when they are
> implementing, on their series.
A new device will inherit the "basic functionality, non-init-time register"...
So please don’t propose implementing it.

> >
> >> Here again, we are introducing basic facilities for live migration,
> >> and the implementation is transport-specific.
> > Not relevant comment.
> >
> >>>> Config space is control path, DMA is data-path, let's better not
> >>>> mix them, we never expect to use config space to transfer data.
> >>>>
> >>> And that control path is only for the init time configuration as
> >>> correctly listed in the virtio spec as,
> >>>
> >>> "Device configuration space should only be used for
> >>> initialization-time parameters."
> >> don't you know new field reset_vq is introduced to virtio common cfg?
> >> This is not only for initialization, right?
> > Right. It was unfortunate and also it was last moment entry that we had fixed
> > in reset register polarity.
> Appendix B. Creating New Device Types, and we are not introducing new device
> type.
The concept still applies to existing device types.
It is illogical otherwise.

> >
> >> and your citation is from Appendix B. Creating New Device Types, are
> >> we creating a new device type?
> > That is guidance for the new device creation on "how to use config space?"
> > It equally applied to existing devices too to not grow.
> > The section is equally helpful for new creators and for extending devices like
> > you and me to understand what not to put in config space.
> this does not make any sense, if you stick to the wording, then let me repeat
> again "Appendix B. Creating New Device Types"!!!!!!
Sorry, you are implying that a new device type should be efficient while an existing one may be made worse. That does not make sense to me.

It is fully logical to have only init-time things in the config registers, as done today in the spec for existing and new devices.

I would be happy to extend "B.5 Device improvements" to capture this too.

> >
> >>>> So we need DMA to transfer data, for example I take advantages of
> >>>> device DMA to logging dirty pages, This also applies to in-flight descriptors.
> >>>>
> >>> Can you please explain why virtqueue cannot be used for DMA bulk
> >>> data transfer as listed in virtio spec.
> >>> "The mechanism for bulk data transport on virtio devices is
> >>> pretentiously called a virtqueue"
> >> what is your point? vq can do DMA, so what?
> > I am asking,
> > If there is AQ on the member device, can you use it? If not, what is the
> > technical reason(s) to not use it.
> Repeated for many times, QOS, nested and so on.
Why would there be any QoS issue when the AQ is on the member device for the non-passthrough use case?
Why wouldn't nested work when the AQ is on the member device for the non-passthrough use case?

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23 10:03                                                                   ` Zhu, Lingshan
@ 2023-10-23 11:32                                                                     ` Michael S. Tsirkin
  2023-10-24 10:27                                                                       ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-23 11:32 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Mon, Oct 23, 2023 at 06:03:10PM +0800, Zhu, Lingshan wrote:
> config space, MMIO, registers work for years, what is wrong with them?

Nothing as such. They don't seem to be appropriate for all use cases
where people want to utilize virtio. I think a new transport
will be needed to address these.


> > 
> > > Config space is control path, DMA is data-path, let's better not mix them,
> > > we never expect to use config space to transfer data.
> > > 
> > > So we need DMA to transfer data, for example I take advantages of device DMA
> > > to logging dirty pages, This also applies to in-flight descriptors.
> > As long as you do, I personally see little benefit to retrieve parts of
> > state with memory mapped accesses.
> registers only control, and I personally believe a single register is much
> better
> than processing admin commands, more light-weight, more reliable, working
> for years.

Yea. It would be, if we could do everything through that register.
But we can't really. Migration has too much data to pass around
for that to be reasonable.
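
To put rough numbers on the data-volume point above, here is a
back-of-envelope sketch. It is an illustration only: the 1 MiB context
size, the 32-bit register width, the 64 KiB buffer size, and the function
names are all assumptions, not values from this thread or from the spec.

```python
# Back-of-envelope comparison; every size here is hypothetical.
REGISTER_WIDTH_BYTES = 4  # assume one 32-bit memory-mapped register


def register_accesses(context_bytes: int, width: int = REGISTER_WIDTH_BYTES) -> int:
    """Number of register reads needed to drain context_bytes through one register."""
    return -(-context_bytes // width)  # ceiling division


def dma_commands(context_bytes: int, buffer_bytes: int) -> int:
    """Number of commands needed if each command can DMA up to buffer_bytes."""
    return -(-context_bytes // buffer_bytes)  # ceiling division


if __name__ == "__main__":
    ctx = 1 << 20  # assume a 1 MiB device context
    print(register_accesses(ctx))        # 262144 register reads
    print(dma_commands(ctx, 64 * 1024))  # 16 commands
```

Under these assumed sizes the register path needs several orders of
magnitude more device accesses than the DMA-backed command path.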

> Config space interfaces are fundamental for virtio-pci.


They are in fact fundamental to virtio. Multiple transports to
use config space are also fundamental.


> > 
> > > And we are implementing virito live migration, not only for PCI.
> > > 
> > > So both me and Jason keep repeating: We are implementing basic facilities,
> > > and the implementation is transport specific.
> > But the register based facilities you proposed are extremely limited and
> > seem to only work for migration. For example, it seems mostly useless for
> > debugging because retrieving state is rather complex and would
> > interfere with normal working of the device.
> If you want to prove the register controlling interfaces are extremely
> limited than admin vq or admin cmds,
> you are also proving config space registers are extremely limited than
> admin vq.

Yes. Migration needs ability to pass large amounts of data around, and
is too complex a functionality to work reliably without ability to
report errors.
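
The error-reporting point can be sketched with a toy model. This is an
illustration under stated assumptions, not the interface of either
proposal: `Status`, `CommandResult`, `read_context_cmd`, and
`read_context_reg` are hypothetical names, and the status values are not
the ones defined by the virtio spec.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical status codes; not the values defined by the virtio spec."""
    OK = 0
    EINVAL = 1


@dataclass
class CommandResult:
    status: Status
    data: bytes = b""


def read_context_cmd(context: bytes, offset: int, length: int) -> CommandResult:
    """Command-style read: the reply carries an explicit status next to the data."""
    if offset < 0 or length <= 0 or offset >= len(context):
        return CommandResult(Status.EINVAL)
    return CommandResult(Status.OK, context[offset:offset + length])


def read_context_reg(context: bytes, offset: int) -> int:
    """Register-style read: only a 32-bit value comes back.

    This model returns all ones for an out-of-range read, so the caller
    cannot tell a failure from real data that happens to be 0xFFFFFFFF.
    """
    chunk = context[offset:offset + 4]
    if len(chunk) < 4:
        return 0xFFFFFFFF
    return int.from_bytes(chunk, "little")
```

In this model a register read at a bad offset and a register read of
valid all-ones data look identical to the caller, while the command-style
read reports the failure separately from the payload.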

> So the question still here: do you want to replace current virtio-pci common
> cfg
> with admin vq or admin cmds?

I think we need to add a new transport that will use admin commands.
Which one to use would be up to a specific device.


> And debug what? If you want to introduce more functionalities, we should
> discuss
> case by case.
> 
> If debugging vq state, it is as easy as read queue_size, I don't see the
> limitations
> as queue_size work for years.

No one reads queue_size. In fact for years we didn't have any debugging
functionality and we were fine. If we are adding it, it really needs to
be accessible when driver and device are wedged.


> I still believe our goal is to do our best, with our capabilities, to build
> the most optimal virtio spec
> as we can do. Not other goals.
> 
> Thanks
> Zhu Lingshan
> > 
> > 
> > > We have proposed to build admin vq based on our register solution, this can
> > > somehow even help to resolve the nested issue.
> > > 
> > > But I see the proposed has been rejected.
> > > 
> > > I still believe the goal is to build a best spec, not "just can work" with
> > > limitations.
> > > 
> > > 
> > > 
> > > > 




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23  3:53                                                               ` Jason Wang
@ 2023-10-23 11:33                                                                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-23 11:33 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Mon, Oct 23, 2023 at 11:53:50AM +0800, Jason Wang wrote:
> On Fri, Oct 20, 2023 at 5:41 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Oct 20, 2023 at 05:31:01PM +0800, Zhu, Lingshan wrote:
> > >
> > >
> > > On 10/19/2023 6:33 PM, Parav Pandit wrote:
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Thursday, October 19, 2023 2:48 PM
> > > > >
> > > > > On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> > > > > > > > Oh, really? Quite interesting, do you want to move all config space
> > > > > > > > fields in VF to admin vq? Have a plan?
> > > > > > > Not in my plan for spec 1.4 time frame.
> > > > > > > I do not want to divert the discussion, would like to focus on device
> > > > > migration phases.
> > > > > > > Lets please discuss in some other dedicated thread.
> > > > > > Possibly, if there's a way to send admin commands to vf itself then
> > > > > > Lingshan will be happy?
> > > > > still need to prove why admin commands are better than registers.
> > > > Virtio spec development is not proof based approach. Please stop asking for it.
> > > >
> > > > I tried my best to have technical answer in [1].
> > > > I explained that registers simply do not work for passthrough mode
> > > > (if this is what you are asking when you are asking prove its better).
> > > > They can work for non_passthrough mediated mode.
> > > >
> > > > A member device may do admin commands using registers. Michael and I are discussing presently in the same thread.
> > > >
> > > > Since there are multiple things to be done for device migration, dedicated register set for each functionality do not scale well, hard to maintain and extend.
> > > > A register holding a command content make sense.
> > > >
> > > > Now, with that, if this can be useful only for non_passthrough, I made humble request to transport them using AQ, this way, you get all benefits of AQ.
> > > > And trying to understand, why AQ cannot possible or inferior?
> > > >
> > > > If you have commands like suspend/resume device, register or queue transport simply don’t work, because it's wrong to bifurcate the device with such weird API.
> > > > If you want to bifurcate for mediation software, it probably makes sense to operate at each VQ level, config space level. Such are very different commands than passthrough.
> > > > I think vdpa has demonstrated that very well on how to do specific work for specific device type. So some of those work can be done using AQ.
> > > >
> > > > [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> > > We have been through your statement for many times.
> > > This is not about how many times you repeated, if you think this is true,
> > > you need to prove that with solid evidence.
> > >
> > >
> > > For pass-through, I still recommend you to take a reference of current
> > > virtio-pci implementation, it works for pass-through, right?
> >
> > Current migration implementation in e.g. QEMU? It does but it
> > traps data path accesses. That, I think we can agree,
> > should not be the only option to migrate.
> >
> > > For scale, I already told you for many times that they are per-device
> > > facilities. How can a per-device facility not scale?
> > > vDPA works fine on config space.
> > >
> > > So, if you still insist admin vq is better than config space like in other
> > > thread you have concluded, you may imply that config space interfaces should
> > > be re-factored to admin vq.
> >
> > There are good arguments that yes, virtio needs a transport for config space
> > that is DMA based as opposed to memory mapped based.  This is one of the
> > things all vendors seem to prefer in IDPF so virtio should have the option.
> 
> Then it is the transport vq or transport over adminq proposal?
> 
> Thanks

That's one way to do it, yes.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-23  4:42                                             ` Parav Pandit
@ 2023-10-24  4:46                                               ` Jason Wang
  2023-10-24  4:49                                                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-24  4:46 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Mon, Oct 23, 2023 at 12:42 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, October 23, 2023 9:15 AM
> >
> > On Thu, Oct 19, 2023 at 1:32 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: virtio-comment@lists.oasis-open.org
> > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Jason Wang
> > > > Sent: Thursday, October 19, 2023 10:15 AM
> > > >
> > > > > > Again, if you don't want to talk about transport virtqueue,
> > > > > > that's fine. But let's leave the scalability issue aside as well.
> > > > > >
> > > > > Registers are related for functionality and scale.
> > > > >
> > > > > Lets first agree on use case before the design, that I asked above.
> > > > >
> > > > > I will wait to respond to any other emails until we agree on use
> > > > > case
> > > > requirements.
> > > >
> > > > There are more than just me who want you to define "passthrough"
> > > > first where you refuse to respond.
> > > >
> > > Totally disagree.
> > > In the previous email itself, I wrote what passthrough is.
> > > So let's try yet one more time.
> > > Either you can re-read last email or for better read below and see if it is
> > understood or not.
> > >
> > > > How could we make any agreement without an accurate the definition
> > > > of "passthrough" who is a key to understand each other?
> > >
> > > I replied few times in past emails but since those email threads are so long, it is
> > easy to miss out.
> > >
> > > Passthrough definition:
> > > a. virtio member device mapped to the guest vm
> >
> > I really think we need to be accurate here. For example, what does "map" mean
> > here?
> >
> Not trapped by hypervisor is better wording than mapped.
>
> > > b. only pci config space and msix of a member device is intercepted by
> > hypervisor.
> >
> > What's the criteria for choosing a cap/bar to be trapped or not? For example,
> > there're a lot of other things that need to be virtualized besides MSI-X for sure.
> >
> For passthrough, which are those?

I haven't gone through all the caps, but this is what I have in mind:

1) vIOMMU-related capabilities: ATS/PRI, and assigning a PASID to a virtqueue in the future
2) Capabilities related to resources, such as Resizable BAR

Thanks


>
> > > c. virtio config space, virtio cvqs, data vqs of a member device is directly
> > accessed by the guest vm without intercepted by the hypervisor.
> > >
> > > (Why b?, no grand reason, it is how the hypervisors are working where to
> > integrate the virtio member device to).
> >
> > What you state here is more about the method to support a use case not the
> > use case itself. My understanding is "live migration" is a valid use case for sure.
> > And "passthrough" is probably one way for achieving live migration but this is
> > what you need to define and justify in this series.
> >
> I largely captured that in v2 in device context and in theory of operation.
> I will keep it short as written in v2. If something is unclear (without writing the book), I am happy to extend it.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-24  4:46                                               ` Jason Wang
@ 2023-10-24  4:49                                                 ` Parav Pandit
  2023-10-25  1:28                                                   ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-24  4:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 24, 2023 10:16 AM
> 
> On Mon, Oct 23, 2023 at 12:42 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, October 23, 2023 9:15 AM
> > >
> > > On Thu, Oct 19, 2023 at 1:32 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: virtio-comment@lists.oasis-open.org
> > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Jason Wang
> > > > > Sent: Thursday, October 19, 2023 10:15 AM
> > > > >
> > > > > > > Again, if you don't want to talk about transport virtqueue,
> > > > > > > that's fine. But let's leave the scalability issue aside as well.
> > > > > > >
> > > > > > Registers are related for functionality and scale.
> > > > > >
> > > > > > Lets first agree on use case before the design, that I asked above.
> > > > > >
> > > > > > I will wait to respond to any other emails until we agree on
> > > > > > use case
> > > > > requirements.
> > > > >
> > > > > There are more than just me who want you to define "passthrough"
> > > > > first where you refuse to respond.
> > > > >
> > > > Totally disagree.
> > > > In the previous email itself, I wrote what passthrough is.
> > > > So let's try yet one more time.
> > > > Either you can re-read last email or for better read below and see
> > > > if it is
> > > understood or not.
> > > >
> > > > > How could we make any agreement without an accurate the
> > > > > definition of "passthrough" who is a key to understand each other?
> > > >
> > > > I replied few times in past emails but since those email threads
> > > > are so long, it is
> > > easy to miss out.
> > > >
> > > > Passthrough definition:
> > > > a. virtio member device mapped to the guest vm
> > >
> > > I really think we need to be accurate here. For example, what does
> > > "map" mean here?
> > >
> > Not trapped by hypervisor is better wording than mapped.
> >
> > > > b. only pci config space and msix of a member device is
> > > > intercepted by
> > > hypervisor.
> > >
> > > What's the criteria for choosing a cap/bar to be trapped or not? For
> > > example, there're a lot of other things that need to be virtualized besides
> MSI-X for sure.
> > >
> > For passthrough, which are those?
> 
> I haven't gone through all the caps, but this is what I have in mind:
> 
> 1) vIOMMU-related capabilities: ATS/PRI, and assigning a PASID to a virtqueue in the future
> 2) Capabilities related to resources, such as Resizable BAR
For passthrough, PASID assignment to a vq is not needed.
If it is done at all, it will be done from the guest by the driver using the virtio interface.
The capabilities in #2 are generic across all PCI devices, so they will be handled by the HV.
The ATS/PRI caps are likewise handled in a generic manner by the HV and the PCI device.
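
To make the passthrough definition from this thread concrete, here is a minimal sketch in C of the split being described: PCI config space (with its generic caps) and MSI-X are intercepted by the hypervisor, while virtio config space, cvqs and data vqs are accessed directly by the guest. The enum and function names are hypothetical and exist only for illustration; they are not taken from the spec or from any driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the parts of a member device named in this thread. */
enum member_region {
    REGION_PCI_CONFIG_SPACE, /* includes generic caps such as ATS/PRI, Resizable BAR */
    REGION_MSIX,
    REGION_VIRTIO_CONFIG,    /* virtio common and device config */
    REGION_VIRTIO_CVQ,
    REGION_VIRTIO_DATA_VQ,
    REGION_COUNT
};

/* true: intercepted by the hypervisor; false: accessed directly by the guest. */
static const bool trapped_in_passthrough[REGION_COUNT] = {
    [REGION_PCI_CONFIG_SPACE] = true,
    [REGION_MSIX]             = true,
    [REGION_VIRTIO_CONFIG]    = false,
    [REGION_VIRTIO_CVQ]       = false,
    [REGION_VIRTIO_DATA_VQ]   = false,
};

/* Report whether the hypervisor intercepts accesses to the given region. */
bool hypervisor_traps(enum member_region r)
{
    assert((unsigned)r < REGION_COUNT);
    return trapped_in_passthrough[r];
}
```

The table form is only a convenience: it keeps the policy in one place, so a disagreement about a single cap shows up as a one-line change.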


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23  4:42                                                   ` Parav Pandit
@ 2023-10-24  4:56                                                     ` Jason Wang
  2023-10-24 10:01                                                       ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-24  4:56 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Mon, Oct 23, 2023 at 12:43 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, October 23, 2023 9:15 AM
> >
> > On Wed, Oct 18, 2023 at 6:23 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Wednesday, October 18, 2023 3:26 PM
> > >
> > > > For completeness, and to shorten the thread, can you please list
> > > > known issues/use cases that are addressed by the status bit
> > > > interface and how you plan for them to be addressed?
> > >
> > > I will avoid listing known issues for a moment for status bit in this email.
> > >
> > > Status bit interface helps in following good ways.
> > > 1. suspend/resume the device fully by the guest by negotiating the new
> > feature.
> > > This can be useful in the guest-controlled PM flows of suspend/resume.
> > > I still think for this, only feature bit is necessary, and device_status
> > modification is not needed.
> >
> > Which feature bit did you mean here?
> >
> A new feature bit to indicate the guest that device supports suspend and resume, hence, there is no need to reset the device and destroy resources like how it is done today.

Well, I don't see how it is different from what LingShan proposed.

>
> > > D0->D3 and D3->D0 transition of the pci can suspend and resume the device
> > which can preserve the last device_status value before entering D3.
> >
> > It's not only about the device status. I would not repeat the question I've asked
> > in another thread.
> >
> > What's more, if you really want to suspend/freeze at PCI level and deal with PCI
> > specific issues like P2P.  You should really try to leverage or invent a PCI
> > mechanism instead of trying to carry such semantics via a virtio specific stuff
> > like adminq. Solving transport specific problems at the virtio level is a layer
> > violation.
> >
> PCI spec has already defined what it needs to.

If PCI spec has good support for suspend/resume, why bother inventing
mechanisms in virtio?

> SR-PCIM interface is already concluded being outside of PCI-spec by the pci-sig.
> And no, there is no layer violation.
>
> Any non PCI member device can always implement necessary STOP mode as no-op.
>
> And all of those talk make sense when one creates MMIO based member device, until that point is just objections...

They are different layers:

1) suspend/resume at virtio level
2) suspend/resume at transport level

We need both of them to satisfy different cases, just as we need
reset at both the virtio level and the VF level (FLR). Lingshan proposes 1),
while it looks to me like you propose 2) via the virtio adminq; but you said
PCI already supports it, which makes this a duplication.

>
> > > (Like preserving all rest of the fields of common and other device config).
> > > This is orthogonal and needed regardless of device migration.
> > >
> > > 2. If one does not want to passthrough a member device, but build a
> > > mediation-based device on top of existing virtio device, It can be useful with
> > mediating software.
> > > Here the mediating software has ample duplicated knowledge of what the
> > member device already has.
> >
> > It is the way the hypervisors are doing for not only virtio but also for CPU and
> > MMU as well.
> >
> Not really, vcpus and VMCS and more are part of the hardware support.

That's not the context here. Hypervisors need to know almost every
detail to make CPU virtualization work. That's a fact, and it has worked
for virtio as well for years.

What's more, nothing prevents us from inventing something similar in
virtio to speed up the context switch or migration if necessary.

> 2 level nested page tables is hw support.
> Anything beyond 2 level nesting, likely involves hypervisor.

Needs emulation/trap for sure. That's the point.

>
> > > This can fulfil the nested requirement differently provided a platform support
> > it.
> > > (PASID limitation will be practical blocker here).
> >
> > I don't think PASID is a blocker. It is only a blocker if you want to do passthrough.
> >
> Even without passthrough, one needs to steer the hypervisor DMA to non guest memory.
> And guest driver must not be able to attack (read/write) from that memory.
> I don’t see how one can do this without PASID. As all DMAs are tagged using only RID.

There are a lot of other ways, but in order to converge, we can leave
it for future discussions.

What's more, if we design virtio for the future, PASID must be
considered, as we all know it will come for sure.

>
> > >
> > > How to I plan to address above two?
> > > a. #1 to be addressed by having the _F_PM bit, when the bit is negotiated PCI
> > PM drives the state.
> >
> > We can't duplicate every transport specific feature in virtio. This is a layer
> > violation again. We should reuse the PCI facility here.
> >
> It is reused by having the feature bit to indicate that device supports suspend/resume.
> If from Day_1, if the PCI PM bits used, it would not require the feature bit.
> But that was not the case.
> So the guest driver do not know if using the PCI PM bit is enough to decide, if suspend/resume by guest will work or not.
> Hence the feature bit.

Anyhow you need to update the driver if it has an issue. In the
update, you can check and use PCI PM. If it doesn't have PCI PM, you
can only suspend/resume at virtio level. Defining transport semantics
at the virtio level breaks the layers.
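
As a rough sketch of the "check and use PCI PM, otherwise suspend at the virtio level" flow described above, the C below walks the standard capability list in a snapshot of PCI config space and picks a path. The register offsets and the Power Management capability ID are the standard PCI ones; the function names, the enum and the 256-byte snapshot are assumptions made for this example and do not come from any existing driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_CFG_SIZE         256
#define PCI_STATUS           0x06 /* low byte of the 16-bit status register */
#define PCI_STATUS_CAP_LIST  0x10 /* bit 4: capability list is present */
#define PCI_CAPABILITY_LIST  0x34 /* pointer to the first capability */
#define PCI_CAP_ID_PM        0x01 /* Power Management capability ID */

/* Return the offset of capability `id` in the snapshot, or 0 if it is absent. */
uint8_t find_cap(const uint8_t cfg[PCI_CFG_SIZE], uint8_t id)
{
    assert(cfg != NULL);
    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;

    uint8_t pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;
    /* Bound the walk so that a malformed, circular list still terminates. */
    for (int ttl = 48; ttl > 0 && pos >= 0x40; ttl--) {
        if (cfg[pos] == id)
            return pos;
        pos = cfg[pos + 1] & 0xFC; /* next-capability pointer */
    }
    return 0;
}

/* Hypothetical names for the two layers discussed in this thread. */
enum suspend_path { SUSPEND_VIA_PCI_PM, SUSPEND_VIA_VIRTIO };

/* Prefer the transport facility when the device exposes it. */
enum suspend_path choose_suspend_path(const uint8_t cfg[PCI_CFG_SIZE])
{
    return find_cap(cfg, PCI_CAP_ID_PM) ? SUSPEND_VIA_PCI_PM
                                        : SUSPEND_VIA_VIRTIO;
}
```

Note that this only detects that the capability is present; whether presence alone is enough to rely on is exactly the open point in this exchange.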

>
> > > This will work orthogonal to VMM side migration and will co-exist with VMM
> > based device migration.

Actually not: if the PF could suspend the VF via PCI facilities, there
would no longer be a layer violation.

> > >
> > > b. nested use case:
> > > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> >
> > Let me ask it again here, how can you migrate L2 using L1 "emulated"
> > PF? Emulation?
> >
> Emulation is one way as most nested platform components do.

That's the point, you can't avoid emulation.

Thanks


> May be L1 VF which is = VF + SR-IOV capability is = emulated PF. This PF can run exact same commands as L0 level PF.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-24  4:56                                                     ` Jason Wang
@ 2023-10-24 10:01                                                       ` Parav Pandit
  2023-10-25  1:28                                                         ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-24 10:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 24, 2023 10:27 AM
> 
> On Mon, Oct 23, 2023 at 12:43 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, October 23, 2023 9:15 AM
> > >
> > > On Wed, Oct 18, 2023 at 6:23 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Wednesday, October 18, 2023 3:26 PM
> > > >
> > > > > For completeness, and to shorten the thread, can you please list
> > > > > known issues/use cases that are addressed by the status bit
> > > > > interface and how you plan for them to be addressed?
> > > >
> > > > I will avoid listing known issues for a moment for status bit in this email.
> > > >
> > > > Status bit interface helps in following good ways.
> > > > 1. suspend/resume the device fully by the guest by negotiating the
> > > > new
> > > feature.
> > > > This can be useful in the guest-controlled PM flows of suspend/resume.
> > > > I still think for this, only feature bit is necessary, and
> > > > device_status
> > > modification is not needed.
> > >
> > > Which feature bit did you mean here?
> > >
> > A new feature bit to indicate the guest that device supports suspend and
> resume, hence, there is no need to reset the device and destroy resources like
> how it is done today.
> 
> Well, I don't see how it is different from what LingShan proposed.
The difference is that, in passthrough mode, it will be fully controlled by the guest VM without involving the hypervisor.
It will work even while device migration is ongoing.
What Lingshan proposed involves messing with the device status.
It should be a separate register, as Jingchen proposed, or no register at all if the PCI transport supports it.
> 
> >
> > > > D0->D3 and D3->D0 transition of the pci can suspend and resume the
> > > > device
> > > which can preserve the last device_status value before entering D3.
> > >
> > > It's not only about the device status. I would not repeat the
> > > question I've asked in another thread.
> > >
> > > What's more, if you really want to suspend/freeze at PCI level and
> > > deal with PCI specific issues like P2P.  You should really try to
> > > leverage or invent a PCI mechanism instead of trying to carry such
> > > semantics via a virtio specific stuff like adminq. Solving transport
> > > specific problems at the virtio level is a layer violation.
> > >
> > PCI spec has already defined what it needs to.
> 
> If PCI spec has good support for suspend/resume, why bother inventing
> mechanisms in virtio?
> 
Because virtio today does not know whether PCI-level suspend/resume will actually work: in the past it has not worked even when the PM capability was exposed.
So only a feature bit is needed.
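
A minimal sketch of that gating in C, under stated assumptions: the feature bit number below is a placeholder and is NOT an allocated virtio feature bit (the thread only calls it _F_PM), and the helper names are made up for this example. Only the PMCSR PowerState encoding (bits 1:0, D0 = 0, D3hot = 3) is standard PCI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_PM_CTRL_STATE_MASK 0x0003 /* PMCSR bits 1:0: PowerState */
#define PCI_PM_STATE_D0        0x0
#define PCI_PM_STATE_D3HOT     0x3

/* Placeholder only: not an allocated virtio feature bit. */
#define HYPOTHETICAL_VIRTIO_F_PM_BIT 62

/* The guest relies on PCI PM for suspend/resume only if the bit was negotiated. */
bool guest_may_use_pci_pm(uint64_t negotiated_features)
{
    return (negotiated_features >> HYPOTHETICAL_VIRTIO_F_PM_BIT) & 1;
}

/* Compute the PMCSR value that requests `state`, preserving the other bits. */
uint16_t pmcsr_with_state(uint16_t pmcsr, uint16_t state)
{
    assert(state <= PCI_PM_CTRL_STATE_MASK);
    return (uint16_t)((pmcsr & ~PCI_PM_CTRL_STATE_MASK) | state);
}
```

Without the negotiated bit the driver would stay on today's path, that is, reset the device and rebuild its resources on resume.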

> > SR-PCIM interface is already concluded being outside of PCI-spec by the pci-
> sig.
> > And no, there is no layer violation.
> >
> > Any non PCI member device can always implement necessary STOP mode as
> no-op.
> >
> > And all of those talk make sense when one creates MMIO based member
> device, until that point is just objections...
> 
> They are different layers:
> 
> 1) suspend/resume at virtio level
> 2) suspend/resume at transport level
> 
> We need both of them to satisfy different cases. Just as we need to reset at both
> virtio and VF(FLR). Lingshan proposes 1) while it looks to me you propose 2) via
> virtio adminq but you said it has been supported by PCI which is then a
> duplication.
> 
#1 is needed and to be owned by the guest driver in passthrough
I didn’t propose #2.
I proposed #2 be controlled by the vmm/hypervisor (via admin cmd) who is in charge of vm suspend/resume flow.

> >
> > > > (Like preserving all rest of the fields of common and other device config).
> > > > This is orthogonal and needed regardless of device migration.
> > > >
> > > > 2. If one does not want to passthrough a member device, but build
> > > > a mediation-based device on top of existing virtio device, It can
> > > > be useful with
> > > mediating software.
> > > > Here the mediating software has ample duplicated knowledge of what
> > > > the
> > > member device already has.
> > >
> > > It is the way the hypervisors are doing for not only virtio but also
> > > for CPU and MMU as well.
> > >
> > Not really, vcpus and VMCS and more are part of the hardware support.
> 
> That's not the context here. Hypervisors need to know almost every detail to
> make CPU virtualization work. 
CPU virtualization is accelerated for first-level nesting, including interrupts.

> That's a fact, and it has worked for virtio as well for years.
> 
> What's more, nothing prevents us from inventing something similar in virtio to
> speed up the context switch or migration if necessary.
The major difference between CPU virtualization and network device virtualization is that the former flow is controlled by software, while the latter is controlled by the network, which is not predictable.
Hence, context switching can mostly work in theory but does not perform well with varied workloads.
Most production users prefer dedicated/isolated, non-context-switched rx.

> 
> > 2 level nested page tables is hw support.
> > Anything beyond 2 level nesting, likely involves hypervisor.
> 
> Needs emulation/trap for sure. That's the point.
> 
> >
> > > > This can fulfil the nested requirement differently provided a
> > > > platform support
> > > it.
> > > > (PASID limitation will be practical blocker here).
> > >
> > > I don't think PASID is a blocker. It is only a blocker if you want to do
> passthrough.
> > >
> > Even without passthrough, one needs to steer the hypervisor DMA to non
> guest memory.
> > And guest driver must not be able to attack (read/write) from that memory.
> > I don’t see how one can do this without PASID. As all DMAs are tagged using
> only RID.
> 
> There are a lot of other ways, but in order to converge, we can leave it for
> future discussions.
> 
So first-level passthrough seems to be a basic requirement to support, operated under VMM control.

Second-level nesting can be emulated or accelerated, following the principles of the paper you pointed to.

> What's more, if we design virtio for the future, PASID must be considered as a
> way as we all know it would come for sure.
> 
In the future, PASID should remain fully controlled by the guest, as it is today.
PASID-based bifurcation is still an open question to me.

> >
> > > >
> > > > How to I plan to address above two?
> > > > a. #1 to be addressed by having the _F_PM bit, when the bit is
> > > > negotiated PCI
> > > PM drives the state.
> > >
> > > We can't duplicate every transport specific feature in virtio. This
> > > is a layer violation again. We should reuse the PCI facility here.
> > >
> > It is reused by having the feature bit to indicate that device supports
> suspend/resume.
> > If from Day_1, if the PCI PM bits used, it would not require the feature bit.
> > But that was not the case.
> > So the guest driver do not know if using the PCI PM bit is enough to decide, if
> suspend/resume by guest will work or not.
> > Hence the feature bit.
> 
> Anyhow you need to update the driver if it has an issue. In the update, you can
> check and use PCI PM. If it doesn't have PCI PM, you can only suspend/resume
> at virtio level. Defining transport semantics at the virtio level breaks the layers.
> 
This series does not define transport semantics at virtio level.
It only defines virtio-level semantics of what is to be done and what is not.

> >
> > > > This will work orthogonal to VMM side migration and will co-exist
> > > > with VMM
> > > based device migration.
> 
> Actually not, if PF can suspend VF via PCI facilities, that would be no layer
> violation any more.
> 
There is no such PCI facility. PCI capabilities are not supposed to contain complex commands of the device-migration kind.
I explained this in the discussion with Michael.

> > > >
> > > > b. nested use case:
> > > > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > > > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> > >
> > > Let me ask it again here, how can you migrate L2 using L1 "emulated"
> > > PF? Emulation?
> > >
> > Emulation is one way as most nested platform components do.
> 
> That's the point, you can't avoid emulation.
It is applicable only after the first level.
The first level must be able to benefit without emulation, as the rest of the system modules do today.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23 10:26                                                                         ` Parav Pandit
@ 2023-10-24 10:10                                                                           ` Zhu, Lingshan
  2023-10-24 10:11                                                                             ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-24 10:10 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

[-- Attachment #1: Type: text/plain, Size: 4857 bytes --]



On 10/23/2023 6:26 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan<lingshan.zhu@intel.com>
>> Sent: Monday, October 23, 2023 3:44 PM
>>
>> On 10/23/2023 6:01 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan<lingshan.zhu@intel.com>
>>>> Sent: Monday, October 23, 2023 3:18 PM
>>>>> No. Please read the response carefully.
>>>>> I said 'For non-backward compatible SIOV device of the future, yes,
>>>>> virtio-pci
>>>> common config (non init registers) should be moved to a vq, located
>>>> on the member device directly.'
>>>>> Notice the 'member device directly'.
>>>>> Not the PF admin vq.
>>>> I think this is a question to Michael and he answered.
>>>>
>>>> We are talking about PCI, not SIOV, for SIOV we need transport vq.
>>>>
>>> Hypervisor for future device and future functionality must not get involved in
>> looking the device configuration.
>>> Hence, as long as transport vq is located on the SIOV device itself for non-
>> backward items, it is fine to transport SIOV configuration.
>>> For backward compatibility purpose, one will be able to use the aq of the
>> owner device. No need to create a new transport VQ.
>>> To create a another transport vq, need to clarify the limitations of aq that
>> transport vq can overcome, and why aq cannot be extended to overcome it.
>> SIOV and transport vq is not related to this topic, don't mix them.
>>
>> and admin vq is not a must for live migration.
> Ok. You raised the point of transport vq...
>
> All repeat points, not leading anywhere for you nor me.
so, let's not take SIOV into consideration for this topic.
>   
>> and we are not introducing a new device type here.
>>
> It does not matter.
Your citation is from that section.
>
>> For future device and future functionalities, let's discuss when they are
>> implementing, on their series.
> The new device will inherit the "basic functionality non init time register"...
> So please don’t propose to implement such.
so, still, I am not introducing a new type of device, right?
>
>>>> Here again, we are introducing basic facilities for live migration,
>>>> and the implementation is transport-specific.
>>> Not relevant comment.
>>>
>>>>>> Config space is control path, DMA is data-path, let's better not
>>>>>> mix them, we never expect to use config space to transfer data.
>>>>>>
>>>>> And that control path is only for the init time configuration as
>>>>> correctly listed in the virtio spec as,
>>>>>
>>>>> " Device configuration space should only be used for
>>>>> initialization-time
>>>> parameters.".
>>>> don't you know new field reset_vq is introduced to virtio common cfg?
>>>> This is not only for initialization, right?
>>> Right. It was unfortunate and also it was last moment entry that we had fixed
>> in reset register polarity.
>> Appendix B. Creating New Device Types, and we are not introducing new device
>> type.
> The concept still applies to existing device type.
> It is illogical otherwise.
It is in the title: "New Device".
>
>>>> and your citation is from Appendix B. Creating New Device Types, are
>>>> we creating a new device type?
>>> That is guidance for the new device creation on "how to use config space?"
>>> It equally applied to existing devices too to not grow.
>>> The section is equally helpful for new creators and for extending devices like
>> you and me to understand what not to put in config space.
>> this does not make any sense, if you stick to the wording, then let me repeat
>> again "Appendix B. Creating New Device Types"!!!!!!
> Sorry, your implying is: new device type should be efficient and existing one can make it further bad. Does not make sense to me.
>
> It is fully logical to have only init time things in the config registers as done today in the spec for existing and new devices.
>
> I would be happy to extend B.5 Device improvements to capture it too.
That section applies to new devices, as its title says: "Appendix B. Creating
New Device Types".

And I agree we should modify this section and remove this limitation.
>
>>>>>> So we need DMA to transfer data, for example I take advantages of
>>>>>> device DMA to logging dirty pages, This also applies to in-flight descriptors.
>>>>>>
>>>>> Can you please explain why a virtqueue cannot be used for DMA bulk
>>>>> data transfer as listed in virtio spec.
>>>>> "The mechanism for bulk data transport on virtio devices is
>>>>> pretentiously called a virtqueue"
>>>> what is your point? vq can do DMA, so what?
>>> I am asking,
>>> If there is AQ on the member device, can you use it? If not, what is the
>>> technical reason(s) to not use it.
>> Repeated for many times, QOS, nested and so on.
> Why would there be any QoS when the AQ is on the member device for non-passthrough use case?
> Why nested won't work when the AQ is on the member device for non-passthrough use case?
Haven't we discussed these before?

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-24 10:10                                                                           ` Zhu, Lingshan
@ 2023-10-24 10:11                                                                             ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-24 10:11 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

Please fix the email client. The email is in non-text format.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23 11:32                                                                     ` Michael S. Tsirkin
@ 2023-10-24 10:27                                                                       ` Zhu, Lingshan
  2023-10-25  8:33                                                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-24 10:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


On 10/23/2023 7:32 PM, Michael S. Tsirkin wrote:
> On Mon, Oct 23, 2023 at 06:03:10PM +0800, Zhu, Lingshan wrote:
>> config space, MMIO, registers work for years, what is wrong with them?
> Nothing as such. They don't seem to be appropriate for all use-case
> where people want to utilize virtio. I think a new transport
> will be needed to address these.
A new transport for new types of devices, for sure, like transport vq for SIOV.

I agree admin vq or admin cmds are useful in some use cases; that is
another story and should be decided case by case.

For now, let's not talk about all use cases, just the current task:
live migration.

So IMHO, I still think we should use config space registers to control
the live migration process.
>
>>>> Config space is control path, DMA is data-path, let's better not mix them,
>>>> we never expect to use config space to transfer data.
>>>>
>>>> So we need DMA to transfer data, for example I take advantages of device DMA
>>>> to logging dirty pages, This also applies to in-flight descriptors.
>>> As long as you do, I personally see little benefit to retrieve parts of
>>> state with memory mapped accesses.
>> registers only control, and I personally believe a single register is much
>> better
>> than processing admin commands, more light-weight, more reliable, working
>> for years.
> Yea. It would be, if we could do everything through that register.
> But we can't really. Migration has too much data to pass around
> for that to be reasonable.
Data is not transferred through registers; they only control.

We transfer data by DMA: the device writes the dirty page information (a bitmap)
to an isolated host memory region.
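
As a rough illustration of the split described above (registers only control, the bitmap travels by DMA), here is a minimal C sketch of a dirty-page bitmap. The 4 KiB page size, the MAX_PAGES range, the struct layout and all function names are assumptions made for this example; none of them come from the virtio spec or from either proposal in this thread, and the device-side DMA write is only simulated by a plain function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12u    /* assumption: 4 KiB guest pages */
#define MAX_PAGES  1024u  /* hypothetical size of the tracked guest range */

/* Hypothetical host memory region that the device would fill by DMA. */
struct dirty_bitmap {
    uint64_t bits[MAX_PAGES / 64];
};

static void dirty_bitmap_clear(struct dirty_bitmap *bm)
{
    memset(bm->bits, 0, sizeof(bm->bits));
}

/* Simulated device side: mark the page containing guest address gpa. */
static void dirty_bitmap_mark(struct dirty_bitmap *bm, uint64_t gpa)
{
    uint64_t pfn = gpa >> PAGE_SHIFT;

    assert(pfn < MAX_PAGES);
    bm->bits[pfn / 64] |= (uint64_t)1 << (pfn % 64);
}

/* Host side: was page frame pfn written since the last clear? */
static bool dirty_bitmap_test(const struct dirty_bitmap *bm, uint64_t pfn)
{
    assert(pfn < MAX_PAGES);
    return (bm->bits[pfn / 64] >> (pfn % 64)) & 1u;
}

/* Host side: number of pages that would have to be sent again. */
static size_t dirty_bitmap_count(const struct dirty_bitmap *bm)
{
    size_t n = 0;

    for (uint64_t pfn = 0; pfn < MAX_PAGES; pfn++)
        n += dirty_bitmap_test(bm, pfn);
    return n;
}
```

A hypervisor would scan such a region and resend the marked pages; how the region is set up and isolated is outside this sketch.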
>
>> Config space interfaces are fundamental for virtio-pci.
>
> They are in fact fundamental to virtio. Multiple transports to
> use config space are also fundamental.
I agree. So I also agree to build the admin vq live migration solution on top of
our basic facilities, as Jason once proposed.
>
>
>>>> And we are implementing virtio live migration, not only for PCI.
>>>>
>>>> So both me and Jason keep repeating: We are implementing basic facilities,
>>>> and the implementation is transport specific.
>>> But the register based facilities you proposed are extremely limited and
>>> seem to only work for migration. For example, it seems mostly useless for
>>> debugging because retrieving state is rather complex and would
>>> interfere with normal working of the device.
>> If you want to prove the register controlling interfaces are extremely
>> limited than admin vq or admin cmds,
>> you are also proving config space registers are extremely limited than
>> admin vq.
> Yes. Migration needs ability to pass large amounts of data around, and
> is too complex a functionality to work reliably without ability to
> report errors.
What errors? When the device does DMA?
Missing some dirty pages? If the device can detect such errors, it can recover by itself;
otherwise, how can the driver fix this?

For the control path, virtio has used re-read for many years and it works well.
I believe we have gone through this issue before.
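
The write-then-re-read pattern mentioned above can be sketched as follows. This is a simulation under stated assumptions: struct fake_reg, its latency model and the function names are hypothetical and stand in for a real device register. The only grounded part is the pattern itself; for example, the PCI transport requires the driver, after writing 0 to device_status, to wait until a read of device_status returns 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a device register: a written value becomes
 * visible only after a given number of reads (made-up latency model).
 */
struct fake_reg {
    uint8_t current;              /* value the driver currently reads back */
    uint8_t pending;              /* value the device is still applying */
    unsigned reads_until_applied; /* remaining reads before it takes effect */
};

static void reg_write(struct fake_reg *r, uint8_t v, unsigned latency)
{
    r->pending = v;
    r->reads_until_applied = latency;
    if (latency == 0)
        r->current = v;
}

static uint8_t reg_read(struct fake_reg *r)
{
    if (r->reads_until_applied > 0 && --r->reads_until_applied == 0)
        r->current = r->pending;
    return r->current;
}

/*
 * Write v, then re-read until the device reports v.
 * Returns false if the value is not confirmed within max_polls reads.
 */
static bool write_and_confirm(struct fake_reg *r, uint8_t v,
                              unsigned latency, unsigned max_polls)
{
    assert(max_polls > 0);
    reg_write(r, v, latency);
    for (unsigned i = 0; i < max_polls; i++) {
        if (reg_read(r) == v)
            return true;
    }
    return false;
}
```

Note that the caller learns only whether the value was confirmed within max_polls reads; no richer error information is available through this pattern.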

>
>> So the question still here: do you want to replace current virtio-pci common
>> cfg
>> with admin vq or admin cmds?
> I think we need to add a new transport that will use admin commands.
> Which one to use would be up to a specific device.
For a new device type like SIOV, yes, we need a new transport: transport vq.

Let's focus on this live migration feature. If new features in the future
require admin vq, let's discuss them when they are proposed.
>
>
>> And debug what? If you want to introduce more functionalities, we should
>> discuss
>> case by case.
>>
>> If debugging vq state, it is as easy as read queue_size, I don't see the
>> limitations
>> as queue_size work for years.
> No one reads queue_size. In fact for years we didn't have any debugging
> functionality and we are fine. If we are adding it, it really needs to
> be accessible when driver and device are wedged.
OK, I don't object to implementing new device debugging features.

But let's focus on the current live migration task.
>
>
>> I still believe our goal is to do our best, with our capabilities, to build
>> the most optimal virtio spec
>> as we can do. Not other goals.
>>
>> Thanks
>> Zhu Lingshan
>>>
>>>> We have proposed to build admin vq based on our register solution, this can
>>>> somehow even help to resolve the nested issue.
>>>>
>>>> But I see the proposed has been rejected.
>>>>
>>>> I still believe the goal is to build a best spec, not "just can work" with
>>>> limitations.
>>>>
>>>>
>>>>

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-23 10:14                                                                 ` Parav Pandit
@ 2023-10-24 10:30                                                                   ` Zhu, Lingshan
  2023-10-24 10:37                                                                     ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-24 10:30 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/23/2023 6:14 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, October 23, 2023 3:39 PM
>>
>> On 10/20/2023 8:54 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Friday, October 20, 2023 3:01 PM
>>>>
>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
>>>>>>
>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>>>>>>>> Oh, really? Quite interesting, do you want to move all config
>>>>>>>>> space fields in VF to admin vq? Have a plan?
>>>>>>>> Not in my plan for spec 1.4 time frame.
>>>>>>>> I do not want to divert the discussion, would like to focus on
>>>>>>>> device
>>>>>> migration phases.
>>>>>>>> Lets please discuss in some other dedicated thread.
>>>>>>> Possibly, if there's a way to send admin commands to vf itself
>>>>>>> then Lingshan will be happy?
>>>>>> still need to prove why admin commands are better than registers.
>>>>> Virtio spec development is not proof based approach. Please stop asking for
>> it.
>>>>> I tried my best to have technical answer in [1].
>>>>> I explained that registers simply do not work for passthrough mode
>>>>> (if this is what you are asking when you are asking prove its better).
>>>>> They can work for non_passthrough mediated mode.
>>>>>
>>>>> A member device may do admin commands using registers. Michael and I
>>>>> are
>>>> discussing presently in the same thread.
>>>>> Since there are multiple things to be done for device migration,
>>>>> dedicated
>>>> register set for each functionality do not scale well, hard to
>>>> maintain and extend.
>>>>> A register holding a command content make sense.
>>>>>
>>>>> Now, with that, if this can be useful only for non_passthrough, I
>>>>> made humble
>>>> request to transport them using AQ, this way, you get all benefits of AQ.
>>>>> And trying to understand, why AQ cannot possible or inferior?
>>>>>
>>>>> If you have commands like suspend/resume device, register or queue
>>>> transport simply don’t work, because it's wrong to bifurcate the
>>>> device with such weird API.
>>>>> If you want to bifurcate for mediation software, it probably makes
>>>>> sense to
>>>> operate at each VQ level, config space level. Such are very different
>>>> commands than passthrough.
>>>>> I think vdpa has demonstrated that very well on how to do specific
>>>>> work for
>>>> specific device type. So some of those work can be done using AQ.
>>>>> [1]
>>>>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
>>>> We have been through your statement for many times.
>>>> This is not about how many times you repeated, if you think this is
>>>> true, you need to prove that with solid evidence.
>>>>
>>> I will not respond to this comment anymore.
>> Ok if you choose not to respond.
>>>> For pass-through, I still recommend you to take a reference of
>>>> current virtio-pci implementation, it works for pass-through, right?
>>> What do you mean by current virtio-pci implementation?
>> current virtio-pci works for pass-through
> I still don’t understand what is "current virtio-pci".
> Do you mean qemu implementation of emulated virtio-pci or you mean virtio-pci specification for passthrough?
> What do you want me to refer to for passthrough? Please clarify.
You know that a guest vCPU and its vRC cannot access host-side devices, so
there must be a driver helping the pass-through
use cases, like vDPA and vfio.
>
>>>> For scale, I already told you for many times that they are per-device
>>>> facilities. How can a per-device facility not scale?
>>> Each VF device must implement new set of on-chip memory-based registers
>> which demands more power, die area which does not scale efficiently to
>> thousands of VFs.
>> that can be fpga gates or SOC implementing new features, you think that is a
>> waste?
> It is waste in hw, if there is a better approach possible to not burn them as gates and save on resources for rarely used items.
Is a new entry in the MSI-X table a waste of HW?
Can I say implementing admin vq in an SoC is a waste of cores?
>
>
>>>> vDPA works fine on config space.
>>>>
>>>> So, if you still insist admin vq is better than config space like in
>>>> other thread you have concluded, you may imply that config space
>>>> interfaces should be re-factored to admin vq.
>>> Whatever is done in past is done, there is no way to change history.
>>> An new non init time registers should not be placed in device specific config
>> space as virtio spec has clear guideline on it for good.
>>> Device context reading, dirty page address reading, changing vf device modes,
>> all of these are clearly not a init time settings.
>>> Hence, they do not belong to the registers.
>> reset vq? and you get it from Appendix B. Creating New Device Types, are we
>> implementing a new type of device???
> I don’t understand your question.
> I replied the history of reset_vq.
> Take good examples to follow, reset_vq clearly is not the one.
So again, we are not implementing a new device type, so your citation
doesn't apply.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-24 10:30                                                                   ` Zhu, Lingshan
@ 2023-10-24 10:37                                                                     ` Parav Pandit
  2023-10-26  6:44                                                                       ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-24 10:37 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Tuesday, October 24, 2023 4:00 PM
> 
> On 10/23/2023 6:14 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, October 23, 2023 3:39 PM
> >>
> >> On 10/20/2023 8:54 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Friday, October 20, 2023 3:01 PM
> >>>>
> >>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Thursday, October 19, 2023 2:48 PM
> >>>>>>
> >>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> >>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> >>>>>>>>> Oh, really? Quite interesting, do you want to move all config
> >>>>>>>>> space fields in VF to admin vq? Have a plan?
> >>>>>>>> Not in my plan for spec 1.4 time frame.
> >>>>>>>> I do not want to divert the discussion, would like to focus on
> >>>>>>>> device
> >>>>>> migration phases.
> >>>>>>>> Lets please discuss in some other dedicated thread.
> >>>>>>> Possibly, if there's a way to send admin commands to vf itself
> >>>>>>> then Lingshan will be happy?
> >>>>>> still need to prove why admin commands are better than registers.
> >>>>> Virtio spec development is not proof based approach. Please stop
> >>>>> asking for
> >> it.
> >>>>> I tried my best to have technical answer in [1].
> >>>>> I explained that registers simply do not work for passthrough mode
> >>>>> (if this is what you are asking when you are asking prove its better).
> >>>>> They can work for non_passthrough mediated mode.
> >>>>>
> >>>>> A member device may do admin commands using registers. Michael and
> >>>>> I are
> >>>> discussing presently in the same thread.
> >>>>> Since there are multiple things to be done for device migration,
> >>>>> dedicated
> >>>> register set for each functionality do not scale well, hard to
> >>>> maintain and extend.
> >>>>> A register holding a command content make sense.
> >>>>>
> >>>>> Now, with that, if this can be useful only for non_passthrough, I
> >>>>> made humble
> >>>> request to transport them using AQ, this way, you get all benefits of AQ.
> >>>>> And trying to understand, why AQ cannot possible or inferior?
> >>>>>
> >>>>> If you have commands like suspend/resume device, register or queue
> >>>> transport simply don’t work, because it's wrong to bifurcate the
> >>>> device with such weird API.
> >>>>> If you want to bifurcate for mediation software, it probably
> >>>>> makes sense to
> >>>> operate at each VQ level, config space level. Such are very
> >>>> different commands than passthrough.
> >>>>> I think vdpa has demonstrated that very well on how to do specific
> >>>>> work for
> >>>> specific device type. So some of those work can be done using AQ.
> >>>>> [1]
> >>>>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> >>>> We have been through your statement for many times.
> >>>> This is not about how many times you repeated, if you think this is
> >>>> true, you need to prove that with solid evidence.
> >>>>
> >>> I will not respond to this comment anymore.
> >> Ok if you choose not to respond.
> >>>> For pass-through, I still recommend you to take a reference of
> >>>> current virtio-pci implementation, it works for pass-through, right?
> >>> What do you mean by current virtio-pci implementation?
> >> current virtio-pci works for pass-through
> > I still don’t understand what is "current virtio-pci".
> > Do you mean qemu implementation of emulated virtio-pci or you mean
> virtio-pci specification for passthrough?
> > What do you want me to refer to for passthrough? Please clarify.
> you know guest vcpu and its vRC can not access host side devices, and there
> must be a driver helping the pass-through use cases, like vDPA and vfio
I am not sure how to correlate this answer to the question of "virtio-pci for passthrough".
:(

Today, when a virtio-pci member device is passed through to the guest VM, the hypervisor is not involved in virtio interfaces such as config space, cvq, data vqs, etc.
Do you agree?

> >>>> For scale, I already told you for many times that they are
> >>>> per-device facilities. How can a per-device facility not scale?
> >>> Each VF device must implement new set of on-chip memory-based
> >>> registers
> >> which demands more power, die area which does not scale efficiently
> >> to thousands of VFs.
> >> that can be fpga gates or SOC implementing new features, you think
> >> that is a waste?
> > It is waste in hw, if there is a better approach possible to not burn them as
> gates and save on resources for rarely used items.
> Is a new entry in MSIX table a waste of HW?
Not as much as the existing MSI-X table entries, which require a linear amount of on-chip memory.

> Can I say implementing admin vq in SOC is a waste of cores?
Which cores in the SoC?
If it is on the PF, there is only a handful of AQs for a scale of N VFs.

> >
> >
> >>>> vDPA works fine on config space.
> >>>>
> >>>> So, if you still insist admin vq is better than config space like
> >>>> in other thread you have concluded, you may imply that config space
> >>>> interfaces should be re-factored to admin vq.
> >>> Whatever is done in past is done, there is no way to change history.
> >>> An new non init time registers should not be placed in device
> >>> specific config
> >> space as virtio spec has clear guideline on it for good.
> >>> Device context reading, dirty page address reading, changing vf
> >>> device modes,
> >> all of these are clearly not a init time settings.
> >>> Hence, they do not belong to the registers.
> >> reset vq? and you get it from Appendix B. Creating New Device Types,
> >> are we implementing a new type of device???
> > I don’t understand your question.
> > I replied the history of reset_vq.
> > Take good examples to follow, reset_vq clearly is not the one.
> so again, we are not implementing new device type, so your citation doesn't
> apply.
I disagree.
I am an engineer building practical systems, considering the limitations and also the advancements of the transport, while listening to other industry efforts.
I am not from the legal department.
Hence, it makes sense to me to apply Appendix B to existing devices; it also has a section for "device improvements".

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-24  4:49                                                 ` Parav Pandit
@ 2023-10-25  1:28                                                   ` Jason Wang
  2023-10-25  7:02                                                     ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-25  1:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Oct 24, 2023 at 12:49 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 24, 2023 10:16 AM
> >
> > On Mon, Oct 23, 2023 at 12:42 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, October 23, 2023 9:15 AM
> > > >
> > > > On Thu, Oct 19, 2023 at 1:32 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Jason Wang
> > > > > > Sent: Thursday, October 19, 2023 10:15 AM
> > > > > >
> > > > > > > > Again, if you don't want to talk about transport virtqueue,
> > > > > > > > that's fine. But let's leave the scalability issue aside as well.
> > > > > > > >
> > > > > > > Registers are related for functionality and scale.
> > > > > > >
> > > > > > > Lets first agree on use case before the design, that I asked above.
> > > > > > >
> > > > > > > I will wait to respond to any other emails until we agree on
> > > > > > > use case
> > > > > > requirements.
> > > > > >
> > > > > > There are more than just me who want you to define "passthrough"
> > > > > > first where you refuse to respond.
> > > > > >
> > > > > Totally disagree.
> > > > > In the previous email itself, I wrote what passthrough is.
> > > > > So let's try yet one more time.
> > > > > Either you can re-read last email or for better read below and see
> > > > > if it is
> > > > understood or not.
> > > > >
> > > > > > How could we make any agreement without an accurate the
> > > > > > definition of "passthrough" who is a key to understand each other?
> > > > >
> > > > > I replied few times in past emails but since those email threads
> > > > > are so long, it is
> > > > easy to miss out.
> > > > >
> > > > > Passthrough definition:
> > > > > a. virtio member device mapped to the guest vm
> > > >
> > > > I really think we need to be accurate here. For example, what does
> > > > "map" mean here?
> > > >
> > > Not trapped by hypervisor is better wording than mapped.
> > >
> > > > > b. only pci config space and msix of a member device is
> > > > > intercepted by
> > > > hypervisor.
> > > >
> > > > What's the criteria for choosing a cap/bar to be trapped or not? For
> > > > example, there're a lot of other things that need to be virtualized besides
> > MSI-X for sure.
> > > >
> > > For passthrough, which are those?
> >
> > I haven't gone through all the caps but this is what in my mind
> >
> > 1) vIOMMU related stuffs: ATS/PRI, assign PASID to a virtqueue in the future
> > 2) capability related to resources: like Resizable BAR etc
> For passthrough PASID assignment vq is not needed.

How do you know that? There is ongoing work to make vPASID work for
the guest, like vSVA. Virtio doesn't differ from other devices.

> If at all it is done, it will be done from the guest by the driver using virtio interface.

Then you need to trap. Such things couldn't be passed through to
guests directly.

> Capabilities of #2 is generic across all pci devices, so it will be handled by the HV.
> ATS/PRI cap is also generic manner handled by the HV and PCI device.

No, ATS/PRI requires the cooperation from the vIOMMU. You can simply
do ATS/PRI passthrough but with an emulated vIOMMU.

Thanks

>


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-24 10:01                                                       ` Parav Pandit
@ 2023-10-25  1:28                                                         ` Jason Wang
  2023-10-25  7:15                                                           ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-25  1:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Oct 24, 2023 at 6:02 PM Parav Pandit <parav@nvidia.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 24, 2023 10:27 AM
> >
> > On Mon, Oct 23, 2023 at 12:43 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, October 23, 2023 9:15 AM
> > > >
> > > > On Wed, Oct 18, 2023 at 6:23 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Wednesday, October 18, 2023 3:26 PM
> > > > >
> > > > > > For completeness, and to shorten the thread, can you please list
> > > > > > known issues/use cases that are addressed by the status bit
> > > > > > interface and how you plan for them to be addressed?
> > > > >
> > > > > I will avoid listing known issues for a moment for status bit in this email.
> > > > >
> > > > > Status bit interface helps in following good ways.
> > > > > 1. suspend/resume the device fully by the guest by negotiating the
> > > > > new
> > > > feature.
> > > > > This can be useful in the guest-controlled PM flows of suspend/resume.
> > > > > I still think for this, only feature bit is necessary, and
> > > > > device_status
> > > > modification is not needed.
> > > >
> > > > Which feature bit did you mean here?
> > > >
> > > A new feature bit to indicate the guest that device supports suspend and
> > resume, hence, there is no need to reset the device and destroy resources like
> > how it is done today.
> >
> > Well, I don't see how it is different from what LingShan proposed.
> The difference is, in passthrough mode, it will be fully controlled by the guest VM without involving hypervisor.

How does a reset in the guest work but not a suspend? You can choose to
pass the suspend through to the guest, and use save/load to migrate it.

> It will work even when device migration is ongoing.
> What Lingshan proposed involved messing with the device status.

Your proposal messes with the PCI semantics (as you want to rule the
behaviours like P2P).

> It should be separate register like how Jingchen proposed or not have register at all if the pci transport support it.

It should not; otherwise you will end up defining the interaction with
the status state machine.

> >
> > >
> > > > > D0->D3 and D3->D0 transition of the pci can suspend and resume the
> > > > > D0->device
> > > > which can preserve the last device_status value before entering D3.
> > > >
> > > > It's not only about the device status. I would not repeat the
> > > > question I've asked in another thread.
> > > >
> > > > What's more, if you really want to suspend/freeze at PCI level and
> > > > deal with PCI specific issues like P2P.  You should really try to
> > > > leverage or invent a PCI mechanism instead of trying to carry such
> > > > semantics via a virtio specific stuff like adminq. Solving transport
> > > > specific problems at the virtio level is a layer violation.
> > > >
> > > PCI spec has already defined what it needs to.
> >
> > If PCI spec has good support for suspend/resume, why bother inventing
> > mechanisms in virtio?
> >
> Because virtio today does not know if the PCI level suspend/resume will actually work or not,

It's not virtio's job to know about this. Otherwise, how much PCI
functionality would virtio need to understand? PCIe supports various
capabilities.

> because in past it has not worked even if the PM capability was exposed.

Let's fix the hypervisor then, but last time I checked, suspend/hibernation
worked at least for virtio-net.

> So only a feature bit is needed.
>
> > > SR-PCIM interface is already concluded being outside of PCI-spec by the pci-
> > sig.
> > > And no, there is no layer violation.
> > >
> > > Any non PCI member device can always implement necessary STOP mode as
> > no-op.
> > >
> > > And all of those talk make sense when one creates MMIO based member
> > device, until that point is just objections...
> >
> > They are different layers:
> >
> > 1) suspend/resume at virtio level
> > 2) suspend/resume at transport level
> >
> > We need both of them to satisfy different cases. Just as we need to reset at both
> > virtio and VF(FLR). Lingshan proposes 1) while it looks to me you propose 2) via
> > virtio adminq but you said it has been supported by PCI which is then a
> > duplication.
> >
> #1 is needed and to be owned by the guest driver in passthrough
> I didn’t propose #2.
> I proposed #2 be controlled by the vmm/hypervisor (via admin cmd) who is in charge of vm suspend/resume flow.

So you're saying it's the virtio level suspend but you want to limit
PCI transactions in P2P. That's not the suspend/resume at virtio level
for sure.

>
> > >
> > > > > (Like preserving all rest of the fields of common and other device config).
> > > > > This is orthogonal and needed regardless of device migration.
> > > > >
> > > > > 2. If one does not want to passthrough a member device, but build
> > > > > a mediation-based device on top of existing virtio device, It can
> > > > > be useful with
> > > > mediating software.
> > > > > Here the mediating software has ample duplicated knowledge of what
> > > > > the
> > > > member device already has.
> > > >
> > > > It is the way the hypervisors are doing for not only virtio but also
> > > > for CPU and MMU as well.
> > > >
> > > Not really, vcpus and VMCS and more are part of the hardware support.
> >
> > That's not the context here. Hypervisors need to know almost every detail to
> > make CPU virtualization work.
> Cpu virtualization is accelerated for 1st level nesting including interrupts.
>
> > That's the fact, and it works for virio as well for years.
> >
> > What's more, nothing prevents us from inventing something similar in virtio to
> > speed up the context switch or migration if necessary.
> The major difference with cpu virtualization with nw device virtualization is, former flow is controlled by the sw, the later one is controlled by the network which is not predictable.

The guest behaviour is also unpredictable, and guests may share
memory with others. I don't see your point.

> Hence, and context switching can mostly work in theory and not perform well with varied workload.

I don't think so, vCPU context is much more complicated than most of
the virtio devices. I don't see why it can't work for simple virtio
devices.

> Most production users prefer dedicated/isolated non_context switched rx.

I don't think you can speak for "most production users" here. Such use
cases are limited by the missing save/load mechanism.

>
> >
> > > 2 level nested page tables is hw support.
> > > Anything beyond 2 level nesting, likely involves hypervisor.
> >
> > Needs emulation/trap for sure. That's the point.
> >
> > >
> > > > > This can fulfil the nested requirement differently provided a
> > > > > platform support
> > > > it.
> > > > > (PASID limitation will be practical blocker here).
> > > >
> > > > I don't think PASID is a blocker. It is only a blocker if you want to do
> > passthrough.
> > > >
> > > Even without passthrough, one needs to steer the hypervisor DMA to non
> > guest memory.
> > > And guest driver must not be able to attack (read/write) from that memory.
> > > I don’t see how one can do this without PASID. As all DMAs are tagged using
> > only RID.
> >
> > There are a lot of other ways, but in order to converge, we can leave it for
> > future discussions.
> >
> So, first level passthrough seems a basic requirement to support to operate from vmm control.
>
> 2nd level nesting can be emulated or accelerated to follow the principles of the paper you pointed.
>
> > What's more, if we design virtio for the future, PASID must be considered as a
> > way as we all know it would come for sure.
> >
> For future PASID be fully controlled by the guest to continue like today.
> PASID based bifurcation is still open question to me.

It is by design, e.g. devices can have a secondary PASID. It's not hard
to understand. And it's much simpler than doing "bifurcation" in the PF.

>
> > >
> > > > >
> > > > > How to I plan to address above two?
> > > > > a. #1 to be addressed by having the _F_PM bit, when the bit is
> > > > > negotiated PCI
> > > > PM drives the state.
> > > >
> > > > We can't duplicate every transport specific feature in virtio. This
> > > > is a layer violation again. We should reuse the PCI facility here.
> > > >
> > > It is reused by having the feature bit to indicate that device supports
> > suspend/resume.
> > > If from Day_1, if the PCI PM bits used, it would not require the feature bit.
> > > But that was not the case.
> > > So the guest driver do not know if using the PCI PM bit is enough to decide, if
> > suspend/resume by guest will work or not.
> > > Hence the feature bit.
> >
> > Anyhow you need to update the driver if it has an issue. In the update, you can
> > check and use PCI PM. If it doesn't have PCI PM, you can only suspend/resume
> > at virtio level. Defining transport semantics at the virtio level breaks the layers.
> >
> This series does not define transport semantics at virtio level.

Don't you want to limit P2P in those states?

> It only defines virtio level semantics of what to be done/not done.
>
> > >
> > > > > This will work orthogonal to VMM side migration and will co-exist
> > > > > with VMM
> > > > based device migration.
> >
> > Actually not, if PF can suspend VF via PCI facilities, that would be no layer
> > violation any more.
> >
> There is no such PCI facility.

If you want to make passthrough work without layer violation, you need either:

1) invent them in PCI

or

2) Trap and let the hypervisor control how the suspend is implemented;
for example, the hypervisor can choose to control the PM of the VF

> PCI capabilities is not supposed to contain device migration kind of complex commands.

We're discussing suspending here, no? As for PCI, even if capabilities
are not suitable, that doesn't mean we can't extend PCI through other
means. Anyhow, this is really irrelevant to the discussion here.
Virtio does virtio, not PCI; you can't invent new features in virtio in
order to extend or fix the function of PCI.

> I explained in the discussion with Michael.
>
> > > > >
> > > > > b. nested use case:
> > > > > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > > > > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> > > >
> > > > Let me ask it again here, how can you migrate L2 using L1 "emulated"
> > > > PF? Emulation?
> > > >
> > > Emulation is one way as most nested platform components do.
> >
> > That's the point, you can't avoid emulation.
> It is applicable only after first level.
> First level must be able to take the benefit without emulation like rest of the system modules do today.

You can't avoid traps and emulation. So the key is what/when/where to
trap; that is the logic behind my questions.

If you want to pass through virtio facilities without trap and emulation,
you need to justify that.

Thanks




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-25  1:28                                                   ` Jason Wang
@ 2023-10-25  7:02                                                     ` Parav Pandit
  2023-10-26  0:46                                                       ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-25  7:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, October 25, 2023 6:59 AM
> > For passthrough PASID assignment vq is not needed.
> 
> How do you know that? 
Because for passthrough, the hypervisor is not involved in dealing with VQ at all.

> There are works ongoing to make vPASID work for the
> guest like vSVA. Virtio doesn't differ from other devices.
Passthrough does not run like SVA. Each passthrough device has PASIDs from its own space, fully managed by the guest.
Some CPUs required vPASID, and SIOV is not going this way anymore.

> 
> > If at all it is done, it will be done from the guest by the driver using virtio
> interface.
> 
> Then you need to trap. Such things couldn't be passed through to guests directly.
> 
Only the PASID capability is trapped. PASID allocation and usage are done directly from the guest.
Regardless, it is not relevant to passthrough mode, as PASID is yet another resource.
And if it is trapped for some CPUs, that is done in a generic layer that does not require virtio involvement.
So having the virtio interface ask to trap something because a generic facility has done so is not the approach.

> > Capabilities of #2 is generic across all pci devices, so it will be handled by the
> HV.
> > ATS/PRI cap is also generic manner handled by the HV and PCI device.
> 
> No, ATS/PRI requires the cooperation from the vIOMMU. You can simply do
> ATS/PRI passthrough but with an emulated vIOMMU.
And that is not a reason for a virtio device to build trap+emulation for passthrough member devices.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25  1:28                                                         ` Jason Wang
@ 2023-10-25  7:15                                                           ` Parav Pandit
  2023-10-25  8:24                                                             ` Michael S. Tsirkin
  2023-10-26  0:46                                                             ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-25  7:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, October 25, 2023 6:59 AM

[..]
> > > resume, hence, there is no need to reset the device and destroy
> > > resources like how it is done today.
> > >
> > > Well, I don't see how it is different from what LingShan proposed.
> > The difference is, in passthrough mode, it will be fully controlled by the guest
> VM without involving hypervisor.
> 
> How does a rest in guest work but not suspend? You can choose to pass through
> the suspend to the guest, and use save/load to migrate it.
> 
> > It will work even when device migration is ongoing.
> > What Lingshan proposed involved messing with the device status.
> 
> Your proposal messes with the PCI semantics (as you want to rule the
> behaviours like P2P).
> 
> > It should be separate register like how Jingchen proposed or not have register
> at all if the pci transport support it.
> 
> It should not, then you will end up defining the interaction with the status state
> machine.
> 
Treating all registers equally and synchronizing them in the device is a better model, as it does not bifurcate the device.

> > >
> > > >
> > > > > > D0->D3 and D3->D0 transition of the pci can suspend and resume
> > > > > > D0->the device
> > > > > which can preserve the last device_status value before entering D3.
> > > > >
> > > > > It's not only about the device status. I would not repeat the
> > > > > question I've asked in another thread.
> > > > >
> > > > > What's more, if you really want to suspend/freeze at PCI level
> > > > > and deal with PCI specific issues like P2P.  You should really
> > > > > try to leverage or invent a PCI mechanism instead of trying to
> > > > > carry such semantics via a virtio specific stuff like adminq.
> > > > > Solving transport specific problems at the virtio level is a layer violation.
> > > > >
> > > > PCI spec has already defined what it needs to.
> > >
> > > If PCI spec has good support for suspend/resume, why bother
> > > inventing mechanisms in virtio?
> > >
> > Because virtio today does not know if the PCI level suspend/resume
> > will actually work or not,
> 
> It's not the charge of virtio to know about this. Otherwise how many PCI stuffs
> needs virtio to understand? PCIE supports various capabilities.
> 
> > because in past it has not worked even if the PM capability was exposed.
> 
> Let's fix the hypervisor but last time I checked, suspend/hibernation works at
> least for virtio-net.
Because it destroyed the resources and re-created them.
It didn’t resume from where it left off. Ideally it should have done that.
Even if you fix the hypervisor, the guest does not know that it is fixed in the hypervisor, so the guest does not know when to skip the current reset flow.
Hence the bit is needed.
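
To illustrate the point about the guest not knowing which flow is safe, here is a minimal sketch of a guest driver choosing its suspend flow from the negotiated features. The bit number, the enum, and the function names are all hypothetical; this series only refers to the bit as _F_PM and assigns it no value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit number: the thread names the bit _F_PM but assigns no value. */
#define VIRTIO_F_PM_HYPOTHETICAL 62

/* Hypothetical names for the two flows discussed in the thread. */
enum suspend_flow {
    SUSPEND_FLOW_RESET_AND_RECREATE, /* today's flow: reset, destroy and re-create resources */
    SUSPEND_FLOW_PRESERVE_STATE      /* rely on PCI PM (D0->D3->D0) preserving device state */
};

/* Return true if the given feature bit is set in the negotiated feature set. */
static bool feature_negotiated(uint64_t negotiated, unsigned int bit)
{
    return ((negotiated >> bit) & 1ULL) != 0;
}

/*
 * Without the feature bit the guest cannot tell whether suspend/resume
 * preserves state, so it has to fall back to the reset flow.
 */
static enum suspend_flow choose_suspend_flow(uint64_t negotiated_features)
{
    if (feature_negotiated(negotiated_features, VIRTIO_F_PM_HYPOTHETICAL))
        return SUSPEND_FLOW_PRESERVE_STATE;
    return SUSPEND_FLOW_RESET_AND_RECREATE;
}
```

With no bit negotiated, the sketch always selects the reset flow, which matches the behaviour described above for current drivers.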

> 
> > So only a feature bit is needed.
> >
> > > > SR-PCIM interface is already concluded being outside of PCI-spec
> > > > by the pci-
> > > sig.
> > > > And no, there is no layer violation.
> > > >
> > > > Any non PCI member device can always implement necessary STOP mode
> > > > as
> > > no-op.
> > > >
> > > > And all of those talk make sense when one creates MMIO based
> > > > member
> > > device, until that point is just objections...
> > >
> > > They are different layers:
> > >
> > > 1) suspend/resume at virtio level
> > > 2) suspend/resume at transport level
> > >
> > > We need both of them to satisfy different cases. Just as we need to
> > > reset at both virtio and VF(FLR). Lingshan proposes 1) while it
> > > looks to me you propose 2) via virtio adminq but you said it has
> > > been supported by PCI which is then a duplication.
> > >
> > #1 is needed and to be owned by the guest driver in passthrough I
> > didn’t propose #2.
> > I proposed #2 be controlled by the vmm/hypervisor (via admin cmd) who is in
> charge of vm suspend/resume flow.
> 
> So you're saying it's the virtio level suspend but you want to limit PCI
> transactions in P2P. That's not the suspend/resume at virtio level for sure.
> 
Every virtio instruction translates to an underlying transport construct,
be it driver notification, device notification moderation, or VQ DMA.

Similarly, mode setting translates to its transport binding.

> >
> > > >
> > > > > > (Like preserving all rest of the fields of common and other device
> config).
> > > > > > This is orthogonal and needed regardless of device migration.
> > > > > >
> > > > > > 2. If one does not want to passthrough a member device, but
> > > > > > build a mediation-based device on top of existing virtio
> > > > > > device, It can be useful with
> > > > > mediating software.
> > > > > > Here the mediating software has ample duplicated knowledge of
> > > > > > what the
> > > > > member device already has.
> > > > >
> > > > > It is the way the hypervisors are doing for not only virtio but
> > > > > also for CPU and MMU as well.
> > > > >
> > > > Not really, vcpus and VMCS and more are part of the hardware support.
> > >
> > > That's not the context here. Hypervisors need to know almost every
> > > detail to make CPU virtualization work.
> > Cpu virtualization is accelerated for 1st level nesting including interrupts.
> >
> > > That's the fact, and it works for virio as well for years.
> > >
> > > What's more, nothing prevents us from inventing something similar in
> > > virtio to speed up the context switch or migration if necessary.
> > The major difference with cpu virtualization with nw device virtualization is,
> former flow is controlled by the sw, the later one is controlled by the network
> which is not predictable.
> 
> The guest behaviour is also unpredictable, and guests may share memories with
> others. I don't see your point.
> 
> > Hence, and context switching can mostly work in theory and not perform well
> with varied workload.
> 
> I don't think so, vCPU context is much more complicated than most of the virtio
> devices. I don't see why it can't work for simple virtio devices.
> 

So try to switch an RQ between two VMs at a 100Gbps packet rate without a packet drop and see how it performs.

> > Most production users prefer dedicated/isolated non_context switched rx.
> 
> I don't think you can cover "most production users" here. Such use cases are
> limited with the missing save/load mechanism.
And they have apparently been blocked since 2021, when these device migration efforts started.
Not anymore.
> 
> >
> > >
> > > > 2 level nested page tables is hw support.
> > > > Anything beyond 2 level nesting, likely involves hypervisor.
> > >
> > > Needs emulation/trap for sure. That's the point.
> > >
> > > >
> > > > > > This can fulfil the nested requirement differently provided a
> > > > > > platform support
> > > > > it.
> > > > > > (PASID limitation will be practical blocker here).
> > > > >
> > > > > I don't think PASID is a blocker. It is only a blocker if you
> > > > > want to do
> > > passthrough.
> > > > >
> > > > Even without passthrough, one needs to steer the hypervisor DMA to
> > > > non
> > > guest memory.
> > > > And guest driver must not be able to attack (read/write) from that
> memory.
> > > > I don’t see how one can do this without PASID. As all DMAs are
> > > > tagged using
> > > only RID.
> > >
> > > There are a lot of other ways, but in order to converge, we can
> > > leave it for future discussions.
> > >
> > So, first level passthrough seems a basic requirement to support to operate
> from vmm control.
> >
> > 2nd level nesting can be emulated or accelerated to follow the principles of
> the paper you pointed.
> >
> > > What's more, if we design virtio for the future, PASID must be
> > > considered as a way as we all know it would come for sure.
> > >
> > For future PASID be fully controlled by the guest to continue like today.
> > PASID based bifurcation is still open question to me.
> 
> It is by design, e.g devices can have secondary PASID. It's not hard to
> understand. And it's much simpler than doing "bifurcation" in PF.
> 
> >
> > > >
> > > > > >
> > > > > > How to I plan to address above two?
> > > > > > a. #1 to be addressed by having the _F_PM bit, when the bit is
> > > > > > negotiated PCI
> > > > > PM drives the state.
> > > > >
> > > > > We can't duplicate every transport specific feature in virtio.
> > > > > This is a layer violation again. We should reuse the PCI facility here.
> > > > >
> > > > It is reused by having the feature bit to indicate that device
> > > > supports
> > > suspend/resume.
> > > > If from Day_1, if the PCI PM bits used, it would not require the feature bit.
> > > > But that was not the case.
> > > > So the guest driver do not know if using the PCI PM bit is enough
> > > > to decide, if
> > > suspend/resume by guest will work or not.
> > > > Hence the feature bit.
> > >
> > > Anyhow you need to update the driver if it has an issue. In the
> > > update, you can check and use PCI PM. If it doesn't have PCI PM, you
> > > can only suspend/resume at virtio level. Defining transport semantics at the
> virtio level breaks the layers.
> > >
> > This series does not define transport semantics at virtio level.
> 
> Don't you want to limit P2P in those states?
> 
At the virtio level, they are not defined.
The virtio-to-transport binding defines it, just as every single virtio construct has a transport binding, from notification, DMA, and SR-IOV to anything else.

> > It only defines virtio level semantics of what to be done/not done.
> >
> > > >
> > > > > > This will work orthogonal to VMM side migration and will
> > > > > > co-exist with VMM
> > > > > based device migration.
> > >
> > > Actually not, if PF can suspend VF via PCI facilities, that would be
> > > no layer violation any more.
> > >
> > There is no such PCI facility.
> 
> If you want to make passthrough work without layer violation, you need either:
> 
> 1) invent them in the PCI
> 
This will follow the paper you pointed to and follow all the principles listed there.

> or
> 
> 2) Trap and let hypervisor to control how to implement the suspend, for
> example hypervisor can choose to control the PM of VF
> 
> > PCI capabilities is not supposed to contain device migration kind of complex
> commands.
> 
> We're discussing suspending here, no? Talking about PCI, even if capabilities are
> not, it doesn't mean we can't extend PCI to use others. Anyhow, this is really ir-
> revelant to the discussion here.
A PCI capability cannot contain complex virtio-specific RW registers.
The vendor-defined capability was done for largely RO things, which is OK.

> Virtio does virtio not PCI, you can't invent new features in virtio in order to be
> able to extend or fix the function of PCI.
Virtio needs to live with the limitations of PCI and also needs to extend PCI when it needs to.

> 
> > I explained in the discussion with Michael.
> >
> > > > > >
> > > > > > b. nested use case:
> > > > > > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > > > > > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> > > > >
> > > > > Let me ask it again here, how can you migrate L2 using L1 "emulated"
> > > > > PF? Emulation?
> > > > >
> > > > Emulation is one way as most nested platform components do.
> > >
> > > That's the point, you can't avoid emulation.
> > It is applicable only after first level.
> > First level must be able to take the benefit without emulation like rest of the
> system modules do today.
> 
> You can't avoid traps and emulation. So the key is what/when/where to trap,
> this is my logic of questions .
> 
I propose to do the nesting of the VF and follow the same model as 2 level nested page tables that actually work in the hw.

> You want to pass through virtio facilities without trap and emulation, you need
> to justify that.

For the first level, it is clear that passthrough works without trap and emulation, like a CPU page table walk.
I don’t know what you mean by justification, but passthrough is the requirement.
N-level nesting is a secondary requirement, for which PCI-SIG should be consulted if needed.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25  7:15                                                           ` Parav Pandit
@ 2023-10-25  8:24                                                             ` Michael S. Tsirkin
  2023-10-25  9:50                                                               ` Parav Pandit
  2023-10-26  0:46                                                             ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-25  8:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 25, 2023 at 07:15:24AM +0000, Parav Pandit wrote:
> Treating all registers equally and synchronizing it in the device is better model to not bifurcate the device.

Glad you like it, though I don't think many hypervisors treat all
registers equally. Personally I think where the ideal balance lies
depends on the specific hardware and software. And it is not wise to
tie ourselves to the specific needs du jour.

So can we please stop arguing about which model is better on this list and
instead focus on building reusable components that different devices can
use as they see fit?

-- 
MST


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-24 10:27                                                                       ` Zhu, Lingshan
@ 2023-10-25  8:33                                                                         ` Michael S. Tsirkin
  2023-10-26  0:56                                                                           ` Jason Wang
  2023-10-26  6:38                                                                           ` Zhu, Lingshan
  0 siblings, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-25  8:33 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Tue, Oct 24, 2023 at 06:27:04PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 10/23/2023 7:32 PM, Michael S. Tsirkin wrote:
> 
>     On Mon, Oct 23, 2023 at 06:03:10PM +0800, Zhu, Lingshan wrote:
> 
>         config space, MMIO, registers work for years, what is wrong with them?
> 
>     Nothing as such. They don't seem to be appropriate for all use-case
>     where people want to utilize virtio. I think a new transport
>     will be needed to address these.
> 
> New transport for new type of devices for sure, like transport vq for SIOV.
> 
> I agree admin vq or admin cmds are useful in some use cases, that is
> another story, should be case by case.
> 
> For now, let's don't talk about all-use cases, just for current task, for live
> migration.
> 
> So IMHO, I still think we should use config space registers to control live
> migration process.
> 
> 

No, because it forces integrating the migration process with the device
driver, which is OK for some use-cases but not all of them.  Find some
other control plane for this.


>                 Config space is control path, DMA is data-path, let's better not mix them,
>                 we never expect to use config space to transfer data.
> 
>                 So we need DMA to transfer data, for example I take advantages of device DMA
>                 to logging dirty pages, This also applies to in-flight descriptors.
> 
>             As long as you do, I personally see little benefit to retrieve parts of
>             state with memory mapped accesses.
> 
>         registers only control, and I personally believe a single register is much
>         better
>         than processing admin commands, more light-weight, more reliable, working
>         for years.
> 
>     Yea. It would be, if we could do everything through that register.
>     But we can't really. Migration has too much data to pass around
>     for that to be reasonable.
> 
> data are not transferred by registers, they only control.
> 
> We transfer data by DMA, the device writes DMA dirty pages information(bitmap)
> to host isolated memory region.
> 


If you do that then I don't see any reason not to use admin
commands for that - either through a vq or a simpler
interface.
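
For readers following along, the dirty page bitmap mentioned in the quoted text can be sketched as below. This is only an illustration, assuming 4 KiB pages and one bit per page frame; the function names and the layout are hypothetical and are not defined by any proposal in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size of 4 KiB; the thread does not specify a granularity. */
#define PAGE_SHIFT_ASSUMED 12

/* Mark every page touched by a write of len bytes starting at address addr. */
static void dirty_bitmap_mark(uint8_t *bitmap, uint64_t addr, uint64_t len)
{
    if (len == 0)
        return;

    uint64_t first = addr >> PAGE_SHIFT_ASSUMED;
    uint64_t last = (addr + len - 1) >> PAGE_SHIFT_ASSUMED;

    for (uint64_t pfn = first; pfn <= last; pfn++)
        bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/* Return 1 if the page frame was marked dirty, 0 otherwise. */
static int dirty_bitmap_test(const uint8_t *bitmap, uint64_t pfn)
{
    return (bitmap[pfn / 8] >> (pfn % 8)) & 1;
}
```

The caller must size the bitmap to cover the whole tracked address range. How the bitmap reaches the hypervisor (registers plus DMA, or an admin command) is exactly the open question here, and the sketch takes no position on it.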


> 
>         Config space interfaces are fundamental for virtio-pci.
> 
> 
>     They are in fact fundamental to virtio. Multiple transports to
>     use config space are also fundamental.
> 
> I agree. So I also agree to build admin vq live migration solution based on our
> basic facilities, as Jason ever proposed.


I'm not sure it's even a vq. I suggest a minimal interface to send
admin commands. Could be used by migration, as transport, and more.

> 
> 
> 
>                 And we are implementing virtio live migration, not only for PCI.
> 
>                 So both me and Jason keep repeating: We are implementing basic facilities,
>                 and the implementation is transport specific.
> 
>             But the register based facilities you proposed are extremely limited and
>             seem to only work for migration. For example, it seems mostly useless for
>             debugging because retrieving state is rather complex and would
>             interfere with normal working of the device.
> 
>         If you want to prove the register controlling interfaces are more
>         limited than admin vq or admin cmds,
>         you are also proving config space registers are more limited than
>         admin vq.
> 
>     Yes. Migration needs ability to pass large amounts of data around, and
>     is too complex a functionality to work reliably without ability to
>     report errors.
> 
> what errors? when device DMA?
> missing some dirty pages? If the device can detect such errors, it can recover
> by itself,
> or how can driver fix this?

Not just pages, there's a lot of internal device state.

You fix it, for example, by reporting that the state does not work
for the current device, so the guest can be restarted on the migration
source.


> for control path, virtio uses re-read for many years and it works well.

Let's not even get started with how live migration currently "works
well".  I happen to be familiar with it intimately.  We tried to
maintain migration compatibility as best we could and we tend to break it
every second release.


> I believe we have gone through this issue before.
>  
> 
> 
> 
>         So the question still here: do you want to replace current virtio-pci common
>         cfg
>         with admin vq or admin cmds?
> 
>     I think we need to add a new transport that will use admin commands.
>     Which one to use would be up to a specific device.
> 
> For new device type like SIOV, yes we need a new transport, transport vq.
> 
> Let's focus on this live migration feature, if there are new features in the
> future
> requires admin vq, let's discuss when they proposed.
> 
> 
> 
> 
>         And debug what? If you want to introduce more functionalities, we should
>         discuss
>         case by case.
> 
>         If debugging vq state, it is as easy as read queue_size, I don't see the
>         limitations
>         as queue_size work for years.
> 
>     No one reads queue_size. In fact for years we didn't have any debugging
>     functionality and we are fine. If we are adding it, it really needs to
>     be accessible when driver and device are wedged.
> 
> OK, I don't disagree to implement new device debugging features.
> 
> But let's focus on current live migration task.
> 
> 
> 
> 
>         I still believe our goal is to do our best, with our capabilities, to build
>         the most optimal virtio spec
>         as we can do. Not other goals.
> 
>         Thanks
>         Zhu Lingshan
> 
> 
> 
>                 We have proposed to build admin vq based on our register solution, this can
>                 somehow even help to resolve the nested issue.
> 
>                 But I see the proposal has been rejected.
> 
>                 I still believe the goal is to build a best spec, not "just can work" with
>                 limitations.
> 
> 
> 
> 
> 


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25  8:24                                                             ` Michael S. Tsirkin
@ 2023-10-25  9:50                                                               ` Parav Pandit
  2023-10-25 10:19                                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-25  9:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 25, 2023 1:55 PM
> 
> On Wed, Oct 25, 2023 at 07:15:24AM +0000, Parav Pandit wrote:
> > Treating all registers equally and synchronizing it in the device is better model
> to not bifurcate the device.
> 
> Glad you like it, though I don't think many hypervisors treat all registers equally.
> Personally I think where the ideal balance lies depends on specific
> hardware and software. And it is not wise to tie ourselves to specific needs du
> jour.
> 
> So can we please stop arguing about which model is better on this list and
> instead focus on building reusable components that different devices can use as
> they see fit?

Sure. 
There is nothing special done in the hypervisor for a passthrough member device for the virtio common and device-specific areas.
Going forward, for passthrough, nothing special is to be done for the virtio common and device-specific registers, by design.
So nothing special needs to be built for it here on the member device anyway.
Hence, I will not debate it again unless it is questioned again.

Other models can anyway do anything suitable with these registers.
Hence, I believe device context as defined now is still reusable as common building block.
For non-passthrough, such hypervisor can simply ignore the fields which it is not interested in.


* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25  9:50                                                               ` Parav Pandit
@ 2023-10-25 10:19                                                                 ` Michael S. Tsirkin
  2023-10-25 10:22                                                                   ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-25 10:19 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 25, 2023 at 09:50:02AM +0000, Parav Pandit wrote:
> Hence, I believe device context as defined now is still reusable as common building block.
> For non-passthrough, such hypervisor can simply ignore the fields which it is not interested in.

Interesting. So a hypervisor encounters a field it does not
recognize. How does hypervisor know that a field is safe to ignore?

-- 
MST



* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25 10:19                                                                 ` Michael S. Tsirkin
@ 2023-10-25 10:22                                                                   ` Parav Pandit
  2023-10-25 10:28                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-25 10:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 25, 2023 3:50 PM
> 
> On Wed, Oct 25, 2023 at 09:50:02AM +0000, Parav Pandit wrote:
> > Hence, I believe device context as defined now is still reusable as common
> building block.
> > For non-passthrough, such hypervisor can simply ignore the fields which it is
> not interested in.
> 
> Interesting. So a hypervisor encounters a field it does not recognize. How does
> hypervisor know that a field is safe to ignore?

In v2, all the supported fields are published in a query command.
If the hypervisor wants to know the precise fields of interest, then when the device reports an unknown field, it can decide whether to ignore it or to fail.
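
A minimal sketch of that decision, assuming the query command returns a plain list of field type identifiers. The identifiers, the function name, and the strict/lenient switch are all hypothetical; the series does not define them. The sketch only detects unknown fields and applies a blanket policy to them.

```python
def check_device_context_fields(supported, known, strict=True):
    """Return the field types the device reports that we do not know.

    supported: field type ids reported by the (assumed) query command
    known:     field type ids this hypervisor understands
    strict:    if True, refuse to proceed when any field is unknown
    """
    unknown = sorted(set(supported) - set(known))
    if unknown and strict:
        raise ValueError(f"unknown device context fields: {unknown}")
    # In lenient mode the unknown fields are only reported back;
    # the caller has chosen to ignore them.
    return unknown
```

For example, `check_device_context_fields([1, 2, 9], [1, 2], strict=False)` reports field 9 as unknown and carries on, while the default strict mode raises instead.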


* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25 10:22                                                                   ` Parav Pandit
@ 2023-10-25 10:28                                                                     ` Michael S. Tsirkin
  2023-10-26  3:32                                                                       ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-25 10:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 25, 2023 at 10:22:03AM +0000, Parav Pandit wrote:
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Wednesday, October 25, 2023 3:50 PM
> > 
> > On Wed, Oct 25, 2023 at 09:50:02AM +0000, Parav Pandit wrote:
> > > Hence, I believe device context as defined now is still reusable as common
> > building block.
> > > For non-passthrough, such hypervisor can simply ignore the fields which it is
> > not interested in.
> > 
> > Interesting. So a hypervisor encounters a field it does not recognize. How does
> > hypervisor know that a field is safe to ignore?
> 
> In v2 all the supported fields are published in a query command.
> If hypervisor wants to know the precise fields of interest, when the device reports unknown field, it can make its decision to ignore or fail to work.

It really can't make any decisions here. Some fields are safe to ignore,
some are not but how can you tell?

-- 
MST



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-25  7:02                                                     ` Parav Pandit
@ 2023-10-26  0:46                                                       ` Jason Wang
  2023-10-26  3:45                                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-26  0:46 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, October 25, 2023 6:59 AM
> > > For passthrough PASID assignment vq is not needed.
> >
> > How do you know that?
> Because for passthrough, the hypervisor is not involved in dealing with VQ at all.

Ok, so if I understand correctly, you are saying your design can't
work for the case of PASID assignment.

>
> > There are works ongoing to make vPASID work for the
> > guest like vSVA. Virtio doesn't differ from other devices.
> Passthrough do not run like SVA.

Great, you found another limitation of "passthrough" by yourself.

> Each passthrough device has PASID from its own space fully managed by the guest.
> Some CPUs required vPASID, and SIOV is not going this way anymore.

Then how to migrate? Invent a full set of something else through
another giant series like this to migrate to the SIOV thing? That's a
mess for sure.

>
> >
> > > If at all it is done, it will be done from the guest by the driver using virtio
> > interface.
> >
> > Then you need to trap. Such things couldn't be passed through to guests directly.
> >
> Only PASID capability is trapped. PASID allocation and usage is directly from guest.

How can you achieve this? Assigning a PASID to a device is completely
device (virtio) specific. How can you use a general layer without the
knowledge of virtio to trap that?

> Regardless it is not relevant to passthrough mode as PASID is yet another resource.
> And for some cpu if it is trapped, it is generic layer, that does not require virtio involvement.
> So the virtio interface asking to trap something because a generic facility has done it is not the approach.

This misses the point of PASID. How to use PASID is totally device specific.

>
> > > Capabilities of #2 is generic across all pci devices, so it will be handled by the
> > HV.
> > > ATS/PRI cap is also generic manner handled by the HV and PCI device.
> >
> > No, ATS/PRI requires the cooperation from the vIOMMU. You can simply do
> > ATS/PRI passthrough but with an emulated vIOMMU.
> And that is not the reason for virtio device to build trap+emulation for passthrough member devices.

vIOMMU is emulated by hypervisor with a PRI queue, how can you pass
through a hardware PRI request to a guest directly without trapping it
then? What's more, PCIE allows the PRI to be done in a vendor (virtio)
specific way, so you want to break this rule? Or you want to blacklist
ATS/PRI for virtio?

Thanks



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25  7:15                                                           ` Parav Pandit
  2023-10-25  8:24                                                             ` Michael S. Tsirkin
@ 2023-10-26  0:46                                                             ` Jason Wang
  2023-10-26  3:50                                                               ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-26  0:46 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Oct 25, 2023 at 3:15 PM Parav Pandit <parav@nvidia.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, October 25, 2023 6:59 AM
>
> [..]
> > > > resume, hence, there is no need to reset the device and destroy
> > > > resources like how it is done today.
> > > >
> > > > Well, I don't see how it is different from what LingShan proposed.
> > > The difference is, in passthrough mode, it will be fully controlled by the guest
> > VM without involving hypervisor.
> >
> > How does a reset in guest work but not suspend? You can choose to pass through
> > the suspend to the guest, and use save/load to migrate it.
> >
> > > It will work even when device migration is ongoing.
> > > What Lingshan proposed involved messing with the device status.
> >
> > Your proposal messes with the PCI semantics (as you want to rule the
> > behaviours like P2P).
> >
> > > It should be separate register like how Jingchen proposed or not have register
> > at all if the pci transport support it.
> >
> > It should not, then you will end up defining the interaction with the status state
> > machine.
> >
> Treating all registers equally and synchronizing it in the device is better model to not bifurcate the device.

If you keep using misleading terminology like bifurcation, the
discussion will be endless. Or you need to define it first.

For example, you still haven't succeeded in defining passthrough. And
you fail to explain why reset is part of the device status but suspend
can't.

>
> > > >
> > > > >
> > > > > > > D0->D3 and D3->D0 transition of the pci can suspend and resume the device
> > > > > > > which can preserve the last device_status value before entering D3.
> > > > > >
> > > > > > It's not only about the device status. I would not repeat the
> > > > > > question I've asked in another thread.
> > > > > >
> > > > > > What's more, if you really want to suspend/freeze at PCI level
> > > > > > and deal with PCI specific issues like P2P.  You should really
> > > > > > try to leverage or invent a PCI mechanism instead of trying to
> > > > > > carry such semantics via a virtio specific stuff like adminq.
> > > > > > Solving transport specific problems at the virtio level is a layer violation.
> > > > > >
> > > > > PCI spec has already defined what it needs to.
> > > >
> > > > If PCI spec has good support for suspend/resume, why bother
> > > > inventing mechanisms in virtio?
> > > >
> > > Because virtio today does not know if the PCI level suspend/resume
> > > will actually work or not,
> >
> > It's not the charge of virtio to know about this. Otherwise how many PCI stuffs
> > needs virtio to understand? PCIE supports various capabilities.
> >
> > > because in past it has not worked even if the PM capability was exposed.
> >
> > Let's fix the hypervisor but last time I checked, suspend/hibernation works at
> > least for virtio-net.
> Because it destroyed the resource and re-created them.

Who is "it"?

> It didn’t resume from where it left off. Ideally it should have done that.

We are developing virtio that has been implemented by multiple
different hypervisors/drivers. We can't introduce feature bits to
workaround issues that are spotted in just one of the hypervisors.

> Even if you fix the hypervisor, guest does not know that it is fixed in hypervisor, so guest does not know when to skip the current flow of reset.

From the view of the driver, it's more than sufficient to do what it
thinks is correct:

1) Use suspend at virtio layer, this requires a new feature but it's
not for PM for sure
2) Leverage the PM facility provided by the transport
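
The two options can be sketched as a simple selection in the driver. The
names below are hypothetical, and the preference order (transport PM first,
then a virtio-level suspend feature, then the existing reset-and-re-create
flow) is an assumption made for illustration, not something this thread
has settled.

```python
from enum import Enum


class SuspendMethod(Enum):
    TRANSPORT_PM = "transport power management (e.g. PCI PM)"
    VIRTIO_SUSPEND = "virtio-level suspend feature"
    RESET = "reset and re-create resources"


def choose_suspend_method(has_transport_pm: bool,
                          has_virtio_suspend_feature: bool) -> SuspendMethod:
    """Pick a suspend mechanism for the driver.

    Assumed order: transport PM, then virtio-level suspend,
    then the reset-based flow as a last resort.
    """
    if has_transport_pm:
        return SuspendMethod.TRANSPORT_PM
    if has_virtio_suspend_feature:
        return SuspendMethod.VIRTIO_SUSPEND
    return SuspendMethod.RESET
```

Either way the driver decides from what it can observe itself, without
needing to know what was fixed in a particular hypervisor.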

> Hence the bit is needed.

Nope, a lot of bugs were fixed by the PCI layer and we don't do new
feature bits for them. The only concern is the migration compatibility
which is out of the scope of virtio.

>
> >
> > > So only a feature bit is needed.
> > >
> > > > > SR-PCIM interface is already concluded being outside of PCI-spec
> > > > > by the pci-
> > > > sig.
> > > > > And no, there is no layer violation.
> > > > >
> > > > > Any non PCI member device can always implement necessary STOP mode
> > > > > as
> > > > no-op.
> > > > >
> > > > > And all of those talk make sense when one creates MMIO based
> > > > > member
> > > > device, until that point is just objections...
> > > >
> > > > They are different layers:
> > > >
> > > > 1) suspend/resume at virtio level
> > > > 2) suspend/resume at transport level
> > > >
> > > > We need both of them to satisfy different cases. Just as we need to
> > > > reset at both virtio and VF(FLR). Lingshan proposes 1) while it
> > > > looks to me you propose 2) via virtio adminq but you said it has
> > > > been supported by PCI which is then a duplication.
> > > >
> > > #1 is needed and to be owned by the guest driver in passthrough I
> > > didn’t propose #2.
> > > I proposed #2 be controlled by the vmm/hypervisor (via admin cmd) who is in
> > charge of vm suspend/resume flow.
> >
> > So you're saying it's the virtio level suspend but you want to limit PCI
> > transactions in P2P. That's not the suspend/resume at virtio level for sure.
> >
> Every virtio instruction translates to its underlying transport construct.
> Be it driver notification or device notification moderation or vq dma.
>
> Similarly, mode setting translated to its transport binding.

I don't see your point. What I meant is, you need to use PCI
facilities to synchronize P2P instead of virtio. This is not hard to
understand, as there's no P2P definition in the core virtio layer
and P2P is not guaranteed to be useful for all of the other
transports.

>
> > >
> > > > >
> > > > > > > (Like preserving all rest of the fields of common and other device
> > config).
> > > > > > > This is orthogonal and needed regardless of device migration.
> > > > > > >
> > > > > > > 2. If one does not want to passthrough a member device, but
> > > > > > > build a mediation-based device on top of existing virtio
> > > > > > > device, It can be useful with
> > > > > > mediating software.
> > > > > > > Here the mediating software has ample duplicated knowledge of
> > > > > > > what the
> > > > > > member device already has.
> > > > > >
> > > > > > It is the way the hypervisors are doing for not only virtio but
> > > > > > also for CPU and MMU as well.
> > > > > >
> > > > > Not really, vcpus and VMCS and more are part of the hardware support.
> > > >
> > > > That's not the context here. Hypervisors need to know almost every
> > > > detail to make CPU virtualization work.
> > > Cpu virtualization is accelerated for 1st level nesting including interrupts.
> > >
> > > > That's the fact, and it works for virtio as well for years.
> > > >
> > > > What's more, nothing prevents us from inventing something similar in
> > > > virtio to speed up the context switch or migration if necessary.
> > > The major difference with cpu virtualization with nw device virtualization is,
> > former flow is controlled by the sw, the later one is controlled by the network
> > which is not predictable.
> >
> > The guest behaviour is also unpredictable, and guests may share memories with
> > others. I don't see your point.
> >
> > > Hence, and context switching can mostly work in theory and not perform well
> > with varied workload.
> >
> > I don't think so, vCPU context is much more complicated than most of the virtio
> > devices. I don't see why it can't work for simple virtio devices.
> >
>
> So try to switch a RQ between two VMs at 100Gbps packet rate without a packet drop and see how it performs.

How can you avoid a packet drop if the VM is scheduled out? And did I
say such scheduling is needed in every use case?

So:

1) You insist on a proposal that can only work in specific setups and
with a specific type of hypervisor
2) I try to show you there are just more setups and more types of
hypervisor that need to be considered

That's the point. When I was talking about 2), you told me it doesn't
fit for 1). If you want to converge this discussion, let's not shift
concepts.

Again, developing a feature with fewer assumptions and more use cases
is much more beneficial to virtio.

>
> > > Most production users prefer dedicated/isolated non_context switched rx.
> >
> > I don't think you can cover "most production users" here. Such use cases are
> > limited with the missing save/load mechanism.
> And they apparently are being blocked from year 2021 when these device migration efforts started.
> Not any more..

I don't get your point. It's blocked just because there are still
questions you haven't answered.

It's not a matter of how long but a matter of how good your proposal can be.

> >
> > >
> > > >
> > > > > 2 level nested page tables is hw support.
> > > > > Anything beyond 2 level nesting, likely involves hypervisor.
> > > >
> > > > Needs emulation/trap for sure. That's the point.
> > > >
> > > > >
> > > > > > > This can fulfil the nested requirement differently provided a
> > > > > > > platform support
> > > > > > it.
> > > > > > > (PASID limitation will be practical blocker here).
> > > > > >
> > > > > > I don't think PASID is a blocker. It is only a blocker if you
> > > > > > want to do
> > > > passthrough.
> > > > > >
> > > > > Even without passthrough, one needs to steer the hypervisor DMA to
> > > > > non
> > > > guest memory.
> > > > > And guest driver must not be able to attack (read/write) from that
> > memory.
> > > > > I don’t see how one can do this without PASID. As all DMAs are
> > > > > tagged using
> > > > only RID.
> > > >
> > > > There are a lot of other ways, but in order to converge, we can
> > > > leave it for future discussions.
> > > >
> > > So, first level passthrough seems a basic requirement to support to operate
> > from vmm control.
> > >
> > > 2nd level nesting can be emulated or accelerated to follow the principles of
> > the paper you pointed.
> > >
> > > > What's more, if we design virtio for the future, PASID must be
> > > > considered as a way as we all know it would come for sure.
> > > >
> > > For future PASID be fully controlled by the guest to continue like today.
> > > PASID based bifurcation is still open question to me.
> >
> > It is by design, e.g devices can have secondary PASID. It's not hard to
> > understand. And it's much simpler than doing "bifurcation" in PF.
> >
> > >
> > > > >
> > > > > > >
> > > > > > > How to I plan to address above two?
> > > > > > > a. #1 to be addressed by having the _F_PM bit, when the bit is
> > > > > > > negotiated PCI
> > > > > > PM drives the state.
> > > > > >
> > > > > > We can't duplicate every transport specific feature in virtio.
> > > > > > This is a layer violation again. We should reuse the PCI facility here.
> > > > > >
> > > > > It is reused by having the feature bit to indicate that device
> > > > > supports
> > > > suspend/resume.
> > > > > If from Day_1, if the PCI PM bits used, it would not require the feature bit.
> > > > > But that was not the case.
> > > > > So the guest driver do not know if using the PCI PM bit is enough
> > > > > to decide, if
> > > > suspend/resume by guest will work or not.
> > > > > Hence the feature bit.
> > > >
> > > > Anyhow you need to update the driver if it has an issue. In the
> > > > update, you can check and use PCI PM. If it doesn't have PCI PM, you
> > > > can only suspend/resume at virtio level. Defining transport semantics at the
> > virtio level breaks the layers.
> > > >
> > > This series does not define transport semantics at virtio level.
> >
> > Don't you want to limit P2P in those states?
> >
> At virtio level, they are not defined.

Assuming virtio level, why did you keep mentioning P2P and your V2
wants to freeze/stop to synchronize with FLR? That's really
self-contradictory.

> Virtio to transport binding has it like every single virtio construct has transport binding from notification, dma, sriov to anything else.

So you explain yourself here, virtio can't live with a specific
transport, if you want to rule transport specific behaviour like P2P,
why not do that directly at transport level? You know the P2P is not
necessarily done between 2 virtio devices, so how to synchronize it in
that case? If you'd expect a state like freeze/stop exists on all
other PCI devices, PCI is obviously a better place to do that.

>
> > > It only defines virtio level semantics of what to be done/not done.
> > >
> > > > >
> > > > > > > This will work orthogonal to VMM side migration and will
> > > > > > > co-exist with VMM
> > > > > > based device migration.
> > > >
> > > > Actually not, if PF can suspend VF via PCI facilities, that would be
> > > > no layer violation any more.
> > > >
> > > There is no such PCI facility.
> >
> > If you want to make passthrough work without layer violation, you need either:
> >
> > 1) invent them in the PCI
> >
> This will follow the paper you pointed and follow all the principles listed there.
>
> > or
> >
> > 2) Trap and let hypervisor to control how to implement the suspend, for
> > example hypervisor can choose to control the PM of VF
> >
> > > PCI capabilities is not supposed to contain device migration kind of complex
> > commands.
> >
> > We're discussing suspending here, no? Talking about PCI, even if capabilities are
> > not, it doesn't mean we can't extend PCI to use others. Anyhow, this is really
> > irrelevant to the discussion here.
> PCI capability cannot contain virtio specific RW complex registers.

Again, if you want to keep debating the existing virtio-pci design,
let's use another thread please. I've explained in another thread that
this is not true. If you don't want the discussion to converge, you can
keep raising those unrelated topics here.

> Vendor defined capability was done which is largely RO things which is ok.
>
> > Virtio does virtio not PCI, you can't invent new features in virtio in order to be
> > able to extend or fix the function of PCI.
> Virtio needs to live with the limitation of the PCI

Yes, but it doesn't mean virtio needs to work around those limitations.

And before blaming PCI, could it be a problem with your proposed architecture?

>and also needs to extend the PCI when it needs to.

That's the way to go.

>
> >
> > > I explained in the discussion with Michael.
> > >
> > > > > > >
> > > > > > > b. nested use case:
> > > > > > > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > > > > > > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> > > > > >
> > > > > > Let me ask it again here, how can you migrate L2 using L1 "emulated"
> > > > > > PF? Emulation?
> > > > > >
> > > > > Emulation is one way as most nested platform components do.
> > > >
> > > > That's the point, you can't avoid emulation.
> > > It is applicable only after first level.
> > > First level must be able to take the benefit without emulation like rest of the
> > system modules do today.
> >
> > You can't avoid traps and emulation. So the key is what/when/where to trap,
> > this is my logic of questions .
> >
> I propose to do the nesting of the VF and follow the same model as 2 level nested page tables that actually work in the hw.

It's not. I don't see where the 2 level nested tables are located, and
it brings a lot of complexity to the architecture.

>
> > You want to pass through virtio facilities without trap and emulation, you need
> > to justify that.
>
> For first level, it is clear to passthrough without trap and emulation like cpu page table walkthough.
> I don’t know what you mean by justification,

Again. There are just too many questions that are still not answered.
I would not repeat it again here. It's your job to convince people.

For example, assuming you are correct, you still fail to explain:

1) what is trapped and what is not, i.e. where the boundary is
2) whether things still work if the hypervisor is not developed with
those assumptions

Thanks







> but it is the requirement to passthrough.
> N level nesting is secondary requirement that should consult PCI-SIG if needed.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25  8:33                                                                         ` Michael S. Tsirkin
@ 2023-10-26  0:56                                                                           ` Jason Wang
  2023-10-26  3:58                                                                             ` Parav Pandit
  2023-10-26  6:22                                                                             ` Michael S. Tsirkin
  2023-10-26  6:38                                                                           ` Zhu, Lingshan
  1 sibling, 2 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-26  0:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Oct 25, 2023 at 4:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Oct 24, 2023 at 06:27:04PM +0800, Zhu, Lingshan wrote:
> >
> >
> > On 10/23/2023 7:32 PM, Michael S. Tsirkin wrote:
> >
> >     On Mon, Oct 23, 2023 at 06:03:10PM +0800, Zhu, Lingshan wrote:
> >
> >         config space, MMIO, registers work for years, what is wrong with them?
> >
> >     Nothing as such. They don't seem to be appropriate for all use-case
> >     where people want to utilize virtio. I think a new transport
> >     will be needed to address these.
> >
> > New transport for new type of devices for sure, like transport vq for SIOV.
> >
> > I agree admin vq or admin cmds are useful in some use cases, that is
> > another story, should be case by case.
> >
> > For now, let's don't talk about all-use cases, just for current task, for live
> > migration.
> >
> > So IMHO, I still think we should use config space registers to control live
> > migration process.
> >
> >
>
> No because it forces integrating migration process with device driver.
> Which is ok for some use-cases but not all of them.  Find some other
> control plane for this.
>
>
> >                 Config space is control path, DMA is data-path, let's better not mix them,
> >                 we never expect to use config space to transfer data.
> >
> >                 So we need DMA to transfer data, for example I take advantages of device DMA
> >                 to logging dirty pages, This also applies to in-flight descriptors.
> >
> >             As long as you do, I personally see little benefit to retrieve parts of
> >             state with memory mapped accesses.
> >
> >         registers only control, and I personally believe a single register is much
> >         better
> >         than processing admin commands, more light-weight, more reliable, working
> >         for years.
> >
> >     Yea. It would be, if we could do everything through that register.
> >     But we can't really. Migration has too much data to pass around
> >     for that to be reasonable.
> >
> > data are not transferred by registers, they only control.
> >
> > We transfer data by DMA, the device writes DMA dirty pages information(bitmap)
> > to host isolated memory region.
> >
>
>
> If you do that then I don't see any reason not to use admin
> commands for that - either through a vq or a simpler
> interface.

I think we need to agree that admin commands are the only interface
for any future features before we can have an agreement here.

My understanding is that it is optional for the transport that
requires administrative commands like provisioning etc. It is not
necessarily the interface for new features.

>
>
> >
> >         Config space interfaces are fundamental for virtio-pci.
> >
> >
> >     They are in fact fundamental to virtio. Multiple transports to
> >     use config space are also fundamental.
> >
> > I agree. So I also agree to build admin vq live migration solution based on our
> > basic facilities, as Jason ever proposed.
>
>
> I'm not sure it's even a vq. I suggest a minimal interface to send
> admin commands. Could be used by migration, as transport, and more.
>

It's better if we can do that below the layer of admin commands. For
example, we don't tie device status to any specific interface. We
can keep doing things like this.

Thanks




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25 10:28                                                                     ` Michael S. Tsirkin
@ 2023-10-26  3:32                                                                       ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-26  3:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, October 25, 2023 3:59 PM
> 
> On Wed, Oct 25, 2023 at 10:22:03AM +0000, Parav Pandit wrote:
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Wednesday, October 25, 2023 3:50 PM
> > >
> > > On Wed, Oct 25, 2023 at 09:50:02AM +0000, Parav Pandit wrote:
> > > > Hence, I believe device context as defined now is still reusable
> > > > as common
> > > building block.
> > > > For non-passthrough, such hypervisor can simply ignore the fields
> > > > which it is
> > > not interested in.
> > >
> > > Interesting. So a hypervisor encounters a field it does not
> > > recognize. How does hypervisor know that a field is safe to ignore?
> >
> > In v2 all the supported fields are published in a query command.
> > If hypervisor wants to know the precise fields of interest, when the device
> reports unknown field, it can make its decision to ignore or fail to work.
> 
> It really can't make any decisions here. Some fields are safe to ignore, some are
> not but how can you tell?
If unknown fields can appear, the hypervisor can decline to operate this device.
The fields do not have to be marked as mandatory or optional.
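
A minimal sketch of that policy, assuming the device context is a flat list of type/length/value fields. The 4-byte header layout (16-bit type, 16-bit length, little-endian) and the name ctx_check_fields are hypothetical and are not taken from the series; the sketch only shows a hypervisor refusing a device as soon as it meets a field type it does not know.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical field header: 2-byte type, 2-byte length, little-endian. */
#define FIELD_HDR_LEN 4u

static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Walk a device context buffer.
 * Returns  1 if every field type is in known[],
 *          0 if an unknown field type is found (caller declines the device),
 *         -1 if the buffer is malformed (truncated header or value).
 */
int ctx_check_fields(const uint8_t *buf, size_t len,
                     const uint16_t *known, size_t n_known)
{
    size_t off = 0;

    assert(buf != NULL || len == 0);
    while (off < len) {
        if (len - off < FIELD_HDR_LEN)
            return -1;
        uint16_t type = get_le16(buf + off);
        uint16_t flen = get_le16(buf + off + 2);
        off += FIELD_HDR_LEN;
        if (len - off < flen)
            return -1;

        int found = 0;
        for (size_t i = 0; i < n_known; i++) {
            if (known[i] == type) {
                found = 1;
                break;
            }
        }
        if (!found)
            return 0;       /* unknown field: do not operate this device */
        off += flen;        /* skip the value of a known field */
    }
    return 1;
}
```

Note that this all-or-nothing check does not answer the question of which unknown fields would be safe to ignore; it only avoids having to answer it.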



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-26  0:46                                                       ` Jason Wang
@ 2023-10-26  3:45                                                         ` Parav Pandit
  2023-10-30  4:06                                                           ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-26  3:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, October 26, 2023 6:16 AM
> 
> On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > For passthrough PASID assignment vq is not needed.
> > >
> > > How do you know that?
> > Because for passthrough, the hypervisor is not involved in dealing with VQ at
> all.
> 
> Ok, so if I understand correctly, you are saying your design can't work for the
> case of PASID assignment.
>
No. PASID assignment will happen from the guest for its own use, and device migration will just work because the device context will capture this.
 
> >
> > > There are works ongoing to make vPASID work for the guest like vSVA.
> > > Virtio doesn't differ from other devices.
> > Passthrough do not run like SVA.
> 
> Great, you find another limitation of "passthrough" by yourself.
> 
No, it is not a limitation. Passthrough simply does not need complex SVA to split the device for unrelated usage.

> > Each passthrough device has PASID from its own space fully managed by the
> guest.
> > Some CPUs require vPASID, and SIOV is not going this way anymore.
> 
> Then how to migrate? Invent a full set of something else through another giant
> series like this to migrate to the SIOV thing? That's a mess for sure.
>
SIOV will for sure reuse most or all parts of this work, almost entirely as is.
vPASID is a CPU/platform-specific thing, not part of the SIOV devices.

> >
> > >
> > > > If at all it is done, it will be done from the guest by the driver
> > > > using virtio
> > > interface.
> > >
> > > Then you need to trap. Such things couldn't be passed through to guests
> directly.
> > >
> > Only PASID capability is trapped. PASID allocation and usage is directly from
> guest.
> 
> How can you achieve this? Assigning a PASID to a device is completely
> device(virtio) specific. How can you use a general layer without the knowledge
> of virtio to trap that?
When one wants to map a vPASID to a pPASID, the platform needs to be involved.
When a virtio passthrough device is in the guest, it has all of its PASIDs accessible.

All of this is a large deviation from the current discussion of this series, so I will keep it short.

> 
> > Regardless it is not relevant to passthrough mode as PASID is yet another
> resource.
> > And for some cpu if it is trapped, it is generic layer, that does not require virtio
> involvement.
> > So virtio interface asking to trap something because generic facility has done
> in not the approach.
> 
> This misses the point of PASID. How to use PASID is totally device specific.
Sure, and how to virtualize vPASID/pPASID is platform specific, as a single PASID can be used by multiple devices and processes.

> 
> >
> > > > Capabilities of #2 is generic across all pci devices, so it will
> > > > be handled by the
> > > HV.
> > > > ATS/PRI cap is also generic manner handled by the HV and PCI device.
> > >
> > > No, ATS/PRI requires the cooperation from the vIOMMU. You can simply
> > > do ATS/PRI passthrough but with an emulated vIOMMU.
> > And that is not the reason for virtio device to build trap+emulation for
> passthrough member devices.
> 
> vIOMMU is emulated by hypervisor with a PRI queue, 
PRI requests arrive on the PF for the VF.

> how can you pass
> through a hardware PRI request to a guest directly without trapping it then?
> What's more, PCIE allows the PRI to be done in a vendor (virtio) specific way, so
> you want to break this rule? Or you want to blacklist ATS/PRI for virtio?
> 
I was aware of only the PCI-SIG way of doing PRI.
Do you have a reference to the ECN that enables a vendor-specific way of doing PRI? I would like to read it.
That would be very good for eliminating the IOMMU PRI limitations.
PRI will go directly to the guest driver, and the guest would interact with the IOMMU to service the paging request through IOMMU APIs.
PRI done in a vendor-specific way needs a separate discussion. It is not related to live migration.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  0:46                                                             ` Jason Wang
@ 2023-10-26  3:50                                                               ` Parav Pandit
  2023-10-30  4:04                                                                 ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-26  3:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Jason Wang

> For example, you still haven't succeeded in defining passthrough. 
It was defined on 19th Oct in [1].
Which part of the definition of a passthrough device is not clear to you?

[1] https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  0:56                                                                           ` Jason Wang
@ 2023-10-26  3:58                                                                             ` Parav Pandit
  2023-10-30  3:59                                                                               ` Jason Wang
  2023-10-26  6:22                                                                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-26  3:58 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, October 26, 2023 6:27 AM
> 
> On Wed, Oct 25, 2023 at 4:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Oct 24, 2023 at 06:27:04PM +0800, Zhu, Lingshan wrote:
> > >
> > >
> > > On 10/23/2023 7:32 PM, Michael S. Tsirkin wrote:
> > >
> > >     On Mon, Oct 23, 2023 at 06:03:10PM +0800, Zhu, Lingshan wrote:
> > >
> > >         config space, MMIO, registers work for years, what is wrong with them?
> > >
> > >     Nothing as such. They don't seem to be appropriate for all use-case
> > >     where people want to utilize virtio. I think a new transport
> > >     will be needed to address these.
> > >
> > > New transport for new type of devices for sure, like transport vq for SIOV.
> > >
> > > I agree admin vq or admin cmds are useful in some use cases, that is
> > > another story, should be case by case.
> > >
> > > For now, let's don't talk about all-use cases, just for current
> > > task, for live migration.
> > >
> > > So IMHO, I still think we should use config space registers to
> > > control live migration process.
> > >
> > >
> >
> > No because it forces integrating migration process with device driver.
> > Which is ok for some use-cases but not all of them.  Find some other
> > control plane for this.
> >
> >
> > >                 Config space is control path, DMA is data-path, let's better not mix
> them,
> > >                 we never expect to use config space to transfer data.
> > >
> > >                 So we need DMA to transfer data, for example I take advantages of
> device DMA
> > >                 to logging dirty pages, This also applies to in-flight descriptors.
> > >
> > >             As long as you do, I personally see little benefit to retrieve parts of
> > >             state with memory mapped accesses.
> > >
> > >         registers only control, and I personally believe a single register is much
> > >         better
> > >         than processing admin commands, more light-weight, more reliable,
> working
> > >         for years.
> > >
> > >     Yea. It would be, if we could do everything through that register.
> > >     But we can't really. Migration has too much data to pass around
> > >     for that to be reasonable.
> > >
> > > data are not transferred by registers, they only control.
> > >
> > > We transfer data by DMA, the device writes DMA dirty pages
> > > information(bitmap) to host isolated memory region.
> > >
> >
> >
> > If you do that then I don't see any reason not to use admin commands
> > for that - either through a vq or a simpler interface.
> 
> I think we need to agree that admin commands are the only interface for any
> future features before we can have an agreement here.
>
For passthrough member devices, these admin commands are transported via the owner device by the migration driver, which sits outside of the guest VM.

If this basic requirement is not clear, it's pointless to discuss further.
How is the passthrough device defined? It is defined in [1].

Can non-passthrough device modes such as vdpa also use these admin commands from the owner device?
Sure, why not?

[1] https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  0:56                                                                           ` Jason Wang
  2023-10-26  3:58                                                                             ` Parav Pandit
@ 2023-10-26  6:22                                                                             ` Michael S. Tsirkin
  2023-10-30  4:02                                                                               ` Jason Wang
  2023-11-01  0:33                                                                               ` Jason Wang
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-26  6:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 26, 2023 at 08:56:47AM +0800, Jason Wang wrote:
> > > We transfer data by DMA, the device writes DMA dirty pages information(bitmap)
> > > to host isolated memory region.
> > >
> >
> >
> > If you do that then I don't see any reason not to use admin
> > commands for that - either through a vq or a simpler
> > interface.
> 
> I think we need to agree that admin commands are the only interface
> for any future features before we can have an agreement here.

I don't think that needs to be the case. I do think that if
your goal is a separate channel from normal device operation
then this is what admin commands have been designed for.

> My understanding is that it is optional for the transport that
> requires administrative commands like provisioning etc. It is not
> necessarily the interface for new features.

Yes. And migration is IMO sufficiently "like provisioning".

> >
> >
> > >
> > >         Config space interfaces are fundamental for virtio-pci.
> > >
> > >
> > >     They are in fact fundamental to virtio. Multiple transports to
> > >     use config space are also fundamental.
> > >
> > > I agree. So I also agree to build admin vq live migration solution based on our
> > > basic facilities, as Jason ever proposed.
> >
> >
> > I'm not sure it's even a vq. I suggest a minimal interface to send
> > admin commands. Could be used by migration, as transport, and more.
> >
> 
> It's better if we can do that below the layer of admin commands. For
> example, we don't stick device status with any specific interface. We
> can keep doing things like this.
> 
> Thanks

Could go either way, but complex functionality like live migration
can benefit from a rich interface.

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-25  8:33                                                                         ` Michael S. Tsirkin
  2023-10-26  0:56                                                                           ` Jason Wang
@ 2023-10-26  6:38                                                                           ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-26  6:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/25/2023 4:33 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 24, 2023 at 06:27:04PM +0800, Zhu, Lingshan wrote:
>>
>> On 10/23/2023 7:32 PM, Michael S. Tsirkin wrote:
>>
>>      On Mon, Oct 23, 2023 at 06:03:10PM +0800, Zhu, Lingshan wrote:
>>
>>          config space, MMIO, registers work for years, what is wrong with them?
>>
>>      Nothing as such. They don't seem to be appropriate for all use-case
>>      where people want to utilize virtio. I think a new transport
>>      will be needed to address these.
>>
>> New transport for new type of devices for sure, like transport vq for SIOV.
>>
>> I agree admin vq or admin cmds are useful in some use cases, that is
>> another story, should be case by case.
>>
>> For now, let's don't talk about all-use cases, just for current task, for live
>> migration.
>>
>> So IMHO, I still think we should use config space registers to control live
>> migration process.
>>
>>
> No because it forces integrating migration process with device driver.
> Which is ok for some use-cases but not all of them.  Find some other
> control plane for this.
Integrating the migration process how? It needs to work with the host driver anyway.
>
>
>>                  Config space is control path, DMA is data-path, let's better not mix them,
>>                  we never expect to use config space to transfer data.
>>
>>                  So we need DMA to transfer data, for example I take advantages of device DMA
>>                  to logging dirty pages, This also applies to in-flight descriptors.
>>
>>              As long as you do, I personally see little benefit to retrieve parts of
>>              state with memory mapped accesses.
>>
>>          registers only control, and I personally believe a single register is much
>>          better
>>          than processing admin commands, more light-weight, more reliable, working
>>          for years.
>>
>>      Yea. It would be, if we could do everything through that register.
>>      But we can't really. Migration has too much data to pass around
>>      for that to be reasonable.
>>
>> data are not transferred by registers, they only control.
>>
>> We transfer data by DMA, the device writes DMA dirty pages information(bitmap)
>> to host isolated memory region.
>>
>
> If you do that then I don't see any reason not to use admin
> commands for that - either through a vq or a simpler
> interface.
I mean the device writing a bitmap by DMA to report dirty pages.
Why would you want admin cmds to do that?
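
As a rough sketch of how a host could consume such a bitmap once the device has written it: the layout (one bit per page, bit n % 8 of byte n / 8) and the name collect_dirty_pages are assumptions made for illustration and are not defined by either proposal in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout: bit (n % 8) of byte (n / 8) is set when page n
 * has been written since tracking started.
 *
 * Stores the indices of dirty pages in out[] (at most max_out entries)
 * and returns how many were stored.
 */
size_t collect_dirty_pages(const uint8_t *bitmap, size_t npages,
                           uint64_t *out, size_t max_out)
{
    size_t count = 0;

    assert(bitmap != NULL && out != NULL);
    for (size_t n = 0; n < npages && count < max_out; n++) {
        if (bitmap[n / 8] & (1u << (n % 8)))
            out[count++] = (uint64_t)n;
    }
    return count;
}
```

Whichever control interface starts and stops the tracking, a consumer like this stays the same, because the bitmap itself travels by DMA.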
>
>
>>          Config space interfaces are fundamental for virtio-pci.
>>
>>
>>      They are in fact fundamental to virtio. Multiple transports to
>>      use config space are also fundamental.
>>
>> I agree. So I also agree to build admin vq live migration solution based on our
>> basic facilities, as Jason ever proposed.
>
> I'm not sure it's even a vq. I suggest a minimal interface to send
> admin commands. Could be used by migration, as transport, and more.
So you still need to explain why admin cmds are better than registers for
this live migration task.

Or, if you are saying admin cmds are better than config space, I am not sure I
agree with that statement.
>
>>
>>
>>                  And we are implementing virtio live migration, not only for PCI.
>>
>>                  So both me and Jason keep repeating: We are implementing basic facilities,
>>                  and the implementation is transport specific.
>>
>>              But the register based facilities you proposed are extremely limited and
>>              seem to only work for migration. For example, it seems mostly useless for
>>              debugging because retrieving state is rather complex and would
>>              interfere with normal working of the device.
>>
>>          If you want to prove the register controlling interfaces are extremely
>>          limited than admin vq or admin cmds,
>>          you are also proving config space registers are extremely limited than
>>          admin vq.
>>
>>      Yes. Migration needs ability to pass large amounts of data around, and
>>      is too complex a functionality to work reliably without ability to
>>      report errors.
>>
>> what errors? when device DMA?
>> missing some dirty pages? If the device can detect such errors, it can recover
>> by itself,
>> or how can driver fix this?
> Not just pages, there's a lot of internal device state.
>
> You fix for example by reporting that state does not work
> for a current device, and guest can be restarted on migration
> source.
If the re-read fails, the status check has failed: then recover or fail
the migration.
If the migration fails, for example on a timeout, then resume.

If the guest restarts, the hypervisor is aware of that and of the device
reset as well.
What is wrong with this process?
>
>
>> for control path, virtio uses re-read for many years and it works well.
> Let's not even get started with how live migration currently "works
> well".  I happen to be familiar with it intimately.  We tried to
> maintain migration compatibility as best we could and we tend to break it
> every second release.
I mean re-reading to check, like re-reading the device status to make sure
the device is suspended, the same way virtio handles FEATURES_OK. This
"re-read" works well.

>
>
>> I
>> believe we have
>> went through this issue before.
>>   
>>
>>
>>
>>          So the question still here: do you want to replace current virtio-pci common
>>          cfg
>>          with admin vq or admin cmds?
>>
>>      I think we need to add a new transport that will use admin commands.
>>      Which one to use would be up to a specific device.
>>
>> For new device type like SIOV, yes we need a new transport, transport vq.
>>
>> Let's focus on this live migration feature, if there are new features in the
>> future
>> requires admin vq, let's discuss when they proposed.
>>
>>
>>
>>
>>          And debug what? If you want to introduce more functionalities, we should
>>          discuss
>>          case by case.
>>
>>          If debugging vq state, it is as easy as read queue_size, I don't see the
>>          limitations
>>          as queue_size work for years.
>>
>>      No one reads queue_size. In fact for years we didn't have any debugging
>>      functionality and we are fine. If we are adding it, it really needs to
>>      be accessible when driver and device are wedged.
>>
>> OK, I don't disagree to implement new device debugging features.
>>
>> But let's focus on current live migration task.
>>
>>
>>
>>
>>          I still believe our goal is to do our best, with our capabilities, to build
>>          the most optimal virtio spec
>>          as we can do. Not other goals.
>>
>>          Thanks
>>          Zhu Lingshan
>>
>>
>>
>>                  We have proposed to build admin vq based on our register solution, this can
>>                  somehow even help to resolve the nested issue.
>>
>>                  But I see the proposed has been rejected.
>>
>>                  I still believe the goal is to build a best spec, not "just can work" with
>>                  limitations.
>>
>>
>>
>>
>>
>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-24 10:37                                                                     ` Parav Pandit
@ 2023-10-26  6:44                                                                       ` Zhu, Lingshan
  2023-10-26  7:04                                                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-26  6:44 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/24/2023 6:37 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Tuesday, October 24, 2023 4:00 PM
>>
>> On 10/23/2023 6:14 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, October 23, 2023 3:39 PM
>>>>
>>>> On 10/20/2023 8:54 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Friday, October 20, 2023 3:01 PM
>>>>>>
>>>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
>>>>>>>>
>>>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>>>>>>>>>> Oh, really? Quite interesting, do you want to move all config
>>>>>>>>>>> space fields in VF to admin vq? Have a plan?
>>>>>>>>>> Not in my plan for spec 1.4 time frame.
>>>>>>>>>> I do not want to divert the discussion, would like to focus on
>>>>>>>>>> device
>>>>>>>> migration phases.
>>>>>>>>>> Lets please discuss in some other dedicated thread.
>>>>>>>>> Possibly, if there's a way to send admin commands to vf itself
>>>>>>>>> then Lingshan will be happy?
>>>>>>>> still need to prove why admin commands are better than registers.
>>>>>>> Virtio spec development is not proof based approach. Please stop
>>>>>>> asking for
>>>> it.
>>>>>>> I tried my best to have technical answer in [1].
>>>>>>> I explained that registers simply do not work for passthrough mode
>>>>>>> (if this is what you are asking when you are asking prove its better).
>>>>>>> They can work for non_passthrough mediated mode.
>>>>>>>
>>>>>>> A member device may do admin commands using registers. Michael and
>>>>>>> I are
>>>>>> discussing presently in the same thread.
>>>>>>> Since there are multiple things to be done for device migration,
>>>>>>> dedicated
>>>>>> register set for each functionality do not scale well, hard to
>>>>>> maintain and extend.
>>>>>>> A register holding a command content make sense.
>>>>>>>
>>>>>>> Now, with that, if this can be useful only for non_passthrough, I
>>>>>>> made humble
>>>>>> request to transport them using AQ, this way, you get all benefits of AQ.
>>>>>>> And trying to understand, why AQ cannot possible or inferior?
>>>>>>>
>>>>>>> If you have commands like suspend/resume device, register or queue
>>>>>> transport simply don’t work, because it's wrong to bifurcate the
>>>>>> device with such weird API.
>>>>>>> If you want to bifurcate for mediation software, it probably
>>>>>>> makes sense to
>>>>>> operate at each VQ level, config space level. Such are very
>>>>>> different commands than passthrough.
>>>>>>> I think vdpa has demonstrated that very well on how to do specific
>>>>>>> work for
>>>>>> specific device type. So some of those work can be done using AQ.
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
>>>>>> We have been through your statement for many times.
>>>>>> This is not about how many times you repeated, if you think this is
>>>>>> true, you need to prove that with solid evidence.
>>>>>>
>>>>> I will not respond to this comment anymore.
>>>> Ok if you choose not to respond.
>>>>>> For pass-through, I still recommend you to take a reference of
>>>>>> current virtio-pci implementation, it works for pass-through, right?
>>>>> What do you mean by current virtio-pci implementation?
>>>> current virtio-pci works for pass-through
>>> I still don’t understand what is "current virtio-pci".
>>> Do you mean qemu implementation of emulated virtio-pci or you mean
>> virtio-pci specification for passthrough?
>>> What do you want me to refer to for passthrough? Please clarify.
>> you know guest vcpu and its vRC can not access host side devices, and there
>> must be a driver helping the pass-through use cases, like vDPA and vfio
> I am not sure how to correlate this answer to the question of "virtio-pci for passthrough".
> :(
>
> Today when a virtio-pci member device is passthrough to the guest VM, hypervisor is not involved in virtio interface such as config space, cvq, data vq etc.
> Do you agree?
Can a vCPU access the host-side device config space? It needs a
pass-through helper driver like vfio, right?
>
>>>>>> For scale, I already told you for many times that they are
>>>>>> per-device facilities. How can a per-device facility not scale?
>>>>> Each VF device must implement new set of on-chip memory-based
>>>>> registers
>>>> which demands more power, die area which does not scale efficiently
>>>> to thousands of VFs.
>>>> that can be fpga gates or SOC implementing new features, you think
>>>> that is a waste?
>>> It is waste in hw, if there is a better approach possible to not burn them as
>> gates and save on resources for rarely used items.
>> Is a new entry in MSIX table a waste of HW?
> Not as much as existing MSI-X table entries, which require a linear amount of on-chip memory.
Anyway, even a single MSI-X entry costs more HW resources than all the
new registers in my proposal.
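
For a rough sense of the per-VF cost being argued about here, a small arithmetic sketch follows. An MSI-X table entry is 16 bytes per the PCI specification (message address low and high, message data, vector control). The VF count and the size of the new per-VF register block are assumed figures chosen for illustration, not numbers from either proposal, and bytes of storage are only a proxy for gates, power, or die area.

```python
MSIX_ENTRY_BYTES = 16  # per PCI spec: address low/high, data, vector control


def per_vf_bytes(num_vfs, bytes_per_vf):
    """On-device storage that has to be replicated in every VF."""
    return num_vfs * bytes_per_vf


# Assumed figures, for illustration only.
NUM_VFS = 1000
NEW_REGISTER_BYTES = 8  # hypothetical size of a per-VF migration register block

msix_total = per_vf_bytes(NUM_VFS, MSIX_ENTRY_BYTES)
regs_total = per_vf_bytes(NUM_VFS, NEW_REGISTER_BYTES)
print(msix_total, regs_total)  # 16000 8000
```

Both quantities grow linearly with the number of VFs; the sketch only compares their slopes under the assumed sizes.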
>
>> Can I say implementing admin vq in SOC is a waste of cores?
> Which cores in the SoC?
> If it is on the PF, there is only handful of AQs for scale of N VFs.
I see you got the point anyway: new features cost extra resources.
>
>>>
>>>>>> vDPA works fine on config space.
>>>>>>
>>>>>> So, if you still insist admin vq is better than config space like
>>>>>> in other thread you have concluded, you may imply that config space
>>>>>> interfaces should be re-factored to admin vq.
>>>>> Whatever is done in past is done, there is no way to change history.
>>>>> Any new non-init-time registers should not be placed in device
>>>>> specific config
>>>> space as virtio spec has clear guideline on it for good.
>>>>> Device context reading, dirty page address reading, changing vf
>>>>> device modes,
>>>> all of these are clearly not a init time settings.
>>>>> Hence, they do not belong to the registers.
>>>> reset vq? and you get it from Appendix B. Creating New Device Types,
>>>> are we implementing a new type of device???
>>> I don’t understand your question.
>>> I replied the history of reset_vq.
>>> Take good examples to follow, reset_vq clearly is not the one.
>> so again, we are not implementing new device type, so your citation doesn't
>> apply.
> I disagree.
> I am engineer to build practical systems considering limitations and also advancements of the transport; while listening to other industry efforts,
> I am not from the legal department.
> Hence, Appendix B makes a sense to me to apply to the existing device which also has the section for "device improvements".
It is titled "new device", and I think this discussion is nonsense. So
if you want to fix this statement, that works for me.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  6:44                                                                       ` Zhu, Lingshan
@ 2023-10-26  7:04                                                                         ` Parav Pandit
  2023-10-30  3:44                                                                           ` Zhu, Lingshan
  2023-10-30 11:27                                                                           ` Michael S. Tsirkin
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-26  7:04 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, October 26, 2023 12:14 PM
> 
> 
> On 10/24/2023 6:37 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Tuesday, October 24, 2023 4:00 PM
> >>
> >> On 10/23/2023 6:14 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Monday, October 23, 2023 3:39 PM
> >>>>
> >>>> On 10/20/2023 8:54 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Friday, October 20, 2023 3:01 PM
> >>>>>>
> >>>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
> >>>>>>>>
> >>>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> >>>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> >>>>>>>>>>> Oh, really? Quite interesting, do you want to move all
> >>>>>>>>>>> config space fields in VF to admin vq? Have a plan?
> >>>>>>>>>> Not in my plan for spec 1.4 time frame.
> >>>>>>>>>> I do not want to divert the discussion, would like to focus
> >>>>>>>>>> on device
> >>>>>>>> migration phases.
> >>>>>>>>>> Lets please discuss in some other dedicated thread.
> >>>>>>>>> Possibly, if there's a way to send admin commands to vf itself
> >>>>>>>>> then Lingshan will be happy?
> >>>>>>>> still need to prove why admin commands are better than registers.
> >>>>>>> Virtio spec development is not proof based approach. Please stop
> >>>>>>> asking for
> >>>> it.
> >>>>>>> I tried my best to have technical answer in [1].
> >>>>>>> I explained that registers simply do not work for passthrough
> >>>>>>> mode (if this is what you are asking when you are asking prove its
> better).
> >>>>>>> They can work for non_passthrough mediated mode.
> >>>>>>>
> >>>>>>> A member device may do admin commands using registers. Michael
> >>>>>>> and I are
> >>>>>> discussing presently in the same thread.
> >>>>>>> Since there are multiple things to be done for device migration,
> >>>>>>> dedicated
> >>>>>> register set for each functionality do not scale well, hard to
> >>>>>> maintain and extend.
> >>>>>>> A register holding a command content make sense.
> >>>>>>>
> >>>>>>> Now, with that, if this can be useful only for non_passthrough,
> >>>>>>> I made humble
> >>>>>> request to transport them using AQ, this way, you get all benefits of AQ.
> >>>>>>> And trying to understand, why AQ cannot possible or inferior?
> >>>>>>>
> >>>>>>> If you have commands like suspend/resume device, register or
> >>>>>>> queue
> >>>>>> transport simply don’t work, because it's wrong to bifurcate the
> >>>>>> device with such weird API.
> >>>>>>> If you want to bifurcate for mediation software, it probably
> >>>>>>> makes sense to
> >>>>>> operate at each VQ level, config space level. Such are very
> >>>>>> different commands than passthrough.
> >>>>>>> I think vdpa has demonstrated that very well on how to do
> >>>>>>> specific work for
> >>>>>> specific device type. So some of those work can be done using AQ.
> >>>>>>> [1]
> >>>>>>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> >>>>>> We have been through your statement for many times.
> >>>>>> This is not about how many times you repeated, if you think this
> >>>>>> is true, you need to prove that with solid evidence.
> >>>>>>
> >>>>> I will not respond to this comment anymore.
> >>>> Ok if you choose not to respond.
> >>>>>> For pass-through, I still recommend you to take a reference of
> >>>>>> current virtio-pci implementation, it works for pass-through, right?
> >>>>> What do you mean by current virtio-pci implementation?
> >>>> current virtio-pci works for pass-through
> >>> I still don’t understand what is "current virtio-pci".
> >>> Do you mean qemu implementation of emulated virtio-pci or you mean
> >> virtio-pci specification for passthrough?
> >>> What do you want me to refer to for passthrough? Please clarify.
> >> you know guest vcpu and its vRC can not access host side devices, and
> >> there must be a driver helping the pass-through use cases, like vDPA
> >> and vfio
> > I am not sure how to correlate this answer to the question of "virtio-pci for
> passthrough".
> > :(
> >
> > Today when a virtio-pci member device is passthrough to the guest VM,
> hypervisor is not involved in virtio interface such as config space, cvq, data vq
> etc.
> > Do you agree?
You didn’t respond yet to this question.
Can you please respond?

> Can vCPU access host side device config space? It needs a pass-through helper
> driver like vfio, right?
Right.
But if you are implying that, because the generic PCI config space is intercepted, all virtio common and device-specific configuration MUST ALWAYS be intercepted as well, then I do not agree with that derivation.

The main reasons are:
1. It breaks the future TDISP model
2. Without the hypervisor getting involved, all of the member device's MMIO space is accessible, which follows the efficiency and equivalence principles of the paper Jason cited

I hope you are not implying that the virtio interfaces (which are not listed in the PCI spec) should be trapped and emulated in the hypervisor for member passthrough devices.

> >
> >>>>>> For scale, I already told you for many times that they are
> >>>>>> per-device facilities. How can a per-device facility not scale?
> >>>>> Each VF device must implement new set of on-chip memory-based
> >>>>> registers
> >>>> which demands more power, die area which does not scale efficiently
> >>>> to thousands of VFs.
> >>>> that can be fpga gates or SOC implementing new features, you think
> >>>> that is a waste?
> >>> It is waste in hw, if there is a better approach possible to not
> >>> burn them as
> >> gates and save on resources for rarely used items.
> >> Is a new entry in MSIX table a waste of HW?
> > Not as much as existing MSI-X table entries, which require a linear amount of
> on-chip memory.
> anyway, even only one MSIX entry cost my HW resource than the amount of
> new registers in my proposal.
Yes, this is why new proposals to improve on MSI-X are on the table; the first approach known to me was IMS from Intel.
Hence, virtio has already learnt this lesson, as seen in the Appendix: do not keep adding non-init-time registers.

> >
> >> Can I say implementing admin vq in SOC is a waste of cores?
> > Which cores in the SoC?
> > If it is on the PF, there is only handful of AQs for scale of N VFs.
> I see you got the point anyway, new features cost extra resource
> >
> >>>
> >>>>>> vDPA works fine on config space.
> >>>>>>
> >>>>>> So, if you still insist admin vq is better than config space like
> >>>>>> in other thread you have concluded, you may imply that config
> >>>>>> space interfaces should be re-factored to admin vq.
> >>>>> Whatever is done in past is done, there is no way to change history.
> >>>>> Any new non-init-time registers should not be placed in device
> >>>>> specific config
> >>>> space as virtio spec has clear guideline on it for good.
> >>>>> Device context reading, dirty page address reading, changing vf
> >>>>> device modes,
> >>>> all of these are clearly not a init time settings.
> >>>>> Hence, they do not belong to the registers.
> >>>> reset vq? and you get it from Appendix B. Creating New Device
> >>>> Types, are we implementing a new type of device???
> >>> I don’t understand your question.
> >>> I replied the history of reset_vq.
> >>> Take good examples to follow, reset_vq clearly is not the one.
> >> so again, we are not implementing new device type, so your citation
> >> doesn't apply.
> > I disagree.
> > I am engineer to build practical systems considering limitations and
> > also advancements of the transport; while listening to other industry efforts, I
> am not from the legal department.
> > Hence, Appendix B makes a sense to me to apply to the existing device which
> also has the section for "device improvements".
> it titled as "new device", and I think this discussion is non-sense. So if you want
> to fix this statement, works for me.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  7:04                                                                         ` Parav Pandit
@ 2023-10-30  3:44                                                                           ` Zhu, Lingshan
  2023-10-30  4:17                                                                             ` Parav Pandit
  2023-10-30 11:27                                                                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-30  3:44 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/26/2023 3:04 PM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, October 26, 2023 12:14 PM
>>
>>
>> On 10/24/2023 6:37 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Tuesday, October 24, 2023 4:00 PM
>>>>
>>>> On 10/23/2023 6:14 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Monday, October 23, 2023 3:39 PM
>>>>>>
>>>>>> On 10/20/2023 8:54 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Friday, October 20, 2023 3:01 PM
>>>>>>>>
>>>>>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
>>>>>>>>>>
>>>>>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>>>>>>>>>>>> Oh, really? Quite interesting, do you want to move all
>>>>>>>>>>>>> config space fields in VF to admin vq? Have a plan?
>>>>>>>>>>>> Not in my plan for spec 1.4 time frame.
>>>>>>>>>>>> I do not want to divert the discussion, would like to focus
>>>>>>>>>>>> on device
>>>>>>>>>> migration phases.
>>>>>>>>>>>> Lets please discuss in some other dedicated thread.
>>>>>>>>>>> Possibly, if there's a way to send admin commands to vf itself
>>>>>>>>>>> then Lingshan will be happy?
>>>>>>>>>> still need to prove why admin commands are better than registers.
>>>>>>>>> Virtio spec development is not proof based approach. Please stop
>>>>>>>>> asking for
>>>>>> it.
>>>>>>>>> I tried my best to have technical answer in [1].
>>>>>>>>> I explained that registers simply do not work for passthrough
>>>>>>>>> mode (if this is what you are asking when you are asking prove its
>> better).
>>>>>>>>> They can work for non_passthrough mediated mode.
>>>>>>>>>
>>>>>>>>> A member device may do admin commands using registers. Michael
>>>>>>>>> and I are
>>>>>>>> discussing presently in the same thread.
>>>>>>>>> Since there are multiple things to be done for device migration,
>>>>>>>>> dedicated
>>>>>>>> register set for each functionality do not scale well, hard to
>>>>>>>> maintain and extend.
>>>>>>>>> A register holding a command content make sense.
>>>>>>>>>
>>>>>>>>> Now, with that, if this can be useful only for non_passthrough,
>>>>>>>>> I made humble
>>>>>>>> request to transport them using AQ, this way, you get all benefits of AQ.
>>>>>>>>> And trying to understand, why AQ cannot possible or inferior?
>>>>>>>>>
>>>>>>>>> If you have commands like suspend/resume device, register or
>>>>>>>>> queue
>>>>>>>> transport simply don’t work, because it's wrong to bifurcate the
>>>>>>>> device with such weird API.
>>>>>>>>> If you want to bifurcate for mediation software, it probably
>>>>>>>>> makes sense to
>>>>>>>> operate at each VQ level, config space level. Such are very
>>>>>>>> different commands than passthrough.
>>>>>>>>> I think vdpa has demonstrated that very well on how to do
>>>>>>>>> specific work for
>>>>>>>> specific device type. So some of those work can be done using AQ.
>>>>>>>>> [1]
>>>>>>>>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
>>>>>>>> We have been through your statement for many times.
>>>>>>>> This is not about how many times you repeated, if you think this
>>>>>>>> is true, you need to prove that with solid evidence.
>>>>>>>>
>>>>>>> I will not respond to this comment anymore.
>>>>>> Ok if you choose not to respond.
>>>>>>>> For pass-through, I still recommend you to take a reference of
>>>>>>>> current virtio-pci implementation, it works for pass-through, right?
>>>>>>> What do you mean by current virtio-pci implementation?
>>>>>> current virtio-pci works for pass-through
>>>>> I still don’t understand what is "current virtio-pci".
>>>>> Do you mean qemu implementation of emulated virtio-pci or you mean
>>>> virtio-pci specification for passthrough?
>>>>> What do you want me to refer to for passthrough? Please clarify.
>>>> you know guest vcpu and its vRC can not access host side devices, and
>>>> there must be a driver helping the pass-through use cases, like vDPA
>>>> and vfio
>>> I am not sure how to correlate this answer to the question of "virtio-pci for
>> passthrough".
>>> :(
>>>
>>> Today when a virtio-pci member device is passthrough to the guest VM,
>> hypervisor is not involved in virtio interface such as config space, cvq, data vq
>> etc.
>>> Do you agree?
> You didn’t respond yet to this question.
> Can you please respond?
I am not sure which question you are referring to that was not answered.
Agree to what? Please don't cut off the thread until the issue is closed.

If you are asking whether the hypervisor is involved in accessing virtio
interfaces: for passthrough, the guest needs a host-side helper driver to
access the hardware, as explained below.
>
>> Can vCPU access host side device config space? It needs a pass-through helper
>> driver like vfio, right?
> Right.
> But if you are implying that, because the generic PCI config space is intercepted, all virtio common and device-specific configuration MUST ALWAYS be intercepted as well, then I do not agree with that derivation.
This is not only about virtio; the guest needs a helper to access host devices.
>
> The main reasons are:
> 1. It breaks the future TDISP model
I am not sure why you bring up TDISP again; I thought we agreed this was closed.

How does it break TDISP? Can you let the guest driver access the host device without
a host-side helper like VFIO?

And TDISP says you should not trust the PF, thus you should not use the admin vq
on the PF for live migration.
> 2. Without the hypervisor getting involved, all of the member device's MMIO space is accessible, which follows the efficiency and equivalence principles of the paper Jason cited
>
> I hope you are not implying that the virtio interfaces (which are not listed in the PCI spec) should be trapped and emulated in the hypervisor for member passthrough devices.
Do you agree that mmap'ing the BARs (interfaces) without doing anything else
is also a type of "trap and emulate"?
>
>>>>>>>> For scale, I already told you for many times that they are
>>>>>>>> per-device facilities. How can a per-device facility not scale?
>>>>>>> Each VF device must implement new set of on-chip memory-based
>>>>>>> registers
>>>>>> which demands more power, die area which does not scale efficiently
>>>>>> to thousands of VFs.
>>>>>> that can be fpga gates or SOC implementing new features, you think
>>>>>> that is a waste?
>>>>> It is waste in hw, if there is a better approach possible to not
>>>>> burn them as
>>>> gates and save on resources for rarely used items.
>>>> Is a new entry in MSIX table a waste of HW?
>>> Not as much as existing MSI-X table entries, which require a linear amount of
>> on-chip memory.
>> anyway, even only one MSIX entry cost my HW resource than the amount of
>> new registers in my proposal.
> Yes, this is why new proposals to improve on MSI-X are on the table; the first approach known to me was IMS from Intel.
> Hence, virtio has already learnt this lesson, as seen in the Appendix: do not keep adding non-init-time registers.
Nonsense to me; IMS still uses MSI.
>
>>>> Can I say implementing admin vq in SOC is a waste of cores?
>>> Which cores in the SoC?
>>> If it is on the PF, there is only handful of AQs for scale of N VFs.
>> I see you got the point anyway, new features cost extra resource
>>>>>>>> vDPA works fine on config space.
>>>>>>>>
>>>>>>>> So, if you still insist admin vq is better than config space like
>>>>>>>> in other thread you have concluded, you may imply that config
>>>>>>>> space interfaces should be re-factored to admin vq.
>>>>>>> Whatever is done in past is done, there is no way to change history.
>>>>>>> Any new non-init-time registers should not be placed in device
>>>>>>> specific config
>>>>>> space as virtio spec has clear guideline on it for good.
>>>>>>> Device context reading, dirty page address reading, changing vf
>>>>>>> device modes,
>>>>>> all of these are clearly not a init time settings.
>>>>>>> Hence, they do not belong to the registers.
>>>>>> reset vq? and you get it from Appendix B. Creating New Device
>>>>>> Types, are we implementing a new type of device???
>>>>> I don’t understand your question.
>>>>> I replied the history of reset_vq.
>>>>> Take good examples to follow, reset_vq clearly is not the one.
>>>> so again, we are not implementing new device type, so your citation
>>>> doesn't apply.
>>> I disagree.
>>> I am engineer to build practical systems considering limitations and
>>> also advancements of the transport; while listening to other industry efforts, I
>> am not from the legal department.
>>> Hence, Appendix B makes a sense to me to apply to the existing device which
>> also has the section for "device improvements".
>> it titled as "new device", and I think this discussion is non-sense. So if you want
>> to fix this statement, works for me.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  3:58                                                                             ` Parav Pandit
@ 2023-10-30  3:59                                                                               ` Jason Wang
  2023-10-30  4:49                                                                                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-30  3:59 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


On 2023/10/26 11:58, Parav Pandit wrote:
>
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Thursday, October 26, 2023 6:27 AM
>>
>> On Wed, Oct 25, 2023 at 4:33 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Tue, Oct 24, 2023 at 06:27:04PM +0800, Zhu, Lingshan wrote:
>>>>
>>>> On 10/23/2023 7:32 PM, Michael S. Tsirkin wrote:
>>>>
>>>>      On Mon, Oct 23, 2023 at 06:03:10PM +0800, Zhu, Lingshan wrote:
>>>>
>>>>          config space, MMIO, registers work for years, what is wrong with them?
>>>>
>>>>      Nothing as such. They don't seem to be appropriate for all use-case
>>>>      where people want to utilize virtio. I think a new transport
>>>>      will be needed to address these.
>>>>
>>>> New transport for new type of devices for sure, like transport vq for SIOV.
>>>>
>>>> I agree admin vq or admin cmds are useful in some use cases, that is
>>>> another story, should be case by case.
>>>>
>>>> For now, let's don't talk about all-use cases, just for current
>>>> task, for live migration.
>>>>
>>>> So IMHO, I still think we should use config space registers to
>>>> control live migration process.
>>>>
>>>>
>>> No because it forces integrating migration process with device driver.
>>> Which is ok for some use-cases but not all of them.  Find some other
>>> control plane for this.
>>>
>>>
>>>>                  Config space is control path, DMA is data-path, let's better not mix
>> them,
>>>>                  we never expect to use config space to transfer data.
>>>>
>>>>                  So we need DMA to transfer data, for example I take advantages of
>> device DMA
>>>>                  to logging dirty pages, This also applies to in-flight descriptors.
>>>>
>>>>              As long as you do, I personally see little benefit to retrieve parts of
>>>>              state with memory mapped accesses.
>>>>
>>>>          registers only control, and I personally believe a single register is much
>>>>          better
>>>>          than processing admin commands, more light-weight, more reliable,
>> working
>>>>          for years.
>>>>
>>>>      Yea. It would be, if we could do everything through that register.
>>>>      But we can't really. Migration has too much data to pass around
>>>>      for that to be reasonable.
>>>>
>>>> data are not transferred by registers, they only control.
>>>>
>>>> We transfer data by DMA, the device writes DMA dirty pages
>>>> information(bitmap) to host isolated memory region.
>>>>
>>>
>>> If you do that then I don't see any reason not to use admin commands
>>> for that - either through a vq or a simpler interface.
>> I think we need to agree that admin commands are the only interface for any
>> future features before we can have an agreement here.
>>
> For passthrough member devices, these admin commands are to be transported via the owner device by the migration driver, which sits outside of the guest VM.
>
> If this basic requirement is not clear it's pointless to discuss further.


So my question is not limited to devices that are modeled as group/owner.


> How is the passthrough device defined? It is defined as [1].
>
> Can non passthrough device modes such as vdpa also use these admin commands from the owner device?
> Sure, why not?


That's fine but, again, it's not my question.

Thanks


>
> [1] https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
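
To make the quoted point concrete (admin commands for a passthrough member device travel through the owner device, driven by a migration driver outside the guest), here is a minimal sketch. The opcode names, class names and the `migrate` helper are hypothetical and are not taken from the patch series; only the suspend/resume and context read/write functionality comes from the cover letter.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Opcode(Enum):
    # Hypothetical opcode names; the series defines its own commands.
    SUSPEND = auto()
    RESUME = auto()
    CONTEXT_READ = auto()
    CONTEXT_WRITE = auto()


@dataclass
class MemberDevice:
    """A passthrough member device (e.g. a VF) as seen by its owner."""
    member_id: int
    suspended: bool = False
    context: bytes = b""


@dataclass
class OwnerDevice:
    """The group owner (e.g. a PF) that receives admin commands."""
    members: dict = field(default_factory=dict)

    def add_member(self, member_id: int) -> MemberDevice:
        member = MemberDevice(member_id)
        self.members[member_id] = member
        return member

    def admin_cmd(self, opcode: Opcode, member_id: int, data: bytes = b"") -> bytes:
        # Each command names its target member; the guest is not involved.
        member = self.members[member_id]
        if opcode is Opcode.SUSPEND:
            member.suspended = True
        elif opcode is Opcode.RESUME:
            member.suspended = False
        elif opcode is Opcode.CONTEXT_READ:
            return member.context
        elif opcode is Opcode.CONTEXT_WRITE:
            member.context = data
        return b""


def migrate(src: OwnerDevice, dst: OwnerDevice, member_id: int) -> None:
    """Migration driver flow, issued only against the owner devices."""
    src.admin_cmd(Opcode.SUSPEND, member_id)
    ctx = src.admin_cmd(Opcode.CONTEXT_READ, member_id)
    dst.admin_cmd(Opcode.CONTEXT_WRITE, member_id, ctx)
    dst.admin_cmd(Opcode.RESUME, member_id)
```

In this model the member's own virtio interface is never touched by the migration flow, which is the property the passthrough argument relies on. Whether the same commands should also be the interface for non-group/owner devices is the open question in this exchange.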


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  6:22                                                                             ` Michael S. Tsirkin
@ 2023-10-30  4:02                                                                               ` Jason Wang
  2023-11-01  0:33                                                                               ` Jason Wang
  1 sibling, 0 replies; 341+ messages in thread
From: Jason Wang @ 2023-10-30  4:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


On 2023/10/26 14:22, Michael S. Tsirkin wrote:
> On Thu, Oct 26, 2023 at 08:56:47AM +0800, Jason Wang wrote:
>>>> We transfer data by DMA, the device writes DMA dirty pages information(bitmap)
>>>> to host isolated memory region.
>>>>
>>>
>>> If you do that then I don't see any reason not to use admin
>>> commands for that - either through a vq or a simpler
>>> interface.
>> I think we need to agree that admin commands are the only interface
>> for any future features before we can have an agreement here.
> I don't think that needs to be the case. I do think that if
> your goal is a separate channel from normal device operation
> then this is what admin commands have been designed for.
>
>> My understanding is that it is optional for the transport that
>> requires administrative commands like provisioning etc. It is not
>> necessarily the interface for new features.
> Yes. And migration is IMO sufficiently "like provisioning".


To me, this looks more like a transport. Hypervisors can trap and
forward it to the admin virtqueue. This somewhat duplicates the idea of
a transport q over admin.


>
>>>
>>>>          Config space interfaces are fundamental for virtio-pci.
>>>>
>>>>
>>>>      They are in fact fundamental to virtio. Multiple transports to
>>>>      use config space are also fundamental.
>>>>
>>>> I agree. So I also agree to build admin vq live migration solution based on our
>>>> basic facilities, as Jason ever proposed.
>>>
>>> I'm not sure it's even a vq. I suggest a minimal interface to send
>>> admin commands. Could be used by migration, as transport, and more.
>>>
>> It's better if we can do that below the layer of admin commands. For
>> example, we don't stick device status with any specific interface. We
>> can keep doing things like this.
>>
>> Thanks
> Could go either way, but complex functionality like live migration
> can benefit from a rich interface.


That's my understanding as well.

Thanks


>


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  3:50                                                               ` Parav Pandit
@ 2023-10-30  4:04                                                                 ` Jason Wang
  2023-10-30  4:27                                                                   ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-30  4:04 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


On 2023/10/26 11:50, Parav Pandit wrote:
>> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
>> For example, you still haven't succeeded in defining passthrough.
> It was defined on 19th Oct in [1].
> What part is not clear to you in definition of passthrough device?
>
> [1] https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/


Let me copy-paste it again:

For example, assuming you are correct, you still fail to explain

1) what is trapped and what's not, or what's the boundary
2) whether things can work if the hypervisor is not developed with those
assumptions

Thanks


>


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-26  3:45                                                         ` Parav Pandit
@ 2023-10-30  4:06                                                           ` Jason Wang
  2023-10-30  4:46                                                             ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-30  4:06 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, October 26, 2023 6:16 AM
> >
> > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > For passthrough PASID assignment vq is not needed.
> > > >
> > > > How do you know that?
> > > Because for passthrough, the hypervisor is not involved in dealing with VQ at
> > all.
> >
> > Ok, so if I understand correctly, you are saying your design can't work for the
> > case of PASID assignment.
> >
> No. PASID assignment will happen from the guest for its own use and device migration will just work fine because device context will capture this.

It's not about device context. We're discussing "passthrough", no?

You want all virtio stuff to be "passthrough", but assigning a PASID
to a specific virtqueue in the guest must be trapped.

>
> > >
> > > > There are works ongoing to make vPASID work for the guest like vSVA.
> > > > Virtio doesn't differ from other devices.
> > > Passthrough do not run like SVA.
> >
> > Great, you find another limitation of "passthrough" by yourself.
> >
> No. it is not the limitation it is just the way it does not need complex SVA to split the device for unrelated usage.

How can you limit the user in the guest to not use vSVA?

>
> > > Each passthrough device has PASID from its own space fully managed by the
> > guest.
> > > Some cpu required vPASID and SIOV is not going this way anymore.
> >
> > Then how to migrate? Invent a full set of something else through another giant
> > series like this to migrate to the SIOV thing? That's a mess for sure.
> >
> SIOV will for sure reuse most or all parts of this work, almost entirely as_is.
> vPASID is cpu/platform specific things not part of the SIOV devices.
>
> > >
> > > >
> > > > > If at all it is done, it will be done from the guest by the driver
> > > > > using virtio
> > > > interface.
> > > >
> > > > Then you need to trap. Such things couldn't be passed through to guests
> > directly.
> > > >
> > > Only PASID capability is trapped. PASID allocation and usage is directly from
> > guest.
> >
> > How can you achieve this? Assigning a PASID to a device is completely
> > device(virtio) specific. How can you use a general layer without the knowledge
> > of virtio to trap that?
> When one wants to map vPASID to pPASID a platform needs to be involved.

I'm not talking about how to map vPASID to pPASID, it's out of the
scope of virtio. I'm talking about assigning a vPASID to a specific
virtqueue or other virtio function in the guest.

You need a virtio-specific queue or capability to assign a PASID to a
specific virtqueue, and that can't be done without trapping and
without virtio-specific knowledge.

> When virtio passthrough device is in guest, it has all its PASID accessible.
>
> All these is large deviation from current discussion of this series, so I will keep it short.
>
> >
> > > Regardless it is not relevant to passthrough mode as PASID is yet another
> > resource.
> > > And for some cpu if it is trapped, it is generic layer, that does not require virtio
> > involvement.
> > > So virtio interface asking to trap something because generic facility has done
> > in not the approach.
> >
> > This misses the point of PASID. How to use PASID is totally device specific.
> Sure, and how to virtualize vPASID/pPASID is platform specific as single PASID can be used by multiple devices and process.

See above, I think we're talking about different things.

>
> >
> > >
> > > > > Capabilities of #2 is generic across all pci devices, so it will
> > > > > be handled by the
> > > > HV.
> > > > > ATS/PRI cap is also generic manner handled by the HV and PCI device.
> > > >
> > > > No, ATS/PRI requires the cooperation from the vIOMMU. You can simply
> > > > do ATS/PRI passthrough but with an emulated vIOMMU.
> > > And that is not the reason for virtio device to build trap+emulation for
> > passthrough member devices.
> >
> > vIOMMU is emulated by hypervisor with a PRI queue,
> PRI requests arrive on the PF for the VF.

Shouldn't it arrive at platform IOMMU first? The path should be PRI ->
RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest IOMMU.

And things will be more complicated when (v)PASID is used. So you
can't simply let PRI go directly to the guest with the current
architecture.

>
> > how can you pass
> > through a hardware PRI request to a guest directly without trapping it then?
> > What's more, PCIE allows the PRI to be done in a vendor (virtio) specific way, so
> > you want to break this rule? Or you want to blacklist ATS/PRI for virtio?
> >
> I was aware of only pci-sig way of PRI.
> Do you have a reference to the ECN that enables vendor specific way of PRI? I would like to read it.

I mean it doesn't forbid us from building a virtio-specific interface for
I/O page fault reporting and recovery.

> This will be very good to eliminate IOMMU PRI limitations.

Probably.

> PRI will directly go to the guest driver, and guest would interact with IOMMU to service the paging request through IOMMU APIs.

With PASID, it can't go directly.

> For PRI in vendor specific way needs a separate discussion. It is not related to live migration.

PRI itself is not related. But the point is, you can't simply pass
through ATS/PRI now.

Thanks
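
The PRI path described above (PRI -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest IOMMU) can be modelled as a short sketch. The hop names follow the message; the PASID rewrite at the hypervisor hop and the mapping table are assumptions added for illustration, based only on the remark that things get more complicated once (v)PASID is used.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PageRequest:
    pasid: int    # PASID as seen at the current hop
    address: int  # faulting I/O virtual address


# Hops named in the message above, in order.
PRI_PATH = ("RC", "IOMMU", "host", "hypervisor", "vIOMMU PRI", "guest IOMMU")


def deliver(req: PageRequest, ppasid_to_vpasid: dict):
    """Walk a page request along PRI_PATH and return (request, trace)."""
    trace = []
    for hop in PRI_PATH:
        if hop == "hypervisor":
            # Assumed step: translate the physical PASID into the
            # guest-visible one before injecting into the vIOMMU.
            req = replace(req, pasid=ppasid_to_vpasid[req.pasid])
        trace.append(hop)
    return req, trace
```

In this model the request cannot reach the guest IOMMU without passing the hypervisor hop, which is the sense in which ATS/PRI is not a pure passthrough today.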


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30  3:44                                                                           ` Zhu, Lingshan
@ 2023-10-30  4:17                                                                             ` Parav Pandit
  2023-10-30 10:02                                                                               ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-30  4:17 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, October 30, 2023 9:15 AM
> 
> On 10/26/2023 3:04 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Thursday, October 26, 2023 12:14 PM
> >>
> >>
> >> On 10/24/2023 6:37 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Tuesday, October 24, 2023 4:00 PM
> >>>>
> >>>> On 10/23/2023 6:14 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Monday, October 23, 2023 3:39 PM
> >>>>>>
> >>>>>> On 10/20/2023 8:54 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Friday, October 20, 2023 3:01 PM
> >>>>>>>>
> >>>>>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
> >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
> >>>>>>>>>>
> >>>>>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
> >>>>>>>>>>>>> Oh, really? Quite interesting, do you want to move all
> >>>>>>>>>>>>> config space fields in VF to admin vq? Have a plan?
> >>>>>>>>>>>> Not in my plan for spec 1.4 time frame.
> >>>>>>>>>>>> I do not want to divert the discussion, would like to focus
> >>>>>>>>>>>> on device
> >>>>>>>>>> migration phases.
> >>>>>>>>>>>> Lets please discuss in some other dedicated thread.
> >>>>>>>>>>> Possibly, if there's a way to send admin commands to vf
> >>>>>>>>>>> itself then Lingshan will be happy?
> >>>>>>>>>> still need to prove why admin commands are better than registers.
> >>>>>>>>> Virtio spec development is not proof based approach. Please
> >>>>>>>>> stop asking for
> >>>>>> it.
> >>>>>>>>> I tried my best to have technical answer in [1].
> >>>>>>>>> I explained that registers simply do not work for passthrough
> >>>>>>>>> mode (if this is what you are asking when you are asking prove
> >>>>>>>>> its
> >> better).
> >>>>>>>>> They can work for non_passthrough mediated mode.
> >>>>>>>>>
> >>>>>>>>> A member device may do admin commands using registers. Michael
> >>>>>>>>> and I are
> >>>>>>>> discussing presently in the same thread.
> >>>>>>>>> Since there are multiple things to be done for device
> >>>>>>>>> migration, dedicated
> >>>>>>>> register set for each functionality do not scale well, hard to
> >>>>>>>> maintain and extend.
> >>>>>>>>> A register holding a command content make sense.
> >>>>>>>>>
> >>>>>>>>> Now, with that, if this can be useful only for
> >>>>>>>>> non_passthrough, I made humble
> >>>>>>>> request to transport them using AQ, this way, you get all benefits of
> AQ.
> >>>>>>>>> And trying to understand, why AQ cannot possible or inferior?
> >>>>>>>>>
> >>>>>>>>> If you have commands like suspend/resume device, register or
> >>>>>>>>> queue
> >>>>>>>> transport simply don’t work, because it's wrong to bifurcate
> >>>>>>>> the device with such weird API.
> >>>>>>>>> If you want to biferacate for mediation software, it probably
> >>>>>>>>> makes sense to
> >>>>>>>> operate at each VQ level, config space level. Such are very
> >>>>>>>> different commands than passthrough.
> >>>>>>>>> I think vdpa has demonstrated that very well on how to do
> >>>>>>>>> specific work for
> >>>>>>>> specific device type. So some of those work can be done using AQ.
> >>>>>>>>> [1] https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> >>>>>>>> We have been through your statement for many times.
> >>>>>>>> This is not about how many times you repeated, if you think
> >>>>>>>> this is true, you need to prove that with solid evidence.
> >>>>>>>>
> >>>>>>> I will not respond to this comment anymore.
> >>>>>> Ok if you choose not to respond.
> >>>>>>>> For pass-through, I still recommend you to take a reference of
> >>>>>>>> current virtio-pci implementation, it works for pass-through, right?
> >>>>>>> What do you mean by current virtio-pci implementation?
> >>>>>> current virtio-pci works for pass-through
> >>>>> I still don’t understand what is "current virtio-pci".
> >>>>> Do you mean qemu implementation of emulated virtio-pci or you mean
> >>>> virtio-pci specification for passthrough?
> >>>>> What do you want me to refer to for passthrough? Please clarify.
> >>>> you know guest vcpu and its vRC can not access host side devices,
> >>>> and there must be a driver helping the pass-through use cases, like
> >>>> vDPA and vfio
> >>> I am not sure how to corelate this answer to the question of
> >>> "virtio-pci for
> >> passthrough".
> >>> :(
> >>>
> >>> Today when a virtio-pci member device is passthrough to the guest
> >>> VM,
> >> hypervisor is not involved in virtio interface such as config space,
> >> cvq, data vq etc.
> >>> Do you agree?
> > You didn’t respond yet to this question.
> > Can you please respond?
> Not sure which question you refer to that not answered, Agree what?
What is listed above.

> Please don't cut off the thread until
> the issue closed.
>
I didn’t cut off the thread. Please check your email client.
 
> If you are asking whether hypervisor is involved in accessing virtio interfaces.
> For passthrough, guest needs a host side help driver to access hardware, and
> explained below.
That is not an accurate answer. Please answer the question above.
Repeating the question:
For a passthrough device, virtio interfaces such as the common and device config space, cvq, and data vqs are NOT accessed by the hypervisor.
Do you agree?


> >
> >> Can vCPU access host side device config space? It needs a
> >> pass-through helper driver like vfio, right?
> > Right.
> > And if you are implying that, because generic pci config space is intercepted
> hence, all virtio common and device specific things MUST BE ALWAYS
> intercepted as well.
> > Then I do not agree with such derivation.
> This is not only virtio, guest needs a helper to access host devices.
This does not make sense until you reply to the question above.

> >
> > The main reasons are:
> > 1. It breaks the future TDISP model
> Not sure why you bring TDISP again, I thought we agree this is closed.
> 
Closed in the sense of not including TDISP in the current spec, but the mechanism/infrastructure being built applies to the future mode as well.

> How does it break TDISP? Can you let the guest driver access the host device without a host
> side helper like VFIO?
Yes, once it is passthrough, the virtio interface is a secure channel.
In TDISP, config space is still communicated via the hypervisor, and it carries only data that is not critical.
Hence, no virtio registers must be placed there.
In the future, if config space turns out to be problematic, a generic solution will be found for all PCI devices, not just virtio.

> And TDISP says you should not trust PF, thus you should not use admin vq on PF
> for live migration.
There are a few options, which will evolve.
1. The PF will be handed over to the TVM instead of the hypervisor.
2. PF AQ communication will be encrypted and hence not visible to the hypervisor; this is already supported by PCI-SIG.
3. Some other options.
Since this is a generic solution across virtio and non-virtio devices, we can rely on the wider wisdom of PCI-SIG.

> > 2. Without hypervisor getting involved, all the member device MMIO
> > space is accessible which follows the efficiency and equivalency
> > principle of Jason listed paper
> >
> > I hope you are not implying to trap+emulate virtio interfaces (which is not
> listed in the pci-spec) in hypervisor for member passthrough devices.
> Do you agree mmap the bars(interfaces) without doing anything is also a type of
> "trap and emulate"?
Certainly not.
Memory mapping enables the guest to _directly_ communicate with the device without any VM exits.
In the TDISP world this is even secured already.
So no, it is not trap and emulate.

> >
> >>>>>>>> For scale, I already told you for many times that they are
> >>>>>>>> per-device facilities. How can a per-device facility not scale?
> >>>>>>> Each VF device must implement new set of on-chip memory-based
> >>>>>>> registers
> >>>>>> which demands more power, die area which does not scale
> >>>>>> efficiently to thousands of VFs.
> >>>>>> that can be fpga gates or SOC implementing new features, you
> >>>>>> think that is a waste?
> >>>>> It is waste in hw, if there is a better approach possible to not
> >>>>> burn them as
> >>>> gates and save on resources for rarely used items.
> >>>> Is a new entry in MSIX table a waste of HW?
> >>> Not as much as existing MSI-X table entries which requires linear
> >>> amount of
> >> on-chip memory.
> >> anyway, even only one MSIX entry costs more HW resources than the amount
> >> of new registers in my proposal.
> > Yes, this is why new MSI-X proposals are on table to improve, the first known
> approach to me was from Intel using IMS.
> > Hence, virtio already learnt it seen in the Appendix to not keep adding non init
> time registers.
> non-sense to me, IMS still uses MSI

Clearly not.
Maybe you missed something.

IMS enables one to use non-register storage for the interrupt message store, unlike MSI/MSI-X.

Please see the commit log comment; the snippet here mentions "queue memory".

       - The interrupt chip must provide the following optional callbacks
         when the irq_mask(), irq_unmask() and irq_write_msi_msg() callbacks
         cannot operate directly on hardware, e.g. in the case that the
         interrupt message store is in queue memory:

The IRQ chip callback irq_write_msi_msg() has no such limitation of storing the message in registers.

> >
> >>>> Can I say implementing admin vq in SOC is a waste of cores?
> >>> Which cores in the SoC?
> >>> If it is on the PF, there is only handful of AQs for scale of N VFs.
> >> I see you got the point anyway, new features cost extra resource
> >>>>>>>> vDPA works fine on config space.
> >>>>>>>>
> >>>>>>>> So, if you still insist admin vq is better than config space
> >>>>>>>> like in other thread you have concluded, you may imply that
> >>>>>>>> config space interfaces should be re-factored to admin vq.
> >>>>>>> Whatever is done in past is done, there is no way to change history.
> >>>>>>> Any new non init time registers should not be placed in device
> >>>>>>> specific config
> >>>>>> space as virtio spec has clear guideline on it for good.
> >>>>>>> Device context reading, dirty page address reading, changing vf
> >>>>>>> device modes,
> >>>>>> all of these are clearly not a init time settings.
> >>>>>>> Hence, they do not belong to the registers.
> >>>>>> reset vq? and you get it from Appendix B. Creating New Device
> >>>>>> Types, are we implementing a new type of device???
> >>>>> I don’t understand your question.
> >>>>> I replied the history of reset_vq.
> >>>>> Take good examples to follow, reset_vq clearly is not the one.
> >>>> so again, we are not implementing new device type, so your citation
> >>>> doesn't apply.
> >>> I disagree.
> >>> I am engineer to build practical systems considering limitations and
> >>> also advancements of the transport; while listening to other
> >>> industry efforts, I
> >> am not from legal department.
> >>> Hence, Appendix B makes a sense to me to apply to the existing
> >>> device which
> >> also has the section for "device improvements".
> >> it titled as "new device", and I think this discussion is non-sense.
> >> So if you want to fix this statement, works for me.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30  4:04                                                                 ` Jason Wang
@ 2023-10-30  4:27                                                                   ` Parav Pandit
  2023-10-31  1:36                                                                     ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-30  4:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, October 30, 2023 9:35 AM
> 
> On 2023/10/26 11:50, Parav Pandit wrote:
> >> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> >> For example, you still haven't succeeded in defining passthrough.
> > It was defined on 19th Oct in [1].
> > What part is not clear to you in definition of passthrough device?
> >
> > [1] https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> 
> 
> Let me copy-paste it again:
> 
> For example, assuming you are correct, you still fail to explain
> 
> 1) what is trapped and what's not, or what's the boundary
The passthrough definition has been given a few times.
One of the replies is here: https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
I don't know what you mean by 'explain'. What do you want explained?
What is trapped is listed in https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
What is not trapped is also listed in https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
So what more do you want explained there?

> 2) if the hypervisor is not developed with those assumptions, things can work
What is there to explain in #2? :)
Things can be extended when such a hypervisor is born.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-30  4:06                                                           ` Jason Wang
@ 2023-10-30  4:46                                                             ` Parav Pandit
  2023-10-31  1:34                                                               ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-30  4:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> 
> On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Thursday, October 26, 2023 6:16 AM
> > >
> > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > For passthrough PASID assignment vq is not needed.
> > > > >
> > > > > How do you know that?
> > > > Because for passthrough, the hypervisor is not involved in dealing
> > > > with VQ at
> > > all.
> > >
> > > Ok, so if I understand correctly, you are saying your design can't
> > > work for the case of PASID assignment.
> > >
> > No. PASID assignment will happen from the guest for its own use and device
> migration will just work fine because device context will capture this.
> 
> It's not about device context. We're discussing "passthrough", no?
> 
Not sure we are discussing the same thing.
A member device is passed through to the guest, which deals with its own PASIDs and uses the virtio interface to assign a PASID to some VQ.
So the VQ context captured by the hypervisor will have the PASID attached to that VQ.
The device context will be updated accordingly.
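
As a minimal sketch of the claim above, assume the captured device context carries a per-VQ PASID field. The `VqContext` and `DeviceContext` types and the JSON encoding are hypothetical stand-ins chosen for illustration; the actual device context format is whatever the series defines.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class VqContext:
    index: int
    # PASID the guest driver attached to this VQ, if any (hypothetical field).
    pasid: Optional[int] = None


@dataclass
class DeviceContext:
    vqs: List[VqContext] = field(default_factory=list)

    def to_bytes(self) -> bytes:
        """Serialize on the source side (encoding is an assumption)."""
        return json.dumps(asdict(self)).encode()

    @classmethod
    def from_bytes(cls, blob: bytes) -> "DeviceContext":
        """Rebuild the context on the destination side."""
        raw = json.loads(blob.decode())
        return cls(vqs=[VqContext(**vq) for vq in raw["vqs"]])
```

The sketch only shows that a PASID recorded in VQ state survives a capture/restore round trip. It does not address the separate question raised in the thread of whether assigning that PASID can happen without trapping.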

> You want all virtio stuff to be "passthrough", but assigning a PASID to a specific
> virtqueue in the guest must be trapped.
>
No. PASID assignment to a specific virtqueue in the guest must go directly from the guest to the device.
When the guest IOMMU needs to communicate anything for this PASID, it will go through its proper IOMMU channel/hypercall.
The virtio device is not the conduit for this exchange.
 
> >
> > > >
> > > > > There are works ongoing to make vPASID work for the guest like vSVA.
> > > > > Virtio doesn't differ from other devices.
> > > > Passthrough do not run like SVA.
> > >
> > > Great, you find another limitation of "passthrough" by yourself.
> > >
> > No. it is not the limitation it is just the way it does not need complex SVA to
> split the device for unrelated usage.
> 
> How can you limit the user in the guest to not use vSVA?
> 
He he, I am not limiting anything; this is again a misunderstanding or wrong attribution.
I explained that the hypervisor does not need SVA for passthrough.
The guest can do anything it wants with the member device from the guest OS.

> >
> > > > Each passthrough device has PASID from its own space fully managed
> > > > by the
> > > guest.
> > > > Some cpu required vPASID and SIOV is not going this way anmore.
> > >
> > > Then how to migrate? Invent a full set of something else through
> > > another giant series like this to migrate to the SIOV thing? That's a mess for
> sure.
> > >
> > SIOV will for sure reuse most or all parts of this work, almost entirely as_is.
> > vPASID is cpu/platform specific things not part of the SIOV devices.
> >
> > > >
> > > > >
> > > > > > If at all it is done, it will be done from the guest by the
> > > > > > driver using virtio
> > > > > interface.
> > > > >
> > > > > Then you need to trap. Such things couldn't be passed through to
> > > > > guests
> > > directly.
> > > > >
> > > > Only PASID capability is trapped. PASID allocation and usage is
> > > > directly from
> > > guest.
> > >
> > > How can you achieve this? Assigning a PAISD to a device is
> > > completely
> > > device(virtio) specific. How can you use a general layer without the
> > > knowledge of virtio to trap that?
> > When one wants to map vPASID to pPASID a platform needs to be involved.
> 
> I'm not talking about how to map vPASID to pPASID, it's out of the scope of
> virtio. I'm talking about assigning a vPASID to a specific virtqueue or other virtio
> function in the guest.
>
That can be done in the guest. The key is that the guest won't know that it is dealing with a vPASID.
It will follow the same equivalence principle from your paper, where the virtio software layer assigns a PASID to a VQ and communicates it to the device.

Anyway, all of this is just a digression from the current series.
 
> You need a virtio specific queue or capability to assign a PASID to a specific
> virtqueue, and that can't be done without trapping and without virito specific
> knowledge.
> 
I disagree. In future, PASID assignment to a virtqueue from the guest virtio driver to the device is a uniform method,
whether it is the PF assigning a PASID to its own VQ,
or
the VF driver in the guest assigning a PASID to a VQ.

All the same.
Only the IOMMU layer hypercalls will know how to deal with PASID assignment at the platform layer to set up the domain tables etc.

And this is way beyond our device migration discussion.
If you were implying that VQ to PASID assignment _may_ need trap+emulation, and hence the whole device migration must depend on some trap+emulation, then I surely do not agree with it.

The PASID equivalent in the mlx5 world is ODP_MR+PD isolating the guest process, and all of that has worked on the efficiency and equivalence principle for a decade now without any trap+emulation.

> > When virtio passthrough device is in guest, it has all its PASID accessible.
> >
> > All these is large deviation from current discussion of this series, so I will keep
> it short.
> >
> > >
> > > > Regardless it is not relevant to passthrough mode as PASID is yet
> > > > another
> > > resource.
> > > > And for some cpu if it is trapped, it is generic layer, that does
> > > > not require virtio
> > > involvement.
> > > > So virtio interface asking to trap something because generic
> > > > facility has done
> > > in not the approach.
> > >
> > > This misses the point of PASID. How to use PASID is totally device specific.
> > Sure, and how to virtualize vPASID/pPASID is platform specific as single PASID
> can be used by multiple devices and process.
> 
> See above, I think we're talking about different things.
> 
> >
> > >
> > > >
> > > > > > Capabilities of #2 is generic across all pci devices, so it
> > > > > > will be handled by the
> > > > > HV.
> > > > > > ATS/PRI cap is also generic manner handled by the HV and PCI device.
> > > > >
> > > > > No, ATS/PRI requires the cooperation from the vIOMMU. You can
> > > > > simply do ATS/PRI passthrough but with an emulated vIOMMU.
> > > > And that is not the reason for virtio device to build
> > > > trap+emulation for
> > > passthrough member devices.
> > >
> > > vIOMMU is emulated by hypervisor with a PRI queue,
> > PRI requests arrive on the PF for the VF.
> 
> Shouldn't it arrive at platform IOMMU first? The path should be PRI -> RC ->
> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest IOMMU.
>
The above sequence seems right.
 
> And things will be more complicated when (v)PASID is used. So you can't simply
> let PRI go directly to the guest with the current architecture.
> 
In the current architecture of the PCI VF, PRI does not go directly to the guest.
(And that is not a reason to trap and emulate other things.)

> >
> > > how can you pass
> > > through a hardware PRI request to a guest directly without trapping it then?
> > > What's more, PCIE allows the PRI to be done in a vendor (virtio)
> > > specific way, so you want to break this rule? Or you want to blacklist ATS/PRI
> for virtio?
> > >
> > I was aware of only pci-sig way of PRI.
> > Do you have a reference to the ECN that enables vendor specific way of PRI? I
> would like to read it.
> 
> I mean it doesn't forbid us to build a virtio specific interface for I/O page fault
> report and recovery.
> 
So PCI PRI itself does not allow it. What you meant above is an ODP kind of technique.
Yes, one can build that.
Ok. This is unrelated to device migration, so I will park this good discussion for later.

> > This will be very good to eliminate IOMMU PRI limitations.
> 
> Probably.
> 
> > PRI will directly go to the guest driver, and guest would interact with IOMMU
> to service the paging request through IOMMU APIs.
> 
> With PASID, it can't go directly.
>
When the request contains the PASID in it, it can.
But again, these PCI-SIG PASID extensions are not related to device migration, so I am deferring it.
 
> > For PRI in vendor specific way needs a separate discussion. It is not related to
> live migration.
> 
> PRI itself is not related. But the point is, you can't simply pass through ATS/PRI
> now.
> 
Ah ok. The whole 4K PCI config space, where the ATS/PRI capabilities are located, is trapped+emulated by the hypervisor.
So?
So do we start emulating virtio interfaces too for passthrough?
No.
Can one still continue to trap+emulate?
Sure, why not?

Can one use AQ of this proposal to do so?
Sure, why not?
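
The split argued in this reply can be pictured with a toy model. This is only an illustrative sketch and not code from this series or from any hypervisor; the region names and the is_trapped() helper are hypothetical. It assumes the passthrough model as described above: the PCI config space (where the ATS/PRI capabilities live) is trapped+emulated, while the virtio interfaces go from the guest driver to the device directly.

```python
from enum import Enum


class Access(Enum):
    TRAPPED = "trapped+emulated by the hypervisor"
    DIRECT = "guest driver -> device, no mediation"


# Hypothetical region names, chosen only for illustration.
PASSTHROUGH_MODEL = {
    "pci_config_space": Access.TRAPPED,  # includes ATS/PRI capabilities
    "virtio_common_config": Access.DIRECT,
    "virtio_device_config": Access.DIRECT,
    "cvq": Access.DIRECT,
    "data_vq": Access.DIRECT,
}


def is_trapped(region: str) -> bool:
    """Return True if the hypervisor mediates access to this region."""
    return PASSTHROUGH_MODEL[region] is Access.TRAPPED


if __name__ == "__main__":
    for region, access in PASSTHROUGH_MODEL.items():
        print(f"{region}: {access.value}")
```

The point of the sketch is only that trapping one region does not imply trapping the others; a mediated (non-passthrough) mode would simply mark more regions as trapped.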

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30  3:59                                                                               ` Jason Wang
@ 2023-10-30  4:49                                                                                 ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-30  4:49 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, October 30, 2023 9:30 AM
> 
> 
> On 2023/10/26 11:58, Parav Pandit wrote:
> >
> >> From: Jason Wang <jasowang@redhat.com>
> >> Sent: Thursday, October 26, 2023 6:27 AM
> >>
> >> On Wed, Oct 25, 2023 at 4:33 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> >>> On Tue, Oct 24, 2023 at 06:27:04PM +0800, Zhu, Lingshan wrote:
> >>>>
> >>>> On 10/23/2023 7:32 PM, Michael S. Tsirkin wrote:
> >>>>
> >>>>      On Mon, Oct 23, 2023 at 06:03:10PM +0800, Zhu, Lingshan wrote:
> >>>>
> >>>>          config space, MMIO, registers work for years, what is wrong with
> them?
> >>>>
> >>>>      Nothing as such. They don't seem to be appropriate for all use-case
> >>>>      where people want to utilize virtio. I think a new transport
> >>>>      will be needed to address these.
> >>>>
> >>>> New transport for new type of devices for sure, like transport vq for SIOV.
> >>>>
> >>>> I agree admin vq or admin cmds are useful in some use cases, that
> >>>> is another story, should be case by case.
> >>>>
> >>>> For now, let's don't talk about all-use cases, just for current
> >>>> task, for live migration.
> >>>>
> >>>> So IMHO, I still think we should use config space registers to
> >>>> control live migration process.
> >>>>
> >>>>
> >>> No because it forces integrating migration process with device driver.
> >>> Which is ok for some use-cases but not all of them.  Find some other
> >>> control plane for this.
> >>>
> >>>
> >>>>                  Config space is control path, DMA is data-path,
> >>>> let's better not mix
> >> them,
> >>>>                  we never expect to use config space to transfer data.
> >>>>
> >>>>                  So we need DMA to transfer data, for example I
> >>>> take advantages of
> >> device DMA
> >>>>                  to logging dirty pages, This also applies to in-flight descriptors.
> >>>>
> >>>>              As long as you do, I personally see little benefit to retrieve parts of
> >>>>              state with memory mapped accesses.
> >>>>
> >>>>          registers only control, and I personally believe a single register is
> much
> >>>>          better
> >>>>          than processing admin commands, more light-weight, more
> >>>> reliable,
> >> working
> >>>>          for years.
> >>>>
> >>>>      Yea. It would be, if we could do everything through that register.
> >>>>      But we can't really. Migration has too much data to pass around
> >>>>      for that to be reasonable.
> >>>>
> >>>> data are not transferred by registers, they only control.
> >>>>
> >>>> We transfer data by DMA, the device writes DMA dirty pages
> >>>> information(bitmap) to host isolated memory region.
> >>>>
> >>>
> >>> If you do that then I don't see any reason not to use admin commands
> >>> for that - either through a vq or a simpler interface.
> >> I think we need to agree that admin commands are the only interface
> >> for any future features before we can have an agreement here.
> >>
> > For passthrough member devices, this admin commands to be transported via
> the owner device by the migration driver which sits outside of the guest VM.
> >
> > If this basic requirement is not clear it's pointless to discuss further.
> 
> 
> So my question is not limited to devices that is modeled as group/owner.
>
The current proposal is for hw-based devices using the group/owner model supported by the virtio spec.
I will not debate this further.
MMIO can invent that model when a user really needs it.

> 
> > How is the passthrough device defined? It is defined as [1].
> >
> > Can non passthrough device modes such as vdpa also use these admin
> commands from the owner device?
> > Sure, why not?
> 
> 
> That's fine but it's not my question again.
>
Ok. so one interface is able to serve two use cases now.
Good.
Thanks.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30  4:17                                                                             ` Parav Pandit
@ 2023-10-30 10:02                                                                               ` Zhu, Lingshan
  2023-10-30 10:23                                                                                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-30 10:02 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/30/2023 12:17 PM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, October 30, 2023 9:15 AM
>>
>> On 10/26/2023 3:04 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Thursday, October 26, 2023 12:14 PM
>>>>
>>>>
>>>> On 10/24/2023 6:37 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Tuesday, October 24, 2023 4:00 PM
>>>>>>
>>>>>> On 10/23/2023 6:14 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Monday, October 23, 2023 3:39 PM
>>>>>>>>
>>>>>>>> On 10/20/2023 8:54 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Friday, October 20, 2023 3:01 PM
>>>>>>>>>>
>>>>>>>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit wrote:
>>>>>>>>>>>>>>> Oh, really? Quite interesting, do you want to move all
>>>>>>>>>>>>>>> config space fields in VF to admin vq? Have a plan?
>>>>>>>>>>>>>> Not in my plan for spec 1.4 time frame.
>>>>>>>>>>>>>> I do not want to divert the discussion, would like to focus
>>>>>>>>>>>>>> on device
>>>>>>>>>>>> migration phases.
>>>>>>>>>>>>>> Lets please discuss in some other dedicated thread.
>>>>>>>>>>>>> Possibly, if there's a way to send admin commands to vf
>>>>>>>>>>>>> itself then Lingshan will be happy?
>>>>>>>>>>>> still need to prove why admin commands are better than registers.
>>>>>>>>>>> Virtio spec development is not proof based approach. Please
>>>>>>>>>>> stop asking for
>>>>>>>> it.
>>>>>>>>>>> I tried my best to have technical answer in [1].
>>>>>>>>>>> I explained that registers simply do not work for passthrough
>>>>>>>>>>> mode (if this is what you are asking when you are asking prove
>>>>>>>>>>> its
>>>> better).
>>>>>>>>>>> They can work for non_passthrough mediated mode.
>>>>>>>>>>>
>>>>>>>>>>> A member device may do admin commands using registers. Michael
>>>>>>>>>>> and I are
>>>>>>>>>> discussing presently in the same thread.
>>>>>>>>>>> Since there are multiple things to be done for device
>>>>>>>>>>> migration, dedicated
>>>>>>>>>> register set for each functionality do not scale well, hard to
>>>>>>>>>> maintain and extend.
>>>>>>>>>>> A register holding a command content make sense.
>>>>>>>>>>>
>>>>>>>>>>> Now, with that, if this can be useful only for
>>>>>>>>>>> non_passthrough, I made humble
>>>>>>>>>> request to transport them using AQ, this way, you get all benefits of
>> AQ.
>>>>>>>>>>> And trying to understand, why AQ cannot possible or inferior?
>>>>>>>>>>>
>>>>>>>>>>> If you have commands like suspend/resume device, register or
>>>>>>>>>>> queue
>>>>>>>>>> transport simply don’t work, because it's wrong to bifurcate
>>>>>>>>>> the device with such weird API.
>>>>>>>>>>> If you want to biferacate for mediation software, it probably
>>>>>>>>>>> makes sense to
>>>>>>>>>> operate at each VQ level, config space level. Such are very
>>>>>>>>>> different commands than passthrough.
>>>>>>>>>>> I think vdpa has demonstrated that very well on how to do
>>>>>>>>>>> specific work for
>>>>>>>>>> specific device type. So some of those work can be done using AQ.
>>>>>>>>>>> [1]
>>>>>>>>>>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
>>>>>>>>>> We have been through your statement for many times.
>>>>>>>>>> This is not about how many times you repeated, if you think
>>>>>>>>>> this is true, you need to prove that with solid evidence.
>>>>>>>>>>
>>>>>>>>> I will not respond to this comment anymore.
>>>>>>>> Ok if you choose not to respond.
>>>>>>>>>> For pass-through, I still recommend you to take a reference of
>>>>>>>>>> current virito-pci implementation, it works for pass-through, right?
>>>>>>>>> What do you mean by current virtio-pci implementation?
>>>>>>>> current virito-pci works for pass-through
>>>>>>> I still don’t understand what is "current virtio-pci".
>>>>>>> Do you mean qemu implementation of emulated virtio-pci or you mean
>>>>>> virtio-pci specification for passthrough?
>>>>>>> What do you want me to refer to for passthrough? Please clarify.
>>>>>> you know guest vcpu and its vRC can not access host side devices,
>>>>>> and there must be a driver helping the pass-through use cases, like
>>>>>> vDPA and vfio
>>>>> I am not sure how to corelate this answer to the question of
>>>>> "virtio-pci for
>>>> passthrough".
>>>>> :(
>>>>>
>>>>> Today when a virtio-pci member device is passthrough to the guest
>>>>> VM,
>>>> hypervisor is not involved in virtio interface such as config space,
>>>> cvq, data vq etc.
>>>>> Do you agree?
>>> You didn’t respond yet to this question.
>>> Can you please respond?
>> Not sure which question you refer to that not answered, Agree what?
> What is listed above.
>
>> Please don't cut off the thread until
>> the issue closed.
>>
> I didn’t cut off the thread. Please check your email client.
>   
>> If you are asking whether hypervisor is involved in accessing virtio interfaces.
>> For passthrough, guest needs a host side help driver to access hardware, and
>> explained below.
> Not an accurate answer. Please answer above.
> Repeating the question again.
> For passthrough device virtio interfaces such as common and device config space, cvq, data vqs, are NOT accessed by the hypervisor.
> Do you agree?
Did you fail to process the answer?

Let me repeat again, for the last time.

The guest can not access any host side devices without a "pass-through 
helper driver".

And the helper driver could be considered a part of the hypervisor;
otherwise the guest vCPU cannot access the host side devices.

For example, the path is hw-->vfio_pci-->qemu-->guest. IT IS NOT hw-->guest.

You can try to take a look at how virtio-pci works for QEMU.

If you fail to understand this, then I don't see any need to
discuss this topic anymore.
>
>
>>>> Can vCPU access host side device config space? It needs a
>>>> pass-through helper driver like vfio, right?
>>> Right.
>>> And if you are implying that, because generic pci config space is intercepted
>> hence, all virtio common and device specific things MUST BE ALWAYS
>> intercepted as well.
>>> Then I do not agree with such derivation.
>> This is not only virtio, guest needs a helper to access host devices.
> Does not make sense until you reply above.
see above
>
>>> The main reasons are:
>>> 1. It breaks the future TDISP model
>> Not sure why you bring TDISP again, I thought we agree this is closed.
>>
> Not to include TDISP in current spec, but the mechanism/infrastructure built applies to the future mode as well.
Then discuss it in the future.
>
>> How it break TDISP? Can you let guest driver access host device whiteout a host
>> side helper like VFIO?
> Yes, once it is passthrough virtio interface is a secure channel.
> In TDISP config space is still communicated via hypervisor and it contains all the data that is not critical.
> Hence, there must not be any virtio registers to place in there.
> In future if one discovers config space as problematic, one will find generic solution for all the pci devices, not just virtio.
Interesting. How does the guest vCPU access host side devices without a
helper driver, even over a secure channel?

So do you mean even queue_enable should not be there? Really interesting.

>
>> And TDISP says you should not trust PF, thus you should not use admin vq on PF
>> for live migration.
> There are few options which will evolve.
> 1. PF will be handed over to the TVM instead of hypervisor
> 2. PF aq communication will be encrypted hence not visible to hypervisor, also supported by PCI-SIG already
> 3. Some other options
> Since this is the generic solution across virtio and non_virtio, we can rely on wider wisdom of PCI-SIG.
TDISP says don't trust PF anyway.
>
>>> 2. Without hypervisor getting involved, all the member device MMIO
>>> space is accessible which follows the efficiency and equivalency
>>> principle of Jason listed paper
>>>
>>> I hope you are not implying to trap+emulate virtio interfaces (which is not
>> listed in the pci-spec) in hypervisor for member passthrough devices.
>> Do you agree mmap the bars(interfaces) without doing anything is also a type of
>> "trap and emulate"?
> Certainly not.
> Memory mapping enables guest to _directly_ communicate with the device without any VMEXITS.
> In TDISP world this is also even secured already.
> So no, it is not trap and emulate.
Interesting. Do you know that virtualization is built on "trap and emulate"?
And pass-through is a special case of "trap and emulate".

If you want to discuss TDISP, then a TDISP secured device only accepts
TLPs from the owner, which means it only supports the special case, and
that is a limitation of TDISP.

But generally speaking, you can always choose to trap and emulate any fields
in the BAR.

Why is TDISP related to the current live migration proposal?
Why are we discussing this?
>
>>>>>>>>>> For scale, I already told you for many times that they are
>>>>>>>>>> per-device facilities. How can a per-device facility not scale?
>>>>>>>>> Each VF device must implement new set of on-chip memory-based
>>>>>>>>> registers
>>>>>>>> which demands more power, die area which does not scale
>>>>>>>> efficiently to thousands of VFs.
>>>>>>>> that can be fpga gates or SOC implementing new features, you
>>>>>>>> think that is a waste?
>>>>>>> It is waste in hw, if there is a better approach possible to not
>>>>>>> burn them as
>>>>>> gates and save on resources for rarely used items.
>>>>>> Is a new entry in MSIX table a waste of HW?
>>>>> Not as must as existing MSI-X table entries which requires linear
>>>>> amount of
>>>> on-chip memory.
>>>> anyway, even only one MSIX entry cost my HW resource than the amount
>>>> of new registers in my proposal.
>>> Yes, this is why new MSI-X proposals are on table to improve, the first known
>> approach to me was from Intel using IMS.
>>> Hence, virtio already learnt it seen in the Appendix to not keep adding non init
>> time registers.
>> non-sense to me, IMS still uses MSI
> Clearly not.
> May be you missed something.
>
> IMS enables once to use non registers for the interrupt store unlike MSI/MSI-X.
>
> Please see the commit log comment, snippet here about "queue memory".
>
>         - The interrupt chip must provide the following optional callbacks
>           when the irq_mask(), irq_unmask() and irq_write_msi_msg() callbacks
>           cannot operate directly on hardware, e.g. in the case that the
>           interrupt message store is in queue memory:
>
> IRQ chips callback irq_write_msi_msg() has no such limitation to store in registers.
Please re-read my answer. I said IMS uses MSI; I didn't say it re-uses PCI
MSI entries.
>
>>>>>> Can I say implementing admin vq in SOC is a waste of cores?
>>>>> Which cores in the SoC?
>>>>> If it is on the PF, there is only handful of AQs for scale of N VFs.
>>>> I see you got the point anyway, new features cost extra resource
>>>>>>>>>> vDPA works fine on config space.
>>>>>>>>>>
>>>>>>>>>> So, if you still insist admin vq is better than config space
>>>>>>>>>> like in other thread you have concluded, you may imply that
>>>>>>>>>> config space interfaces should be re-factored to admin vq.
>>>>>>>>> Whatever is done in past is done, there is no way to change history.
>>>>>>>>> An new non init time registers should not be placed in device
>>>>>>>>> specific config
>>>>>>>> space as virtio spec has clear guideline on it for good.
>>>>>>>>> Device context reading, dirty page address reading, changing vf
>>>>>>>>> device modes,
>>>>>>>> all of these are clearly not a init time settings.
>>>>>>>>> Hence, they do not belong to the registers.
>>>>>>>> reset vq? and you get it from Appendix B. Creating New Device
>>>>>>>> Types, are we implementing a new type of device???
>>>>>>> I don’t understand your question.
>>>>>>> I replied the history of reset_vq.
>>>>>>> Take good examples to follow, reset_vq clearly is not the one.
>>>>>> so again, we are not implementing new device type, so your citation
>>>>>> doesn't apply.
>>>>> I disagree.
>>>>> I am engineer to build practical systems considering limitations and
>>>>> also advancements of the transport; while listening to other
>>>>> industry efforts, I
>>>> am no from legal department.
>>>>> Hence, Appendix B makes a sense to me to apply to the existing
>>>>> device which
>>>> also has the section for "device improvements".
>>>> it titled as "new device", and I think this discussion is non-sense.
>>>> So if you want to fix this statement, works for me.



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30 10:02                                                                               ` Zhu, Lingshan
@ 2023-10-30 10:23                                                                                 ` Parav Pandit
  2023-10-30 11:34                                                                                   ` Michael S. Tsirkin
  2023-10-31  9:42                                                                                   ` Zhu, Lingshan
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-30 10:23 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, October 30, 2023 3:33 PM
> 
> On 10/30/2023 12:17 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, October 30, 2023 9:15 AM
> >>
> >> On 10/26/2023 3:04 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Thursday, October 26, 2023 12:14 PM
> >>>>
> >>>>
> >>>> On 10/24/2023 6:37 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Tuesday, October 24, 2023 4:00 PM
> >>>>>>
> >>>>>> On 10/23/2023 6:14 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Monday, October 23, 2023 3:39 PM
> >>>>>>>>
> >>>>>>>> On 10/20/2023 8:54 PM, Parav Pandit wrote:
> >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> Sent: Friday, October 20, 2023 3:01 PM
> >>>>>>>>>>
> >>>>>>>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
> >>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit
> wrote:
> >>>>>>>>>>>>>>> Oh, really? Quite interesting, do you want to move all
> >>>>>>>>>>>>>>> config space fields in VF to admin vq? Have a plan?
> >>>>>>>>>>>>>> Not in my plan for spec 1.4 time frame.
> >>>>>>>>>>>>>> I do not want to divert the discussion, would like to
> >>>>>>>>>>>>>> focus on device
> >>>>>>>>>>>> migration phases.
> >>>>>>>>>>>>>> Lets please discuss in some other dedicated thread.
> >>>>>>>>>>>>> Possibly, if there's a way to send admin commands to vf
> >>>>>>>>>>>>> itself then Lingshan will be happy?
> >>>>>>>>>>>> still need to prove why admin commands are better than
> registers.
> >>>>>>>>>>> Virtio spec development is not proof based approach. Please
> >>>>>>>>>>> stop asking for
> >>>>>>>> it.
> >>>>>>>>>>> I tried my best to have technical answer in [1].
> >>>>>>>>>>> I explained that registers simply do not work for
> >>>>>>>>>>> passthrough mode (if this is what you are asking when you
> >>>>>>>>>>> are asking prove its
> >>>> better).
> >>>>>>>>>>> They can work for non_passthrough mediated mode.
> >>>>>>>>>>>
> >>>>>>>>>>> A member device may do admin commands using registers.
> >>>>>>>>>>> Michael and I are
> >>>>>>>>>> discussing presently in the same thread.
> >>>>>>>>>>> Since there are multiple things to be done for device
> >>>>>>>>>>> migration, dedicated
> >>>>>>>>>> register set for each functionality do not scale well, hard
> >>>>>>>>>> to maintain and extend.
> >>>>>>>>>>> A register holding a command content make sense.
> >>>>>>>>>>>
> >>>>>>>>>>> Now, with that, if this can be useful only for
> >>>>>>>>>>> non_passthrough, I made humble
> >>>>>>>>>> request to transport them using AQ, this way, you get all
> >>>>>>>>>> benefits of
> >> AQ.
> >>>>>>>>>>> And trying to understand, why AQ cannot possible or inferior?
> >>>>>>>>>>>
> >>>>>>>>>>> If you have commands like suspend/resume device, register or
> >>>>>>>>>>> queue
> >>>>>>>>>> transport simply don’t work, because it's wrong to bifurcate
> >>>>>>>>>> the device with such weird API.
> >>>>>>>>>>> If you want to biferacate for mediation software, it
> >>>>>>>>>>> probably makes sense to
> >>>>>>>>>> operate at each VQ level, config space level. Such are very
> >>>>>>>>>> different commands than passthrough.
> >>>>>>>>>>> I think vdpa has demonstrated that very well on how to do
> >>>>>>>>>>> specific work for
> >>>>>>>>>> specific device type. So some of those work can be done using AQ.
> >>>>>>>>>>> [1]
> >>>>>>>>>>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
> >>>>>>>>>> We have been through your statement for many times.
> >>>>>>>>>> This is not about how many times you repeated, if you think
> >>>>>>>>>> this is true, you need to prove that with solid evidence.
> >>>>>>>>>>
> >>>>>>>>> I will not respond to this comment anymore.
> >>>>>>>> Ok if you choose not to respond.
> >>>>>>>>>> For pass-through, I still recommend you to take a reference
> >>>>>>>>>> of current virito-pci implementation, it works for pass-through,
> right?
> >>>>>>>>> What do you mean by current virtio-pci implementation?
> >>>>>>>> current virito-pci works for pass-through
> >>>>>>> I still don’t understand what is "current virtio-pci".
> >>>>>>> Do you mean qemu implementation of emulated virtio-pci or you
> >>>>>>> mean
> >>>>>> virtio-pci specification for passthrough?
> >>>>>>> What do you want me to refer to for passthrough? Please clarify.
> >>>>>> you know guest vcpu and its vRC can not access host side devices,
> >>>>>> and there must be a driver helping the pass-through use cases,
> >>>>>> like vDPA and vfio
> >>>>> I am not sure how to corelate this answer to the question of
> >>>>> "virtio-pci for
> >>>> passthrough".
> >>>>> :(
> >>>>>
> >>>>> Today when a virtio-pci member device is passthrough to the guest
> >>>>> VM,
> >>>> hypervisor is not involved in virtio interface such as config
> >>>> space, cvq, data vq etc.
> >>>>> Do you agree?
> >>> You didn’t respond yet to this question.
> >>> Can you please respond?
> >> Not sure which question you refer to that not answered, Agree what?
> > What is listed above.
> >
> >> Please don't cut off the thread until the issue closed.
> >>
> > I didn’t cut off the thread. Please check your email client.
> >
> >> If you are asking whether hypervisor is involved in accessing virtio
> interfaces.
> >> For passthrough, guest needs a host side help driver to access
> >> hardware, and explained below.
> > Not an accurate answer. Please answer above.
> > Repeating the question again.
> > For passthrough device virtio interfaces such as common and device config
> space, cvq, data vqs, are NOT accessed by the hypervisor.
> > Do you agree?
> Did you fail to process the answer?
> 
Yes, there was no answer to the simple question above.
You replied with a counter-question.

> Let me repeat again, for the last time.
> 
> The guest can not access any host side devices without a "pass-through helper
> driver".
He he, now you generalize it as "guest" when the question was about the exact definition of passthrough.
Of course there is a helper driver for the PCI config space.
But you took it for granted that because X is trap+emulated, X+Y is also trap+emulated.

> 
> And the helper driver could be considered as a part of the hypervisor, or the
> guest vCPU can not access the host side devices.
> 
> For example, the path is hw-->vfio_pci-->qemu-->guest. IT IS NOT hw-->guest.
> 
In the virtio spec we only talk about the driver and the device in the context of a passthrough device.
So for virtio common config, device config, cvq, and data vqs, the path is guest driver -> device.
There is no other entity in between.

> You can try to take a look at how virtio-pci works for QEMU.
> 
When I asked you what virtio-pci is in the context of passthrough, you couldn't say what that component is.

> If you failed to understand this, then I don't see any necessities to discuss on this
> topic anymore.
The discussion is about passthrough. 😊
You are talking about some vpci composition.

> >
> >
> >>>> Can vCPU access host side device config space? It needs a
> >>>> pass-through helper driver like vfio, right?
> >>> Right.
> >>> And if you are implying that, because generic pci config space is
> >>> intercepted
> >> hence, all virtio common and device specific things MUST BE ALWAYS
> >> intercepted as well.
> >>> Then I do not agree with such derivation.
> >> This is not only virtio, guest needs a helper to access host devices.
> > Does not make sense until you reply above.
> see above
It still does not make sense with your implied reply.

> >
> >>> The main reasons are:
> >>> 1. It breaks the future TDISP model
> >> Not sure why you bring TDISP again, I thought we agree this is closed.
> >>
> > Not to include TDISP in current spec, but the mechanism/infrastructure built
> applies to the future mode as well.
> then discuss in future.
> >
> >> How it break TDISP? Can you let guest driver access host device
> >> without a host side helper like VFIO?
> > Yes, once it is passthrough virtio interface is a secure channel.
> > In TDISP config space is still communicated via hypervisor and it contains all
> the data that is not critical.
> > Hence, there must not be any virtio registers to place in there.
> > In future if one discovers config space as problematic, one will find generic
> solution for all the pci devices, not just virtio.
> interesting, how does guest vCPU access host side devices without a helper
> driver, even a secure channel?
A helper driver maps the PCI memory of the passthrough device into the guest VM for direct access.
It then locks the TDISP configuration so that the hypervisor cannot change this mapping in the future.

After this control plane step, the helper driver is no longer in the picture.
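
The split described above (a one-time control plane setup, then direct access) can be sketched as a toy Python model. This is only an illustration under stated assumptions: it is not VFIO or TDISP code, and every name in it (ToyHost, ToyGuest, map_bar_to_guest, lock_mapping, config_access, demo) is hypothetical. It only counts how often the host side is involved.

```python
class ToyHost:
    """Toy host side; 'traps' counts every time the host is involved."""

    def __init__(self):
        self.traps = 0
        self.device_bar = {}   # stands in for the member device MMIO registers
        self.mapped = False
        self.locked = False

    # Control plane: performed once by the helper driver.
    def map_bar_to_guest(self):
        if self.locked:
            raise PermissionError("mapping is locked")
        self.traps += 1
        self.mapped = True

    def lock_mapping(self):
        # Models the lock step: later changes to the mapping are refused.
        self.traps += 1
        self.locked = True

    def unmap_bar(self):
        if self.locked:
            raise PermissionError("mapping is locked")
        self.mapped = False

    # Trapped path: a PCI config space access goes through the host.
    def config_access(self, offset):
        self.traps += 1
        return 0  # emulated value; the content does not matter here


class ToyGuest:
    def __init__(self, host):
        self.host = host

    def mmio_write(self, reg, value):
        # Direct path: no host involvement once the BAR is mapped.
        if not self.host.mapped:
            raise RuntimeError("BAR not mapped")
        self.host.device_bar[reg] = value


def demo():
    host = ToyHost()
    host.map_bar_to_guest()
    host.lock_mapping()
    setup_traps = host.traps
    guest = ToyGuest(host)
    for i in range(1000):
        guest.mmio_write("queue_notify", i)
    # Host involvement after setup vs. after 1000 direct writes.
    return setup_traps, host.traps
```

In this model the trap count does not grow with the number of MMIO writes, while each config_access() call would still add one.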

> 
> So do you mean even queue_enable should not be there??? really interesting.
> 
100% yes, it must not be there.

> >
> >> And TDISP says you should not trust PF, thus you should not use admin
> >> vq on PF for live migration.
> > There are few options which will evolve.
> > 1. PF will be handed over to the TVM instead of hypervisor 2. PF aq
> > communication will be encrypted hence not visible to hypervisor, also
> > supported by PCI-SIG already 3. Some other options Since this is the
> > generic solution across virtio and non_virtio, we can rely on wider wisdom of
> PCI-SIG.
> TDISP says don't trust PF anyway.
Ok. this is why it will be encrypted or will be dedicated to a live migration portion of the device.

> >
> >>> 2. Without hypervisor getting involved, all the member device MMIO
> >>> space is accessible which follows the efficiency and equivalency
> >>> principle of Jason listed paper
> >>>
> >>> I hope you are not implying to trap+emulate virtio interfaces (which
> >>> is not
> >> listed in the pci-spec) in hypervisor for member passthrough devices.
> >> Do you agree mmap the bars(interfaces) without doing anything is also
> >> a type of "trap and emulate"?
> > Certainly not.
> > Memory mapping enables guest to _directly_ communicate with the device
> without any VMEXITS.
> > In TDISP world this is also even secured already.
> > So no, it is not trap and emulate.
> Interesting. do you know virtualization is built on "trap and emulate"?
> and pass-through is a special case of "trap and emulate"
> 
Huh, there is no such definition.
It is not relevant here anyway; I am not going to discuss things that are outside the scope of this discussion.

As I have repeatedly said, you continue to think that trap+emulation is the _only_ way to make progress for member devices.
That is already out of the question, as the TC long ago joined the rest of the industry in having hw-based devices.
SR-IOV member devices are the first of their kind.

> If you want to discuss TDISP, then TDISP secured device only accepts TLP from
> the owner, that means it only support the special case, and that is an limitation
> of TDISP.
> 
> But generally speaking, you can always choose to trap and emulate any fields in
> the bar.
Generally yes.
For passthrough, there is no such need of extra layers when the member device already has it.

> 
> Why is TDISP related to current live migration proposal?
> Why we are discussing this?
The scheme proposed here aligns with a future in which trap+emulation that bifurcates the device is not present.

> >
> >>>>>>>>>> For scale, I already told you for many times that they are
> >>>>>>>>>> per-device facilities. How can a per-device facility not scale?
> >>>>>>>>> Each VF device must implement new set of on-chip memory-based
> >>>>>>>>> registers
> >>>>>>>> which demands more power, die area which does not scale
> >>>>>>>> efficiently to thousands of VFs.
> >>>>>>>> that can be fpga gates or SOC implementing new features, you
> >>>>>>>> think that is a waste?
> >>>>>>> It is waste in hw, if there is a better approach possible to not
> >>>>>>> burn them as
> >>>>>> gates and save on resources for rarely used items.
> >>>>>> Is a new entry in MSIX table a waste of HW?
> >>>>> Not as much as existing MSI-X table entries which requires linear
> >>>>> amount of
> >>>> on-chip memory.
> >>>> anyway, even only one MSIX entry costs more HW resources than the
> >>>> amount of new registers in my proposal.
> >>> Yes, this is why new MSI-X proposals are on table to improve, the
> >>> first known
> >> approach to me was from Intel using IMS.
> >>> Hence, virtio already learnt it seen in the Appendix to not keep
> >>> adding non init
> >> time registers.
> >> non-sense to me, IMS still uses MSI
> > Clearly not.
> > May be you missed something.
> >
> > IMS enables one to use non-register storage for the interrupt message store, unlike MSI/MSI-X.
> >
> > Please see the commit log comment, snippet here about "queue memory".
> >
> >         - The interrupt chip must provide the following optional callbacks
> >           when the irq_mask(), irq_unmask() and irq_write_msi_msg() callbacks
> >           cannot operate directly on hardware, e.g. in the case that the
> >           interrupt message store is in queue memory:
> >
> > IRQ chips callback irq_write_msi_msg() has no such limitation to store in
> registers.
> Please re-read my answer, I said: IMS uses MSI, I didn't say re-use PCI msi
> entries.
Your answer is not relevant to this discussion at all.
Why?
Because we were discussing the schemes where registers are not used.
One example of that was IMS. It does not matter whether it is MSI or MSI-X.
As explained in Intel's commit message, the key to focus for IMS is "queue memory" not some hw register like MSI or MSI-X.

> >
> >>>>>> Can I say implementing admin vq in SOC is a waste of cores?
> >>>>> Which cores in the SoC?
> >>>>> If it is on the PF, there is only handful of AQs for scale of N VFs.
> >>>> I see you got the point anyway, new features cost extra resource
> >>>>>>>>>> vDPA works fine on config space.
> >>>>>>>>>>
> >>>>>>>>>> So, if you still insist admin vq is better than config space
> >>>>>>>>>> like in other thread you have concluded, you may imply that
> >>>>>>>>>> config space interfaces should be re-factored to admin vq.
> >>>>>>>>> Whatever is done in past is done, there is no way to change history.
> >>>>>>>>> Any new non-init-time registers should not be placed in device
> >>>>>>>>> specific config
> >>>>>>>> space as virtio spec has clear guideline on it for good.
> >>>>>>>>> Device context reading, dirty page address reading, changing
> >>>>>>>>> vf device modes,
> >>>>>>>> all of these are clearly not a init time settings.
> >>>>>>>>> Hence, they do not belong to the registers.
> >>>>>>>> reset vq? and you get it from Appendix B. Creating New Device
> >>>>>>>> Types, are we implementing a new type of device???
> >>>>>>> I don’t understand your question.
> >>>>>>> I replied the history of reset_vq.
> >>>>>>> Take good examples to follow, reset_vq clearly is not the one.
> >>>>>> so again, we are not implementing new device type, so your
> >>>>>> citation doesn't apply.
> >>>>> I disagree.
> >>>>> I am engineer to build practical systems considering limitations
> >>>>> and also advancements of the transport; while listening to other
> >>>>> industry efforts, I
> >>>> am not from the legal department.
> >>>>> Hence, Appendix B makes a sense to me to apply to the existing
> >>>>> device which
> >>>> also has the section for "device improvements".
> >>>> it titled as "new device", and I think this discussion is non-sense.
> >>>> So if you want to fix this statement, works for me.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  7:04                                                                         ` Parav Pandit
  2023-10-30  3:44                                                                           ` Zhu, Lingshan
@ 2023-10-30 11:27                                                                           ` Michael S. Tsirkin
  2023-10-30 11:48                                                                             ` Parav Pandit
  2023-10-31  9:45                                                                             ` Zhu, Lingshan
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-30 11:27 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 26, 2023 at 07:04:38AM +0000, Parav Pandit wrote:
> 1. It breaks the future TDISP model

I really think bringing in TDISP muddies the waters a lot and
should be avoided. We simply won't know until someone does the
legwork and proposes the necessary spec extensions.
In particular current legacy access commands are I thn

> 2. Without hypervisor getting involved, all the member device MMIO space is accessible which follows the efficiency and equivalency principle of Jason listed paper
> 
> I hope you are not implying to trap+emulate virtio interfaces (which is not listed in the pci-spec) in hypervisor for member passthrough devices.

I feel this discussion will keep meandering because the terminology is
vague. There's no single thing that is called "passthrough" -
vendors just build what is expedient with current hardware and
software. Nvidia has a bunch of people working on vfio so they
call that passthrough, Red Hat has people working on VDPA and
they call that passthrough, etc.

Before I mute this discussion for good, does anyone here have any
feeling progress is made? What kind of progress?

-- 
MST

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30 10:23                                                                                 ` Parav Pandit
@ 2023-10-30 11:34                                                                                   ` Michael S. Tsirkin
  2023-10-30 12:02                                                                                     ` Parav Pandit
  2023-10-31  9:35                                                                                     ` Zhu, Lingshan
  2023-10-31  9:42                                                                                   ` Zhu, Lingshan
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-30 11:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Mon, Oct 30, 2023 at 10:23:14AM +0000, Parav Pandit wrote:
> > And the helper driver could be considered as a part of the hypervisor, or the
> > guest vCPU can not access the host side devices.
> > 
> > For example, the path is hw-->vfio_pci-->qemu-->guest. IT IS NOT hw-->guest.

I think above makes sense. Nvidia decided to standardize on VFIO and
that's ok, but there's no point in calling specifically VFIO "true
passthrough" or whatever the marketing term du jour is. This is
not a VFIO TC here. I do wish one of the sides in this discussion
stopped promoting their architecture and the one true way and
tried to actually build interfaces addressing multiple architectures,
though. Otherwise we'll keep getting stuck.

> In virtio spec we only talk about the driver, and the device in context of passthrough device.
> So for virtio common config, dev config, cvq, data vqs are guest driver -> device.
> There is no other entity inbetween.

No longer true - with admin commands we have 2 devices: owner and
member, and each has its own driver.
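
As a minimal sketch of the owner/member split mentioned above: an admin command is issued by the owner device's driver and names the member it targets. The Python below packs the device-readable header of struct virtio_admin_cmd as I recall it from the virtio 1.3 admin command section (le16 opcode, le16 group_type, u8 reserved1[12], le64 group_member_id); please check the field layout against the spec. The opcode value and the helper name build_admin_cmd are hypothetical, chosen only for this example.

```python
import struct

# Device-readable header of struct virtio_admin_cmd (layout assumed from
# the virtio 1.3 spec): le16 opcode, le16 group_type, 12 reserved bytes,
# le64 group_member_id. "<" means little-endian with no padding added.
ADMIN_CMD_HDR = struct.Struct("<HH12xQ")

GROUP_TYPE_SRIOV = 1           # SR-IOV group type
HYPOTHETICAL_OPCODE = 0x7F00   # placeholder, not a real spec opcode


def build_admin_cmd(opcode, vf_number, payload=b""):
    """Build the bytes the owner (PF) driver would submit for one member (VF)."""
    return ADMIN_CMD_HDR.pack(opcode, GROUP_TYPE_SRIOV, vf_number) + payload


# The owner driver addresses member VF 3; the member's own driver is not involved.
cmd = build_admin_cmd(HYPOTHETICAL_OPCODE, 3)
opcode, group_type, member_id = ADMIN_CMD_HDR.unpack(cmd[:ADMIN_CMD_HDR.size])
```

The member's config space, cvq, and data vqs are still driven by the member driver; only commands built this way travel through the owner.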

-- 
MST



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30 11:27                                                                           ` Michael S. Tsirkin
@ 2023-10-30 11:48                                                                             ` Parav Pandit
  2023-10-31  9:45                                                                             ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-30 11:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Monday, October 30, 2023 4:58 PM
> 
> On Thu, Oct 26, 2023 at 07:04:38AM +0000, Parav Pandit wrote:
> > 1. It breaks the future TDISP model
> 
> I really think bringing in TDISP muddies the waters a lot and should be avoided.
> We simply won't know until someone does the legwork and proposed the
> necessary spec extensions.
> In particular current legacy access commands are I thn
> 
> 
> > 2. Without hypervisor getting involved, all the member device MMIO
> > space is accessible which follows the efficiency and equivalency
> > principle of Jason listed paper
> >
> > I hope you are not implying to trap+emulate virtio interfaces (which is not
> listed in the pci-spec) in hypervisor for member passthrough devices.
> 
> I feel this discussion will keep meandering because the terminology is vague.
> There's no single thing that is called "passthrough" - vendors just build what is
> expedient with current hardware and software. Nvidia has a bunch of people
> working on vfio so they call that passthrough, Red Hat has people working on
> VDPA and they call that passthrough, etc.
> 
> 
> Before I mute this discussion for good, does anyone here have any feeling
> progress is made? What kind of progress?

I received valuable comments from you and some from Jason, and some were offline.
I will post v3 that fits the current OS use case for vfio and vdpa using the current admin command infrastructure.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30 11:34                                                                                   ` Michael S. Tsirkin
@ 2023-10-30 12:02                                                                                     ` Parav Pandit
  2023-10-31  9:35                                                                                     ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-10-30 12:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, October 30, 2023 5:05 PM
> 
> On Mon, Oct 30, 2023 at 10:23:14AM +0000, Parav Pandit wrote:
> > > And the helper driver could be considered as a part of the
> > > hypervisor, or the guest vCPU can not access the host side devices.
> > >
> > > For example, the path is hw-->vfio_pci-->qemu-->guest. IT IS NOT hw--
> >guest.
> 
> I think above makes sense. Nvidia decided to standardize on VFIO and that's ok,
> but there's no point in calling specifically VFIO "true passthrough" or whatever
> the marketing term du jour is. This is not a VFIO TC here. I do wish one of the
> sides in this discussion stopped promoting their architecture and the one true
> way and tried to actually build interfaces addressing multiple architectures,
> though. Otherwise we'll keep getting stuck.
> 
There are more companies than Nvidia here.
And yes, for sure I am not promoting a vfio-only use case.
From the beginning it has been listed as one use case, since the member device and the owner device follow the same virtio access interface whether the driver is present in the guest or in the hypervisor.

> > In virtio spec we only talk about the driver, and the device in context of
> passthrough device.
> > So for virtio common config, dev config, cvq, data vqs are guest driver ->
> device.
> > There is no other entity inbetween.
> 
> No longer true - with admin commands we have 2 devices: owner and member,
> and each has its own driver.
Sure. When describing config space, cvq, and data vq access, we don't introduce terms such as owner and member,
because from the spec and device point of view it is simply the driver accessing the device.

Only the admin commands or group commands talk about owner and member drivers.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-30  4:46                                                             ` Parav Pandit
@ 2023-10-31  1:34                                                               ` Jason Wang
  2023-10-31  5:30                                                                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-31  1:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Jason Wang
> >
> > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > >
> > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > >
> > > > > > How do you know that?
> > > > > Because for passthrough, the hypervisor is not involved in dealing
> > > > > with VQ at
> > > > all.
> > > >
> > > > Ok, so if I understand correctly, you are saying your design can't
> > > > work for the case of PASID assignment.
> > > >
> > > No. PASID assignment will happen from the guest for its own use and device
> > migration will just work fine because device context will capture this.
> >
> > It's not about device context. We're discussing "passthrough", no?
> >
> Not sure, we are discussing same.
> A member device is passthrough to the guest, dealing with its own PASIDs and virtio interface for some VQ assignment to PASID.
> So VQ context captured by the hypervisor, will have some PASID attached to this VQ.
> Device context will be updated.
>
> > You want all virtio stuff to be "passthrough", but assigning a PASID to a specific
> > virtqueue in the guest must be trapped.
> >
> No. PASID assignment to a specific virtqueue in the guest must go directly from guest to device.

This works like setting CR3, you can't simply let it go from guest to host.

Host IOMMU driver needs to know the PASID to program the IO page
tables correctly.

> When guest iommu may need to communicate anything for this PASID, it will come through its proper IOMMU channel/hypercall.

Let's say using PASID X for queue 0, this knowledge is beyond the
IOMMU scope but belongs to virtio. Or please explain how it can work
when it goes directly from guest to device.
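
A toy Python model of the concern raised here. It assumes a platform where the PASID seen by the guest (vPASID) differs from the physical PASID (pPASID) programmed into the host IOMMU; whether a given platform works that way is exactly what is being debated in this thread. All names (ToyIommu, ToyDevice, bind_guest_pasid, set_vq_pasid) are hypothetical.

```python
class ToyIommu:
    """Host IOMMU: only pPASIDs programmed by the host have page tables."""

    def __init__(self):
        self.page_tables = {}   # pPASID -> {iova: host physical address}
        self.v2p = {}           # vPASID -> pPASID, known only to the host
        self._next_ppasid = 100

    def bind_guest_pasid(self, vpasid, mappings):
        ppasid = self._next_ppasid
        self._next_ppasid += 1
        self.v2p[vpasid] = ppasid
        self.page_tables[ppasid] = dict(mappings)
        return ppasid

    def translate(self, pasid, iova):
        table = self.page_tables.get(pasid)
        if table is None or iova not in table:
            raise LookupError(f"IOMMU fault: PASID {pasid}, IOVA {iova:#x}")
        return table[iova]


class ToyDevice:
    def __init__(self, iommu):
        self.iommu = iommu
        self.vq_pasid = {}      # virtqueue index -> PASID used for its DMA

    def set_vq_pasid(self, vq, pasid):
        self.vq_pasid[vq] = pasid

    def dma(self, vq, iova):
        return self.iommu.translate(self.vq_pasid[vq], iova)


iommu = ToyIommu()
dev = ToyDevice(iommu)
iommu.bind_guest_pasid(5, {0x1000: 0xABC000})   # guest uses vPASID 5

# Guest writes its vPASID straight to the device: DMA faults in this model.
dev.set_vq_pasid(0, 5)
try:
    dev.dma(0, 0x1000)
    direct_ok = True
except LookupError:
    direct_ok = False

# Host sees the assignment and substitutes the pPASID: DMA translates.
dev.set_vq_pasid(0, iommu.v2p[5])
trapped_result = dev.dma(0, 0x1000)
```

On a platform where the guest is handed physical PASIDs directly, v2p would be the identity mapping and both paths would behave the same.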

> Virtio device is not the conduit for this exchange.
>
> > >
> > > > >
> > > > > > There are works ongoing to make vPASID work for the guest like vSVA.
> > > > > > Virtio doesn't differ from other devices.
> > > > > Passthrough do not run like SVA.
> > > >
> > > > Great, you find another limitation of "passthrough" by yourself.
> > > >
> > > No. it is not the limitation it is just the way it does not need complex SVA to
> > split the device for unrelated usage.
> >
> > How can you limit the user in the guest to not use vSVA?
> >
> He he, I am not limiting, again misunderstanding or wrong attribution.
> I explained that hypervisor for passthrough does not need SVA.
> Guest can do anything it wants from the guest OS with the member device.

OK, so the point still stands; see above.

>
> > >
> > > > > Each passthrough device has PASID from its own space fully managed
> > > > > by the
> > > > guest.
> > > > > Some CPUs required vPASID, and SIOV is not going this way anymore.
> > > >
> > > > Then how to migrate? Invent a full set of something else through
> > > > another giant series like this to migrate to the SIOV thing? That's a mess for
> > sure.
> > > >
> > > SIOV will for sure reuse most or all parts of this work, almost entirely as_is.
> > > vPASID is cpu/platform specific things not part of the SIOV devices.
> > >
> > > > >
> > > > > >
> > > > > > > If at all it is done, it will be done from the guest by the
> > > > > > > driver using virtio
> > > > > > interface.
> > > > > >
> > > > > > Then you need to trap. Such things couldn't be passed through to
> > > > > > guests
> > > > directly.
> > > > > >
> > > > > Only PASID capability is trapped. PASID allocation and usage is
> > > > > directly from
> > > > guest.
> > > >
> > > > How can you achieve this? Assigning a PASID to a device is
> > > > completely
> > > > device(virtio) specific. How can you use a general layer without the
> > > > knowledge of virtio to trap that?
> > > When one wants to map vPASID to pPASID a platform needs to be involved.
> >
> > I'm not talking about how to map vPASID to pPASID, it's out of the scope of
> > virtio. I'm talking about assigning a vPASID to a specific virtqueue or other virtio
> > function in the guest.
> >
> That can be done in the guest. The key is guest wont know that it is dealing with vPASID.
> It will follow the same principle from your paper of equivalency, where virtio software layer will assign PASID to VQ and communicate to device.
>
> Anyway, all of this just digression from current series.

It's not a digression: you mentioned that only MSI-X is trapped, so I am giving you another example.

>
> > You need a virtio specific queue or capability to assign a PASID to a specific
> > virtqueue, and that can't be done without trapping and without virtio specific
> > knowledge.
> >
> I disagree. PASID assignment to a virtqueue, done in future by the guest virtio driver towards the device, is a uniform method.
> Whether its PF assigning PASID to VQ of self,
> Or
> VF driver in the guest assigning PASID to VQ.
>
> All same.
> Only IOMMU layer hypercalls will know how to deal with PASID assignment at platform layer to setup the domain etc table.
>
> And this is way beyond our device migration discussion.
> By any means, if you were implying that vq-to-PASID assignment _may_ need trap+emulation, and hence that the whole device migration must depend on some trap+emulation, then I surely do not agree.

See above.

>
> PASID equivalent in mlx5 world is ODP_MR+PD isolating the guest process and all of that just works on efficiency and equivalence principle already for a decade now without any trap+emulation.
>
> > > When virtio passthrough device is in guest, it has all its PASID accessible.
> > >
> > > All these is large deviation from current discussion of this series, so I will keep
> > it short.
> > >
> > > >
> > > > > Regardless it is not relevant to passthrough mode as PASID is yet
> > > > > another
> > > > resource.
> > > > > And for some cpu if it is trapped, it is generic layer, that does
> > > > > not require virtio
> > > > involvement.
> > > > > So virtio interface asking to trap something because generic
> > > > > facility has done
> > > > in not the approach.
> > > >
> > > > This misses the point of PASID. How to use PASID is totally device specific.
> > > Sure, and how to virtualize vPASID/pPASID is platform specific as single PASID
> > can be used by multiple devices and process.
> >
> > See above, I think we're talking about different things.
> >
> > >
> > > >
> > > > >
> > > > > > > Capabilities of #2 is generic across all pci devices, so it
> > > > > > > will be handled by the
> > > > > > HV.
> > > > > > > ATS/PRI cap is also generic manner handled by the HV and PCI device.
> > > > > >
> > > > > > No, ATS/PRI requires the cooperation from the vIOMMU. You can
> > > > > > simply do ATS/PRI passthrough but with an emulated vIOMMU.
> > > > > And that is not the reason for virtio device to build
> > > > > trap+emulation for
> > > > passthrough member devices.
> > > >
> > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > PRI requests arrive on the PF for the VF.
> >
> > Shouldn't it arrive at platform IOMMU first? The path should be PRI -> RC ->
> > IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest IOMMU.
> >
> Above sequence seems right.
>
> > And things will be more complicated when (v)PASID is used. So you can't simply
> > let PRI go directly to the guest with the current architecture.
> >
> In current architecture of the pci VF, PRI does not go directly to the guest.
> (and that is not reason to trap and emulate other things).

Ok, so beyond MSI-X we need to trap PRI, and we will probably trap
other things in the future like PASID assignment.

>
> > >
> > > > how can you pass
> > > > through a hardware PRI request to a guest directly without trapping it then?
> > > > What's more, PCIE allows the PRI to be done in a vendor (virtio)
> > > > specific way, so you want to break this rule? Or you want to blacklist ATS/PRI
> > for virtio?
> > > >
> > > I was aware of only pci-sig way of PRI.
> > > Do you have a reference to the ECN that enables vendor specific way of PRI? I
> > would like to read it.
> >
> > I mean it doesn't forbid us to build a virtio specific interface for I/O page fault
> > report and recovery.
> >
> So PRI of PCI does not allow. It is ODP kind of technique you meant above.
> Yes one can build.
> Ok. unrelated to device migration, so I will park this good discussion for later.

That's fine.

>
> > > This will be very good to eliminate IOMMU PRI limitations.
> >
> > Probably.
> >
> > > PRI will directly go to the guest driver, and guest would interact with IOMMU
> > to service the paging request through IOMMU APIs.
> >
> > With PASID, it can't go directly.
> >
> When the request consist of PASID in it, it can.
> But again these PCI-SIG extensions of PASID are not related to device migration, so I am deferring it.
>
> > > For PRI in vendor specific way needs a separate discussion. It is not related to
> > live migration.
> >
> > PRI itself is not related. But the point is, you can't simply pass through ATS/PRI
> > now.
> >
> Ah ok. the whole 4K PCI config space where ATS/PRI capabilities are located are trapped+emulated by hypervisor.
> So?
> So do we start emulating virtio interfaces too for passthrough?
> No.
> Can one still continue to trap+emulate?
> Sure why not?

Then let's not limit your proposal to "passthrough" use only?
I've shown you that

1) you can't easily say you can pass through all the virtio facilities
2) how ambiguous terminology like "passthrough" is

Thanks

>
> Can one use AQ of this proposal to do so?
> Sure, why not?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30  4:27                                                                   ` Parav Pandit
@ 2023-10-31  1:36                                                                     ` Jason Wang
  2023-10-31  5:17                                                                       ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-10-31  1:36 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Mon, Oct 30, 2023 at 12:28 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, October 30, 2023 9:35 AM
> >
> > On 2023/10/26 11:50, Parav Pandit wrote:
> > >> From: virtio-comment@lists.oasis-open.org
> > >> <virtio-comment@lists.oasis- open.org> On Behalf Of Jason Wang For
> > >> example, you still haven't succeeded in defining passthrough.
> > > It was defined on 19th Oct in [1].
> > > What part is not clear to you in definition of passthrough device?
> > >
> > > [1]
> > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> >
> >
> > Let me copy-paste it again:
> >
> > For example, assuming you are correct, you still fail to explain
> >
> > 1) what is trapped and what's not, or what's the boundary
> Passthrough definition was replied few times.
> One of them is here, https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> I don’t know what you mean by 'explain'. What do you want to be explained?
> What is trapped is listed in https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> What is not trapped is also listed in https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> So what more do you want to explain in there?

You explained that MSI-X is trapped but not the others. People may
want to know why, or what the boundary is for choosing to trap or not.

>
> > 2) if the hypervisor is not developed with those assumptions, things can work
> What to explain in #2. :)
> Things can expand when such hypervisor is born.

So the point is still to make your proposal useful in more use cases.

That's it.

Thanks



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-31  1:36                                                                     ` Jason Wang
@ 2023-10-31  5:17                                                                       ` Parav Pandit
  2023-11-01  0:33                                                                         ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-31  5:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Jason Wang
> Sent: Tuesday, October 31, 2023 7:07 AM
> 
> On Mon, Oct 30, 2023 at 12:28 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, October 30, 2023 9:35 AM
> > >
> > > On 2023/10/26 11:50, Parav Pandit wrote:
> > > >> From: virtio-comment@lists.oasis-open.org
> > > >> <virtio-comment@lists.oasis- open.org> On Behalf Of Jason Wang
> > > >> For example, you still haven't succeeded in defining passthrough.
> > > > It was defined on 19th Oct in [1].
> > > > What part is not clear to you in definition of passthrough device?
> > > >
> > > > [1]
> > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > >
> > >
> > > Let me copy-paste it again:
> > >
> > > For example, assuming you are correct, you still fail to explain
> > >
> > > 1) what is trapped and what's not, or what's the boundary
> > Passthrough definition was replied few times.
> > One of them is here,
> > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > I don’t know what you mean by 'explain'. What do you want to be explained?
> > What is trapped is listed in
> > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > What is not trapped is also listed in
> > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > So what more do you want to explain in there?
> 
> You explained that MSI-X is trapped but not the others. People may know why.
> or what's the boundary to choose to trap or not.
> 
If a platform can support it without trapping, trapping can be avoided as well, and that can be added in the future.

> >
> > > 2) if the hypervisor is not developed with those assumptions, things
> > > can work
> > What to explain in #2. :)
> > Things can expand when such hypervisor is born.
> 
> So the point is still, to make your proposal to be useful in more use cases.
>
When a use case arises, the device context can be expanded.
There is no point in defining things that no one implements or that are not present in any hypervisor.
The infrastructure is extensible, so the spec is covered for it.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-31  1:34                                                               ` Jason Wang
@ 2023-10-31  5:30                                                                 ` Parav Pandit
  2023-11-01  0:33                                                                   ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-10-31  5:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 31, 2023 7:05 AM
> 
> On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: virtio-comment@lists.oasis-open.org
> > > <virtio-comment@lists.oasis- open.org> On Behalf Of Jason Wang
> > >
> > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > >
> > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > >
> > > > > > > How do you know that?
> > > > > > Because for passthrough, the hypervisor is not involved in
> > > > > > dealing with VQ at
> > > > > all.
> > > > >
> > > > > Ok, so if I understand correctly, you are saying your design
> > > > > can't work for the case of PASID assignment.
> > > > >
> > > > No. PASID assignment will happen from the guest for its own use
> > > > and device
> > > migration will just work fine because device context will capture this.
> > >
> > > It's not about device context. We're discussing "passthrough", no?
> > >
> > Not sure, we are discussing same.
> > A member device is passthrough to the guest, dealing with its own PASIDs and
> virtio interface for some VQ assignment to PASID.
> > So VQ context captured by the hypervisor, will have some PASID attached to
> this VQ.
> > Device context will be updated.
> >
> > > You want all virtio stuff to be "passthrough", but assigning a PASID
> > > to a specific virtqueue in the guest must be trapped.
> > >
> > No. PASID assignment to a specific virtqueue in the guest must go directly
> from guest to device.
> 
> This works like setting CR3, you can't simply let it go from guest to host.
> 
> Host IOMMU driver needs to know the PASID to program the IO page tables
> correctly.
>
This will be done by the IOMMU.
 
> > When guest iommu may need to communicate anything for this PASID, it will
> come through its proper IOMMU channel/hypercall.
> 
> Let's say using PASID X for queue 0, this knowledge is beyond the IOMMU scope
> but belongs to virtio. Or please explain how it can work when it goes directly
> from guest to device.
> 
We have yet to see a spec for PASID to VQ assignment.
But OK, for the sake of theory, assume it is there.

The virtio driver in the guest will assign the PASID directly to the device using a create_vq(pasid=X) command.
The guest OS has already attached that PASID to the same process.
The whole PASID range is known to the hypervisor when the device is handed over to the guest VM.
So the PASID mapping is set up by the hypervisor IOMMU at that point.
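
As a rough illustration of the flow described above, here is a minimal sketch. It is not taken from any virtio spec: create_vq with a pasid argument is only the theoretical command mentioned above, and every class, field, and function name below is hypothetical. The sketch models only the position stated here (the PASID range is known at hand-over, the guest driver talks to the device directly, and the device context reports the binding); whether the host IOMMU driver needs more than that is the point under dispute in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class MemberDevice:
    """Toy model of a passthrough member device (all names hypothetical)."""
    pasid_range: range                       # range known at hand-over time
    vq_pasid: dict = field(default_factory=dict)

    def create_vq(self, vq_index: int, pasid: int) -> None:
        # Issued by the guest driver directly; no hypervisor trap is modeled.
        if pasid not in self.pasid_range:
            raise ValueError(f"PASID {pasid} outside the granted range")
        self.vq_pasid[vq_index] = pasid

    def read_device_context(self) -> dict:
        # What an owner driver could read back when migrating the device.
        return {"vq_pasid": dict(self.vq_pasid)}


def hand_over_to_guest(first_pasid: int, count: int) -> MemberDevice:
    # Hypervisor side: the whole PASID range is known here, so IOMMU
    # mappings for it could be set up at this point (not modeled).
    return MemberDevice(pasid_range=range(first_pasid, first_pasid + count))


dev = hand_over_to_guest(first_pasid=0, count=16)
dev.create_vq(vq_index=0, pasid=5)
context = dev.read_device_context()
```

In this toy model the VQ-to-PASID binding shows up in the device context without the hypervisor having seen the create_vq call, which is the claim being made; it says nothing about how a real IOMMU driver would program its page tables.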

> > Virtio device is not the conduit for this exchange.
> >
> > > >
> > > > > >
> > > > > > > There are works ongoing to make vPASID work for the guest like
> vSVA.
> > > > > > > Virtio doesn't differ from other devices.
> > > > > > Passthrough do not run like SVA.
> > > > >
> > > > > Great, you find another limitation of "passthrough" by yourself.
> > > > >
> > > > No. it is not the limitation it is just the way it does not need
> > > > complex SVA to
> > > split the device for unrelated usage.
> > >
> > > How can you limit the user in the guest to not use vSVA?
> > >
> > He he, I am not limiting, again misunderstanding or wrong attribution.
> > I explained that hypervisor for passthrough does not need SVA.
> > Guest can do anything it wants from the guest OS with the member device.
> 
> Ok, so the point stills, see above.

I don’t think so. The guest owns its PASID space and communicates it directly to the device, like any other device attribute.

> 
> >
> > > >
> > > > > > Each passthrough device has PASID from its own space fully
> > > > > > managed by the
> > > > > guest.
> > > > > > Some cpu required vPASID and SIOV is not going this way anmore.
> > > > >
> > > > > Then how to migrate? Invent a full set of something else through
> > > > > another giant series like this to migrate to the SIOV thing?
> > > > > That's a mess for
> > > sure.
> > > > >
> > > > SIOV will for sure reuse most or all parts of this work, almost entirely as_is.
> > > > vPASID is cpu/platform specific things not part of the SIOV devices.
> > > >
> > > > > >
> > > > > > >
> > > > > > > > If at all it is done, it will be done from the guest by
> > > > > > > > the driver using virtio
> > > > > > > interface.
> > > > > > >
> > > > > > > Then you need to trap. Such things couldn't be passed
> > > > > > > through to guests
> > > > > directly.
> > > > > > >
> > > > > > Only PASID capability is trapped. PASID allocation and usage
> > > > > > is directly from
> > > > > guest.
> > > > >
> > > > > How can you achieve this? Assigning a PAISD to a device is
> > > > > completely
> > > > > device(virtio) specific. How can you use a general layer without
> > > > > the knowledge of virtio to trap that?
> > > > When one wants to map vPASID to pPASID a platform needs to be
> involved.
> > >
> > > I'm not talking about how to map vPASID to pPASID, it's out of the
> > > scope of virtio. I'm talking about assigning a vPASID to a specific
> > > virtqueue or other virtio function in the guest.
> > >
> > That can be done in the guest. The key is guest wont know that it is dealing
> with vPASID.
> > It will follow the same principle from your paper of equivalency, where virtio
> software layer will assign PASID to VQ and communicate to device.
> >
> > Anyway, all of this just digression from current series.
> 
> It's not, as you mention that only MSI-X is trapped, I give you another one.
> 
PASID access from the guest is to be handled fully by the guest IOMMU,
not by virtio devices.

> >
> > > You need a virtio specific queue or capability to assign a PASID to
> > > a specific virtqueue, and that can't be done without trapping and
> > > without virito specific knowledge.
> > >
> > I disagree. PASID assignment to a virqueue in future from guest virtio driver to
> device is uniform method.
> > Whether its PF assigning PASID to VQ of self, Or VF driver in the
> > guest assigning PASID to VQ.
> >
> > All same.
> > Only IOMMU layer hypercalls will know how to deal with PASID assignment at
> platform layer to setup the domain etc table.
> >
> > And this is way beyond our device migration discussion.
> > By any means, if you were implying that somehow vq to PASID assignment
> _may_ need trap+emulation, hence whole device migration to depend on some
> trap+emulation, than surely, than I do not agree to it.
> 
> See above.
>
Yeah, I disagree with such an implication.
 
> >
> > PASID equivalent in mlx5 world is ODP_MR+PD isolating the guest process and
> all of that just works on efficiency and equivalence principle already for a
> decade now without any trap+emulation.
> >
> > > > When virtio passthrough device is in guest, it has all its PASID accessible.
> > > >
> > > > All these is large deviation from current discussion of this
> > > > series, so I will keep
> > > it short.
> > > >
> > > > >
> > > > > > Regardless it is not relevant to passthrough mode as PASID is
> > > > > > yet another
> > > > > resource.
> > > > > > And for some cpu if it is trapped, it is generic layer, that
> > > > > > does not require virtio
> > > > > involvement.
> > > > > > So virtio interface asking to trap something because generic
> > > > > > facility has done
> > > > > in not the approach.
> > > > >
> > > > > This misses the point of PASID. How to use PASID is totally device
> specific.
> > > > Sure, and how to virtualize vPASID/pPASID is platform specific as
> > > > single PASID
> > > can be used by multiple devices and process.
> > >
> > > See above, I think we're talking about different things.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > > Capabilities of #2 is generic across all pci devices, so
> > > > > > > > it will be handled by the
> > > > > > > HV.
> > > > > > > > ATS/PRI cap is also generic manner handled by the HV and PCI
> device.
> > > > > > >
> > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU. You
> > > > > > > can simply do ATS/PRI passthrough but with an emulated vIOMMU.
> > > > > > And that is not the reason for virtio device to build
> > > > > > trap+emulation for
> > > > > passthrough member devices.
> > > > >
> > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > PRI requests arrive on the PF for the VF.
> > >
> > > Shouldn't it arrive at platform IOMMU first? The path should be PRI
> > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest IOMMU.
> > >
> > Above sequence seems right.
> >
> > > And things will be more complicated when (v)PASID is used. So you
> > > can't simply let PRI go directly to the guest with the current architecture.
> > >
> > In current architecture of the pci VF, PRI does not go directly to the guest.
> > (and that is not reason to trap and emulate other things).
> 
> Ok, so beyond MSI-X we need to trap PRI, and we will probably trap other
> things in the future like PASID assignment.
PRI etc. all belong to the generic PCI 4K config space region.
Trap+emulation is done there in a generic manner, without involving virtio or other device types.
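
To make that split concrete, here is a minimal sketch, assuming a hypervisor that classifies guest accesses by region. The region labels and the route_access helper are hypothetical; they only mirror what this thread says is trapped (the 4K PCI config space, which holds the ATS/PRI capabilities, and MSI-X) versus what reaches the device directly (virtio common config, device config, cvq, data vqs).

```python
from enum import Enum


class Route(Enum):
    TRAP_AND_EMULATE = "handled by the generic PCI layer in the hypervisor"
    PASSTHROUGH = "guest driver talks to the device directly"


# Regions the generic PCI layer intercepts, per this thread.
# The string labels are hypothetical.
GENERIC_PCI_REGIONS = {"pci_config_space", "msix_table"}


def route_access(region: str) -> Route:
    """Classify a guest access using only generic PCI knowledge."""
    if region in GENERIC_PCI_REGIONS:
        return Route.TRAP_AND_EMULATE
    # Anything else (virtio common cfg, device cfg, cvq, data vqs)
    # is not interpreted by the hypervisor in this model.
    return Route.PASSTHROUGH
```

The check consults only a list of generic PCI regions, so the classifier needs no device-type knowledge; that is the property being argued for here, not a description of any existing hypervisor.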

> 
> >
> > > >
> > > > > how can you pass
> > > > > through a hardware PRI request to a guest directly without trapping it
> then?
> > > > > What's more, PCIE allows the PRI to be done in a vendor (virtio)
> > > > > specific way, so you want to break this rule? Or you want to
> > > > > blacklist ATS/PRI
> > > for virtio?
> > > > >
> > > > I was aware of only pci-sig way of PRI.
> > > > Do you have a reference to the ECN that enables vendor specific
> > > > way of PRI? I
> > > would like to read it.
> > >
> > > I mean it doesn't forbid us to build a virtio specific interface for
> > > I/O page fault report and recovery.
> > >
> > So PRI of PCI does not allow. It is ODP kind of technique you meant above.
> > Yes one can build.
> > Ok. unrelated to device migration, so I will park this good discussion for later.
> 
> That's fine.
> 
> >
> > > > This will be very good to eliminate IOMMU PRI limitations.
> > >
> > > Probably.
> > >
> > > > PRI will directly go to the guest driver, and guest would interact
> > > > with IOMMU
> > > to service the paging request through IOMMU APIs.
> > >
> > > With PASID, it can't go directly.
> > >
> > When the request consist of PASID in it, it can.
> > But again these PCI-SIG extensions of PASID are not related to device
> migration, so I am deferring it.
> >
> > > > For PRI in vendor specific way needs a separate discussion. It is
> > > > not related to
> > > live migration.
> > >
> > > PRI itself is not related. But the point is, you can't simply pass
> > > through ATS/PRI now.
> > >
> > Ah ok. the whole 4K PCI config space where ATS/PRI capabilities are located
> are trapped+emulated by hypervisor.
> > So?
> > So do we start emulating virtio interfaces too for passthrough?
> > No.
> > Can one still continue to trap+emulate?
> > Sure why not?
> 
> Then let's not limit your proposal to be used by "passthrough" only?
One can possibly build some variant of the existing virtio member device using the same owner and member scheme.
If some admin commands are missing for that, maybe one can add them.
There is no need to step on the toes of other use cases, as they are different...

> I've shown you that
> 
> 1) you can't easily say you can pass through all the virtio facilities
> 2) how ambiguous for terminology like "passthrough"
>
It is not; it is well defined in v3 and v2.
One can continue to argue, keep defining the variant, still call it data path acceleration, and then claim it as passthrough ...
But I won't debate this anymore, as it is a non-technical aspect of least interest.
We have technical tasks and improved specs to work on going forward.
I am working on an extension for device-specific contexts to enrich it.

> Thanks
> 
> >
> > Can one use AQ of this proposal to do so?
> > Sure, why not?


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30 11:34                                                                                   ` Michael S. Tsirkin
  2023-10-30 12:02                                                                                     ` Parav Pandit
@ 2023-10-31  9:35                                                                                     ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-31  9:35 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/30/2023 7:34 PM, Michael S. Tsirkin wrote:
> On Mon, Oct 30, 2023 at 10:23:14AM +0000, Parav Pandit wrote:
>>> And the helper driver could be considered as a part of the hypervisor, or the
>>> guest vCPU can not access the host side devices.
>>>
>>> For example, the path is hw-->vfio_pci-->qemu-->guest. IT IS NOT hw-->guest.
> I think above makes sense. Nvidia decided to standardize on VFIO and
> that's ok, but there's no point in calling specifically VFIO "true
> passthrough" or whatever the marketing term du jour is. This is
> not a VFIO TC here. I do wish one of the sides in this discussion
> stopped promoting their architecture and the one true way and
> tried to actually build interfaces addressing multiple architectures,
> though. Otherwise we'll keep getting stuck.
I agree, pass-through is clear and I don't think this is worth arguing.
>
>> In virtio spec we only talk about the driver, and the device in context of passthrough device.
>> So for virtio common config, dev config, cvq, data vqs are guest driver -> device.
>> There is no other entity inbetween.
> No longer true - with admin commands we have 2 devices: owner and
> member, and each has its own driver.
>



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30 10:23                                                                                 ` Parav Pandit
  2023-10-30 11:34                                                                                   ` Michael S. Tsirkin
@ 2023-10-31  9:42                                                                                   ` Zhu, Lingshan
  2023-10-31 10:14                                                                                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-31  9:42 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/30/2023 6:23 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, October 30, 2023 3:33 PM
>>
>> On 10/30/2023 12:17 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, October 30, 2023 9:15 AM
>>>>
>>>> On 10/26/2023 3:04 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Thursday, October 26, 2023 12:14 PM
>>>>>>
>>>>>>
>>>>>> On 10/24/2023 6:37 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Tuesday, October 24, 2023 4:00 PM
>>>>>>>>
>>>>>>>> On 10/23/2023 6:14 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Monday, October 23, 2023 3:39 PM
>>>>>>>>>>
>>>>>>>>>> On 10/20/2023 8:54 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> Sent: Friday, October 20, 2023 3:01 PM
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/19/2023 6:33 PM, Parav Pandit wrote:
>>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>> Sent: Thursday, October 19, 2023 2:48 PM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/19/2023 5:14 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>> On Thu, Oct 19, 2023 at 09:13:16AM +0000, Parav Pandit
>> wrote:
>>>>>>>>>>>>>>>>> Oh, really? Quite interesting, do you want to move all
>>>>>>>>>>>>>>>>> config space fields in VF to admin vq? Have a plan?
>>>>>>>>>>>>>>>> Not in my plan for spec 1.4 time frame.
>>>>>>>>>>>>>>>> I do not want to divert the discussion, would like to
>>>>>>>>>>>>>>>> focus on device
>>>>>>>>>>>>>> migration phases.
>>>>>>>>>>>>>>>> Lets please discuss in some other dedicated thread.
>>>>>>>>>>>>>>> Possibly, if there's a way to send admin commands to vf
>>>>>>>>>>>>>>> itself then Lingshan will be happy?
>>>>>>>>>>>>>> still need to prove why admin commands are better than
>> registers.
>>>>>>>>>>>>> Virtio spec development is not proof based approach. Please
>>>>>>>>>>>>> stop asking for
>>>>>>>>>> it.
>>>>>>>>>>>>> I tried my best to have technical answer in [1].
>>>>>>>>>>>>> I explained that registers simply do not work for
>>>>>>>>>>>>> passthrough mode (if this is what you are asking when you
>>>>>>>>>>>>> are asking prove its
>>>>>> better).
>>>>>>>>>>>>> They can work for non_passthrough mediated mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A member device may do admin commands using registers.
>>>>>>>>>>>>> Michael and I are
>>>>>>>>>>>> discussing presently in the same thread.
>>>>>>>>>>>>> Since there are multiple things to be done for device
>>>>>>>>>>>>> migration, dedicated
>>>>>>>>>>>> register set for each functionality do not scale well, hard
>>>>>>>>>>>> to maintain and extend.
>>>>>>>>>>>>> A register holding a command content make sense.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now, with that, if this can be useful only for
>>>>>>>>>>>>> non_passthrough, I made humble
>>>>>>>>>>>> request to transport them using AQ, this way, you get all
>>>>>>>>>>>> benefits of
>>>> AQ.
>>>>>>>>>>>>> And trying to understand, why AQ cannot possible or inferior?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you have commands like suspend/resume device, register or
>>>>>>>>>>>>> queue
>>>>>>>>>>>> transport simply don’t work, because it's wrong to bifurcate
>>>>>>>>>>>> the device with such weird API.
>>>>>>>>>>>>> If you want to biferacate for mediation software, it
>>>>>>>>>>>>> probably makes sense to
>>>>>>>>>>>> operate at each VQ level, config space level. Such are very
>>>>>>>>>>>> different commands than passthrough.
>>>>>>>>>>>>> I think vdpa has demonstrated that very well on how to do
>>>>>>>>>>>>> specific work for
>>>>>>>>>>>> specific device type. So some of those work can be done using AQ.
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://lore.kernel.org/virtio-comment/870ace02-f99c-4582-932f-bd103362dae9@intel.com/T/#m37743aa924536d0256d6b3b8e83a11c750f28794
>>>>>>>>>>>> We have been through your statement for many times.
>>>>>>>>>>>> This is not about how many times you repeated, if you think
>>>>>>>>>>>> this is true, you need to prove that with solid evidence.
>>>>>>>>>>>>
>>>>>>>>>>> I will not respond to this comment anymore.
>>>>>>>>>> Ok if you choose not to respond.
>>>>>>>>>>>> For pass-through, I still recommend you to take a reference
>>>>>>>>>>>> of current virito-pci implementation, it works for pass-through,
>> right?
>>>>>>>>>>> What do you mean by current virtio-pci implementation?
>>>>>>>>>> current virito-pci works for pass-through
>>>>>>>>> I still don’t understand what is "current virtio-pci".
>>>>>>>>> Do you mean qemu implementation of emulated virtio-pci or you
>>>>>>>>> mean
>>>>>>>> virtio-pci specification for passthrough?
>>>>>>>>> What do you want me to refer to for passthrough? Please clarify.
>>>>>>>> you know guest vcpu and its vRC can not access host side devices,
>>>>>>>> and there must be a driver helping the pass-through use cases,
>>>>>>>> like vDPA and vfio
>>>>>>> I am not sure how to corelate this answer to the question of
>>>>>>> "virtio-pci for
>>>>>> passthrough".
>>>>>>> :(
>>>>>>>
>>>>>>> Today when a virtio-pci member device is passthrough to the guest
>>>>>>> VM,
>>>>>> hypervisor is not involved in virtio interface such as config
>>>>>> space, cvq, data vq etc.
>>>>>>> Do you agree?
>>>>> You didn’t respond yet to this question.
>>>>> Can you please respond?
>>>> Not sure which question you refer to that not answered, Agree what?
>>> What is listed above.
>>>
>>>> Please don't cut off the thread until the issue closed.
>>>>
>>> I didn’t cut off the thread. Please check your email client.
>>>
>>>> If you are asking whether hypervisor is involved in accessing virtio
>> interfaces.
>>>> For passthrough, guest needs a host side help driver to access
>>>> hardware, and explained below.
>>> Not an accurate answer. Please answer above.
>>> Repeating the question again.
>>> For passthrough device virtio interfaces such as common and device config
>> space, cvq, data vqs, are NOT accessed by the hypervisor.
>>> Do you agree?
>> Did you failed to process the answer?
>>
> Yes, there was no answer the above simple question.
> You did counter question.
>
>> Let me repeat again, for the last time.
>>
>> The guest can not access any host side devices without a "pass-through helper
>> driver".
> He he, now you generalize it as "guest" when the question came to discuss the exact definition of passthrough.
> Ofcourse there is helper driver for pci config space.
> But you took that granted saying X trap+emulated so X+Y is also trap+emulated.
>
>> And the helper driver could be considered as a part of the hypervisor, or the
>> guest vCPU can not access the host side devices.
>>
>> For example, the path is hw-->vfio_pci-->qemu-->guest. IT IS NOT hw-->guest.
>>
> In the virtio spec we only talk about the driver and the device in the context of a passthrough device.
> So virtio common config, dev config, cvq, and data vqs are guest driver -> device.
> There is no other entity in between.
>
>> You can try to take a look at how virtio-pci works for QEMU.
>>
> When I asked you what is virtio-pci in context of passthrough, you couldn’t answer what that component is.
well, pci is quite clear here to be a transport.

To all of the above questions: as MST suggested, we agree the definition and
use case of "pass-through" are clear to us,
so I will stop repeating the answers. If you still fail to understand,
sorry, I cannot help.
>
>> If you fail to understand this, then I don't see any need to discuss this
>> topic anymore.
> The discussion is about passthrough. 😊
> You are talking about some vpci composition.
see above
>
>>>
>>>>>> Can vCPU access host side device config space? It needs a
>>>>>> pass-through helper driver like vfio, right?
>>>>> Right.
>>>>> And if you are implying that, because generic pci config space is
>>>>> intercepted
>>>> hence, all virtio common and device specific things MUST BE ALWAYS
>>>> intercepted as well.
>>>>> Then I do not agree with such derivation.
>>>> This is not only virtio, guest needs a helper to access host devices.
>>> Does not make sense until you reply above.
>> see above
> Still does not make sense with your implied reply.
>
>>>>> The main reasons are:
>>>>> 1. It breaks the future TDISP model
>>>> Not sure why you bring up TDISP again, I thought we agreed this is closed.
>>>>
>>> Not to include TDISP in current spec, but the mechanism/infrastructure built
>> applies to the future mode as well.
>> then discuss in future.
>>>> How does it break TDISP? Can you let the guest driver access a host device
>>>> without a host side helper like VFIO?
>>> Yes, once it is passthrough virtio interface is a secure channel.
>>> In TDISP config space is still communicated via hypervisor and it contains all
>> the data that is not critical.
>>> Hence, there must not be any virtio registers to place in there.
>>> In future if one discovers config space as problematic, one will find generic
>> solution for all the pci devices, not just virtio.
>> interesting, how does guest vCPU access host side devices without a helper
>> driver, even a secure channel?
> A helper driver maps the PCI memory of passthrough device to the guest VM for direct access.
> And locks the TDISP so that hypervisor cannot change this mapping in the future.
>
> After this, the control plane driver is no longer in the picture.
>
>> So do you mean even queue_enable should not be there??? really interesting.
>>
> 100% yes, it must not be there.
>
>>>> And TDISP says you should not trust PF, thus you should not use admin
>>>> vq on PF for live migration.
>>> There are few options which will evolve.
>>> 1. PF will be handed over to the TVM instead of hypervisor
>>> 2. PF aq communication will be encrypted hence not visible to
>>>    hypervisor, also supported by PCI-SIG already
>>> 3. Some other options
>>> Since this is the generic solution across virtio and non_virtio, we
>>> can rely on wider wisdom of PCI-SIG.
>> TDISP says don't trust PF anyway.
> Ok. this is why it will be encrypted or will be dedicated to a live migration portion of the device.
>
>>>>> 2. Without hypervisor getting involved, all the member device MMIO
>>>>> space is accessible which follows the efficiency and equivalency
>>>>> principle of Jason listed paper
>>>>>
>>>>> I hope you are not implying to trap+emulate virtio interfaces (which
>>>>> is not
>>>> listed in the pci-spec) in hypervisor for member passthrough devices.
>>>> Do you agree mmap the bars(interfaces) without doing anything is also
>>>> a type of "trap and emulate"?
>>> Certainly not.
>>> Memory mapping enables guest to _directly_ communicate with the device
>> without any VMEXITS.
>>> In TDISP world this is also even secured already.
>>> So no, it is not trap and emulate.
>> Interesting. do you know virtualization is built on "trap and emulate"?
>> and pass-through is a special case of "trap and emulate"
>>
> Huh, there is no such definition like that.
> And not relevant here anyway, I am not going to discuss anything like this which is outside the scope of this discussion.
>
> As I repeatedly said, you continue to think that trap+emulation is the _only_ way to make progress for member devices.
> And that is already out of the question, as the TC has already long ago embraced the rest of the industry to have hw based devices.
> SR-IOV member based devices are the first of their kind.
see above.
>
>> If you want to discuss TDISP, then TDISP secured device only accepts TLP from
>> the owner, that means it only supports the special case, and that is a limitation
>> of TDISP.
>>
>> But generally speaking, you can always choose to trap and emulate any fields in
>> the bar.
> Generally yes.
> For passthrough, there is no such need of extra layers when the member device already has it.
see above.
>
>> Why is TDISP related to current live migration proposal?
>> Why we are discussing this?
> The scheme proposed here aligns with the future where trap+emulation of bifurcating the device is not there.
well, then let's talk in the future when it comes true.
>
>>>>>>>>>>>> For scale, I already told you many times that they are
>>>>>>>>>>>> per-device facilities. How can a per-device facility not scale?
>>>>>>>>>>> Each VF device must implement new set of on-chip memory-based
>>>>>>>>>>> registers
>>>>>>>>>> which demands more power, die area which does not scale
>>>>>>>>>> efficiently to thousands of VFs.
>>>>>>>>>> that can be fpga gates or SOC implementing new features, you
>>>>>>>>>> think that is a waste?
>>>>>>>>> It is waste in hw, if there is a better approach possible to not
>>>>>>>>> burn them as
>>>>>>>> gates and save on resources for rarely used items.
>>>>>>>> Is a new entry in MSIX table a waste of HW?
>>>>>>> Not as much as existing MSI-X table entries which requires linear
>>>>>>> amount of
>>>>>> on-chip memory.
>>>>>> anyway, even only one MSIX entry costs more HW resource than the
>>>>>> amount of new registers in my proposal.
>>>>> Yes, this is why new MSI-X proposals are on table to improve, the
>>>>> first known
>>>> approach to me was from Intel using IMS.
>>>>> Hence, virtio already learnt this, as seen in the Appendix, to not keep
>>>>> adding non init
>>>> time registers.
>>>> non-sense to me, IMS still uses MSI
>>> Clearly not.
>>> Maybe you missed something.
>>>
>>> IMS enables one to use non-registers for the interrupt store, unlike MSI/MSI-X.
>>> Please see the commit log comment, snippet here about "queue memory".
>>>
>>>          - The interrupt chip must provide the following optional callbacks
>>>            when the irq_mask(), irq_unmask() and irq_write_msi_msg() callbacks
>>>            cannot operate directly on hardware, e.g. in the case that the
>>>            interrupt message store is in queue memory:
>>>
>>> IRQ chips callback irq_write_msi_msg() has no such limitation to store in
>> registers.
>> Please re-read my answer, I said: IMS uses MSI, I didn't say re-use PCI msi
>> entries.
> Your answer is not relevant to this discussion at all.
> Why?
> Because we were discussing the schemes where registers are not used.
> One example of that was IMS. It does not matter MSI or MSIX.
> As explained in Intel's commit message, the key to focus for IMS is "queue memory" not some hw register like MSI or MSI-X.
you know the device always needs to know an address and the data to send an
MSI, right?

If you think this is not relevant to the discussion, OK.
>
>>>>>>>> Can I say implementing admin vq in SOC is a waste of cores?
>>>>>>> Which cores in the SoC?
>>>>>>> If it is on the PF, there is only handful of AQs for scale of N VFs.
>>>>>> I see you got the point anyway, new features cost extra resource
>>>>>>>>>>>> vDPA works fine on config space.
>>>>>>>>>>>>
>>>>>>>>>>>> So, if you still insist admin vq is better than config space
>>>>>>>>>>>> like in other thread you have concluded, you may imply that
>>>>>>>>>>>> config space interfaces should be re-factored to admin vq.
>>>>>>>>>>> Whatever is done in past is done, there is no way to change history.
>>>>>>>>>>> Any new non init time registers should not be placed in device
>>>>>>>>>>> specific config
>>>>>>>>>> space as virtio spec has clear guideline on it for good.
>>>>>>>>>>> Device context reading, dirty page address reading, changing
>>>>>>>>>>> vf device modes,
>>>>>>>>>> all of these are clearly not init time settings.
>>>>>>>>>>> Hence, they do not belong to the registers.
>>>>>>>>>> reset vq? and you get it from Appendix B. Creating New Device
>>>>>>>>>> Types, are we implementing a new type of device???
>>>>>>>>> I don’t understand your question.
>>>>>>>>> I replied the history of reset_vq.
>>>>>>>>> Take good examples to follow, reset_vq clearly is not the one.
>>>>>>>> so again, we are not implementing new device type, so your
>>>>>>>> citation doesn't apply.
>>>>>>> I disagree.
>>>>>>> I am an engineer building practical systems considering limitations
>>>>>>> and also advancements of the transport; while listening to other
>>>>>>> industry efforts, I
>>>>>> am not from the legal department.
>>>>>>> Hence, Appendix B makes sense to me to apply to the existing
>>>>>>> device which
>>>>>> also has the section for "device improvements".
>>>>>> it is titled "new device", and I think this discussion is nonsense.
>>>>>> So if you want to fix this statement, works for me.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-30 11:27                                                                           ` Michael S. Tsirkin
  2023-10-30 11:48                                                                             ` Parav Pandit
@ 2023-10-31  9:45                                                                             ` Zhu, Lingshan
  1 sibling, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-10-31  9:45 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/30/2023 7:27 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 26, 2023 at 07:04:38AM +0000, Parav Pandit wrote:
>> 1. It breaks the future TDISP model
> I really think bringing in TDISP muddies the waters a lot and
> should be avoided. We simply won't know until someone does the
> legwork and proposes the necessary spec extensions.
> In particular current legacy access commands are I thn
>
>
>> 2. Without hypervisor getting involved, all the member device MMIO space is accessible which follows the efficiency and equivalency principle of Jason listed paper
>>
>> I hope you are not implying to trap+emulate virtio interfaces (which is not listed in the pci-spec) in hypervisor for member passthrough devices.
> I feel this discussion will keep meandering because the terminology is
> vague. There's no single thing that is called "passthrough" -
> vendors just build what is expedient with current hardware and
> software. Nvidia has a bunch of people working on vfio so they
> call that passthrough, Red Hat has people working on VDPA and
> they call that passthrough, etc.
>
>
> Before I mute this discussion for good, does anyone here have any
> feeling progress is made? What kind of progress?
I agree, please no future discussions of passthrough and TDISP. Really 
not helpful.

I really feel I am wasting time here explaining what pass-through is, the
use case, and
why TDISP is not related here. What is the point of talking about something
not present right now?

Just to keep the discussion open????

Please!!!!
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-31  9:42                                                                                   ` Zhu, Lingshan
@ 2023-10-31 10:14                                                                                     ` Michael S. Tsirkin
  2023-11-01  0:42                                                                                       ` Jason Wang
                                                                                                         ` (2 more replies)
  0 siblings, 3 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-10-31 10:14 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > Your answer is not relevant to this discussion at all.
> > Why?
> > Because we were discussing the schemes where registers are not used.
> > One example of that was IMS. It does not matter MSI or MSIX.
> > As explained in Intel's commit message, the key to focus for IMS is "queue memory" not some hw register like MSI or MSI-X.
> you know the device always needs to know an address and the data to send an
> MSI, right?

So if virtio is to use IMS then we'll need to add interfaces to program
IMS, I think. As part of that patch - it's reasonable to assume - we will
also need to add a way to retrieve IMS so it can be migrated.

However, what this example demonstrates is that the approach taken
by this proposal to migrate control path structures - namely, by
defining a structure used just for migration - means that we will
need to come up with a migration interface each time.
And that is unfortunate.

Compare to the trap and emulate approach for config space and we don't
need a new interface, we just make each field R/W.
So I feel this is something to think about, and address.
Ideas?

-- 
MST
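
The contrast drawn above can be sketched in a few lines of Python. This is
a minimal illustration under stated assumptions, not anything taken from
the virtio specification or from this series: Device, CONTEXT_RECORD_TYPES,
save_context, restore_context, save_rw and restore_rw are all hypothetical
names, "ims_entry" is a stand-in for some later feature, and the model only
captures the bookkeeping (which state is carried across), not real device
access.

```python
# Illustrative model only. All names are hypothetical; nothing here is
# taken from the virtio specification or from the patch series.
from dataclasses import dataclass, field


@dataclass
class Device:
    # Control-path state held by the device, keyed by field name.
    state: dict = field(default_factory=dict)


# Approach A: a structure defined just for migration. Only state that has
# an explicitly defined record type can be exported or imported.
CONTEXT_RECORD_TYPES = {"queue_size", "queue_enable"}


def save_context(dev):
    """Export the state that has a defined migration record type."""
    return [(k, v) for k, v in sorted(dev.state.items())
            if k in CONTEXT_RECORD_TYPES]


def restore_context(dev, records):
    """Import migration records, rejecting unknown record types."""
    for key, value in records:
        if key not in CONTEXT_RECORD_TYPES:
            raise ValueError(f"no migration record type for {key!r}")
        dev.state[key] = value


# Approach B: every field is readable and writable through the normal
# (trapped) interface, so save and restore need no per-feature definition.
def save_rw(dev):
    """Read back every field."""
    return dict(dev.state)


def restore_rw(dev, snapshot):
    """Write every field back."""
    dev.state.update(snapshot)


if __name__ == "__main__":
    # "ims_entry" stands in for a feature added after the record types
    # were defined.
    src = Device({"queue_size": 256, "queue_enable": 1, "ims_entry": 7})

    dst_a = Device()
    restore_context(dst_a, save_context(src))
    print(sorted(dst_a.state))

    dst_b = Device()
    restore_rw(dst_b, save_rw(src))
    print(sorted(dst_b.state))
```

In this toy model, approach A leaves "ims_entry" behind until a record type
is defined for it, which is the recurring per-feature cost raised above.
Approach B carries it without a new definition, but only because every
access is assumed to go through a trapped R/W interface.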


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-26  6:22                                                                             ` Michael S. Tsirkin
  2023-10-30  4:02                                                                               ` Jason Wang
@ 2023-11-01  0:33                                                                               ` Jason Wang
  1 sibling, 0 replies; 341+ messages in thread
From: Jason Wang @ 2023-11-01  0:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Oct 26, 2023 at 2:23 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Oct 26, 2023 at 08:56:47AM +0800, Jason Wang wrote:
> > > > We transfer data by DMA, the device writes DMA dirty pages information(bitmap)
> > > > to host isolated memory region.
> > > >
> > >
> > >
> > > If you do that then I don't see any reason not to use admin
> > > commands for that - either through a vq or a simpler
> > > interface.
> >
> > I think we need to agree that admin commands are the only interface
> > for any future features before we can have an agreement here.
>
> I don't think that needs to be the case. I do think that if
> your goal is a separate channel from normal device operation
> then this is what admin commands have been designed for.

Sure, but the question is: if we want to have a new feature X, should
it only go with admin commands or not?

>
> > My understanding is that it is optional for the transport that
> > requires administrative commands like provisioning etc. It is not
> > necessarily the interface for new features.
>
> Yes. And migration is IMO sufficiently "like provisioning".

Well, I basically mean provisioning a virtio device while the proposal
here is to provision state which has been done by the existing
transport.

Not sure if you've noticed or not, this proposal is actually a partial
implementation of a transport.

>
> > >
> > >
> > > >
> > > >         Config space interfaces are fundamental for virtio-pci.
> > > >
> > > >
> > > >     They are in fact fundamental to virtio. Multiple transports to
> > > >     use config space are also fundamental.
> > > >
> > > > I agree. So I also agree to build admin vq live migration solution based on our
> > > > basic facilities, as Jason once proposed.
> > >
> > >
> > > I'm not sure it's even a vq. I suggest a minimal interface to send
> > > admin commands. Could be used by migration, as transport, and more.
> > >
> >
> > It's better if we can do that below the layer of admin commands. For
> > example, we don't stick device status with any specific interface. We
> > can keep doing things like this.
> >
> > Thanks
>
> Could go either way, but complex functionality like live migration
> can benefit from a rich interface.

Ok.

Thanks


>
> --
> MST
>
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-31  5:17                                                                       ` Parav Pandit
@ 2023-11-01  0:33                                                                         ` Jason Wang
  2023-11-01  3:07                                                                           ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-01  0:33 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Oct 31, 2023 at 1:17 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Jason Wang
> > Sent: Tuesday, October 31, 2023 7:07 AM
> >
> > On Mon, Oct 30, 2023 at 12:28 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, October 30, 2023 9:35 AM
> > > >
> > > > On 2023/10/26 11:50, Parav Pandit wrote:
> > > > >> From: virtio-comment@lists.oasis-open.org
> > > > >> <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> > > > >> For example, you still haven't succeeded in defining passthrough.
> > > > > It was defined on 19th Oct in [1].
> > > > > What part is not clear to you in definition of passthrough device?
> > > > >
> > > > > [1]
> > > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > >
> > > >
> > > > Let me copy-paste it again:
> > > >
> > > > For example, assuming you are correct, you still fail to explain
> > > >
> > > > 1) what is trapped and what's not, or what's the boundary
> > > The passthrough definition was given in replies a few times.
> > > One of them is here,
> > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > I don’t know what you mean by 'explain'. What do you want to be explained?
> > > What is trapped is listed in
> > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > What is not trapped is also listed in
> > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > So what more do you want to explain in there?
> >
> > You explained that MSI-X is trapped but not the others. People may want to know why,
> > or what the boundary is for choosing to trap or not.
> >
> If a platform can support without trapping, it can be avoided as well and can be added in the future.

Who is going to do that synchronization?

>
> > >
> > > > 2) if the hypervisor is not developed with those assumptions, things
> > > > can work
> > > What to explain in #2. :)
> > > Things can expand when such hypervisor is born.
> >
> > So the point is still, to make your proposal to be useful in more use cases.
> >
> When a use case arises, device context can be expanded.

It's not device context.

> No point in making things no one implements or not present in hypervisor.
> The infrastructure is extendible so spec is covered for it.

It would be problematic if you stick to claiming "passthrough" when it is not.

Thanks




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-10-31  5:30                                                                 ` Parav Pandit
@ 2023-11-01  0:33                                                                   ` Jason Wang
  2023-11-01  3:31                                                                     ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-01  0:33 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 31, 2023 7:05 AM
> >
> > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: virtio-comment@lists.oasis-open.org
> > > > <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> > > >
> > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > >
> > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > >
> > > > > > > > How do you know that?
> > > > > > > Because for passthrough, the hypervisor is not involved in
> > > > > > > dealing with VQ at
> > > > > > all.
> > > > > >
> > > > > > Ok, so if I understand correctly, you are saying your design
> > > > > > can't work for the case of PASID assignment.
> > > > > >
> > > > > No. PASID assignment will happen from the guest for its own use
> > > > > and device
> > > > migration will just work fine because device context will capture this.
> > > >
> > > > It's not about device context. We're discussing "passthrough", no?
> > > >
> > > Not sure we are discussing the same thing.
> > > A member device is passthrough to the guest, dealing with its own PASIDs and
> > virtio interface for some VQ assignment to PASID.
> > > So VQ context captured by the hypervisor, will have some PASID attached to
> > this VQ.
> > > Device context will be updated.
> > >
> > > > You want all virtio stuff to be "passthrough", but assigning a PASID
> > > > to a specific virtqueue in the guest must be trapped.
> > > >
> > > No. PASID assignment to a specific virtqueue in the guest must go directly
> > from guest to device.
> >
> > This works like setting CR3, you can't simply let it go from guest to host.
> >
> > Host IOMMU driver needs to know the PASID to program the IO page tables
> > correctly.
> >
> This will be done by the IOMMU.
>
> > > When guest iommu may need to communicate anything for this PASID, it will
> > come through its proper IOMMU channel/hypercall.
> >
> > Let's say using PASID X for queue 0, this knowledge is beyond the IOMMU scope
> > but belongs to virtio. Or please explain how it can work when it goes directly
> > from guest to device.
> >
> We are yet to ever see spec for PASID to VQ assignment.

It has one.

> Ok, for theory's sake, say it is there.
>
> Virtio driver will assign the PASID directly from guest driver to device using a create_vq(pasid=X) command.
> The same process is somehow attached to the PASID by the guest OS.
> The whole PASID range is known to the hypervisor when the device is handed over to the guest VM.

How can it know?

> So PASID mapping is setup by the hypervisor IOMMU at this point.

You disallow the PASID to be virtualized here. What's more, such a
PASID passthrough has security implications.

Again, we are talking about different things, I've tried to show you
that there are cases that passthrough can't work but if you think the
only way for migration is to use passthrough in every case, you will
probably fail.

>
> > > Virtio device is not the conduit for this exchange.
> > >
> > > > >
> > > > > > >
> > > > > > > > There are works ongoing to make vPASID work for the guest like
> > vSVA.
> > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > Passthrough do not run like SVA.
> > > > > >
> > > > > > Great, you find another limitation of "passthrough" by yourself.
> > > > > >
> > > > > No. it is not the limitation it is just the way it does not need
> > > > > complex SVA to
> > > > split the device for unrelated usage.
> > > >
> > > > How can you limit the user in the guest to not use vSVA?
> > > >
> > > He he, I am not limiting, again misunderstanding or wrong attribution.
> > > I explained that hypervisor for passthrough does not need SVA.
> > > Guest can do anything it wants from the guest OS with the member device.
> >
> > Ok, so the point still stands, see above.
>
> I don’t think so. The guest owns its PASID space

Again, vPASID to PASID can't be done in hardware unless I missed some
recent features of IOMMUs.

> and directly communicates like any other device attribute.
>
> >
> > >
> > > > >
> > > > > > > Each passthrough device has PASID from its own space fully
> > > > > > > managed by the
> > > > > > guest.
> > > > > > > Some cpu required vPASID and SIOV is not going this way anymore.
> > > > > >
> > > > > > Then how to migrate? Invent a full set of something else through
> > > > > > another giant series like this to migrate to the SIOV thing?
> > > > > > That's a mess for
> > > > sure.
> > > > > >
> > > > > SIOV will for sure reuse most or all parts of this work, almost entirely as_is.
> > > > > vPASID is cpu/platform specific things not part of the SIOV devices.
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > If at all it is done, it will be done from the guest by
> > > > > > > > > the driver using virtio
> > > > > > > > interface.
> > > > > > > >
> > > > > > > > Then you need to trap. Such things couldn't be passed
> > > > > > > > through to guests
> > > > > > directly.
> > > > > > > >
> > > > > > > Only PASID capability is trapped. PASID allocation and usage
> > > > > > > is directly from
> > > > > > guest.
> > > > > >
> > > > > > How can you achieve this? Assigning a PASID to a device is
> > > > > > completely
> > > > > > device(virtio) specific. How can you use a general layer without
> > > > > > the knowledge of virtio to trap that?
> > > > > When one wants to map vPASID to pPASID a platform needs to be
> > involved.
> > > >
> > > > I'm not talking about how to map vPASID to pPASID, it's out of the
> > > > scope of virtio. I'm talking about assigning a vPASID to a specific
> > > > virtqueue or other virtio function in the guest.
> > > >
> > > That can be done in the guest. The key is guest won't know that it is dealing
> > with vPASID.
> > > It will follow the same principle from your paper of equivalency, where virtio
> > software layer will assign PASID to VQ and communicate to device.
> > >
> > > Anyway, all of this is just a digression from the current series.
> >
> > It's not, as you mention that only MSI-X is trapped, I give you another one.
> >
> PASID access from the guest to be done fully by the guest IOMMU.
> Not by virtio devices.
>
> > >
> > > > You need a virtio specific queue or capability to assign a PASID to
> > > > a specific virtqueue, and that can't be done without trapping and
> > > > without virtio specific knowledge.
> > > >
> > > I disagree. PASID assignment to a virtqueue in future from guest virtio driver to
> > device is uniform method.
> > > Whether its PF assigning PASID to VQ of self, Or VF driver in the
> > > guest assigning PASID to VQ.
> > >
> > > All same.
> > > Only IOMMU layer hypercalls will know how to deal with PASID assignment at
> > platform layer to setup the domain etc table.
> > >
> > > And this is way beyond our device migration discussion.
> > > By any means, if you were implying that somehow vq to PASID assignment
> > _may_ need trap+emulation, hence whole device migration to depend on some
> > trap+emulation, then surely I do not agree with it.
> >
> > See above.
> >
> Yeah, I disagree with that implication.
>
> > >
> > > PASID equivalent in mlx5 world is ODP_MR+PD isolating the guest process and
> > all of that just works on efficiency and equivalence principle already for a
> > decade now without any trap+emulation.
> > >
> > > > > When virtio passthrough device is in guest, it has all its PASID accessible.
> > > > >
> > > > > All of this is a large deviation from the current discussion of this
> > > > > series, so I will keep
> > > > it short.
> > > > >
> > > > > >
> > > > > > > Regardless it is not relevant to passthrough mode as PASID is
> > > > > > > yet another
> > > > > > resource.
> > > > > > > And for some cpu if it is trapped, it is generic layer, that
> > > > > > > does not require virtio
> > > > > > involvement.
> > > > > > > So virtio interface asking to trap something because generic
> > > > > > > facility has done
> > > > > > is not the approach.
> > > > > >
> > > > > > This misses the point of PASID. How to use PASID is totally device
> > specific.
> > > > > Sure, and how to virtualize vPASID/pPASID is platform specific as
> > > > > single PASID
> > > > can be used by multiple devices and process.
> > > >
> > > > See above, I think we're talking about different things.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > > Capabilities of #2 is generic across all pci devices, so
> > > > > > > > > it will be handled by the
> > > > > > > > HV.
> > > > > > > > > ATS/PRI cap is also generic manner handled by the HV and PCI
> > device.
> > > > > > > >
> > > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU. You
> > > > > > > > can simply do ATS/PRI passthrough but with an emulated vIOMMU.
> > > > > > > And that is not the reason for virtio device to build
> > > > > > > trap+emulation for
> > > > > > passthrough member devices.
> > > > > >
> > > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > > PRI requests arrive on the PF for the VF.
> > > >
> > > > Shouldn't it arrive at platform IOMMU first? The path should be PRI
> > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest IOMMU.
> > > >
> > > The above sequence seems right.
> > >
> > > > And things will be more complicated when (v)PASID is used. So you
> > > > can't simply let PRI go directly to the guest with the current architecture.
> > > >
> > > In current architecture of the pci VF, PRI does not go directly to the guest.
> > > (and that is not reason to trap and emulate other things).
> >
> > Ok, so beyond MSI-X we need to trap PRI, and we will probably trap other
> > things in the future like PASID assignment.
> PRI etc all belong to generic PCI 4K config space region.

It's not about the capability, it's about the whole process of PRI
request handling. We've agreed that the PRI request needs to be
trapped by the hypervisor and then delivered to the vIOMMU.

> Trap+emulation done in generic manner without involving virtio or other device types.
>
> >
> > >
> > > > >
> > > > > > how can you pass
> > > > > > through a hardware PRI request to a guest directly without trapping it
> > then?
> > > > > > What's more, PCIE allows the PRI to be done in a vendor (virtio)
> > > > > > specific way, so you want to break this rule? Or you want to
> > > > > > blacklist ATS/PRI
> > > > for virtio?
> > > > > >
> > > > > I was aware of only pci-sig way of PRI.
> > > > > Do you have a reference to the ECN that enables vendor specific
> > > > > way of PRI? I
> > > > would like to read it.
> > > >
> > > > I mean it doesn't forbid us to build a virtio specific interface for
> > > > I/O page fault report and recovery.
> > > >
> > > So PRI of PCI does not allow. It is ODP kind of technique you meant above.
> > > Yes one can build.
> > > Ok. unrelated to device migration, so I will park this good discussion for later.
> >
> > That's fine.
> >
> > >
> > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > >
> > > > Probably.
> > > >
> > > > > PRI will directly go to the guest driver, and guest would interact
> > > > > with IOMMU
> > > > to service the paging request through IOMMU APIs.
> > > >
> > > > With PASID, it can't go directly.
> > > >
> > > When the request carries a PASID, it can.
> > > But again these PCI-SIG extensions of PASID are not related to device
> > migration, so I am deferring it.
> > >
> > > > > For PRI in vendor specific way needs a separate discussion. It is
> > > > > not related to
> > > > live migration.
> > > >
> > > > PRI itself is not related. But the point is, you can't simply pass
> > > > through ATS/PRI now.
> > > >
> > > Ah ok. the whole 4K PCI config space where ATS/PRI capabilities are located
> > are trapped+emulated by hypervisor.
> > > So?
> > > So do we start emulating virtio interfaces too for passthrough?
> > > No.
> > > Can one still continue to trap+emulate?
> > > Sure why not?
> >
> > Then let's not limit your proposal to be used by "passthrough" only?
> One can possibly build some variant of the existing virtio member device using same owner and member scheme.

It's not about the member/owner, it's about e.g. whether the hypervisor
can trap and emulate.

I've pointed out that what you invent here is actually a partial new
transport, for example, a hypervisor can trap and use things like
device context in PF to bypass the registers in VF. This is the idea
of transport commands/q.

> If some admin commands are missing for that, maybe one can add them.

I would then build the device context commands on top of the transport
commands/q, then it would be complete.

> No need to step on toes of use cases as they are different...
>
> > I've shown you that
> >
> > 1) you can't easily say you can pass through all the virtio facilities
> > 2) how ambiguous for terminology like "passthrough"
> >
> It is not, it is well defined in v3, v2.
> One can continue to argue and keep defining the variant and still call it data path acceleration and then claim it as passthrough ...
> But I won't debate this anymore as it's just non-technical aspects of least interest.

You use this terminology in the spec, which is entirely technical, yet
you consider how to define it a non-technical matter. This is
self-contradictory. If you fail to define it, that probably means it's
ambiguous. Let's not use that terminology.

> We have technical tasks and more improved specs to update going forward.

It's a burden to do the synchronization.

> Working on extension for device specific contexts to enrich it.

Again, making the proposal general is much more beneficial.

Thanks





>
> > Thanks
> >
> > >
> > > Can one use AQ of this proposal to do so?
> > > Sure, why not?
>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-31 10:14                                                                                     ` Michael S. Tsirkin
@ 2023-11-01  0:42                                                                                       ` Jason Wang
  2023-11-01  1:57                                                                                         ` Zhu, Lingshan
  2023-11-01  1:57                                                                                       ` Zhu, Lingshan
  2023-11-01  2:54                                                                                       ` Parav Pandit
  2 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-01  0:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Tue, Oct 31, 2023 at 6:14 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > > Your answer is not relevant to this discussion at all.
> > > Why?
> > > Because we were discussing the schemes where registers are not used.
> > > One example of that was IMS. It does not matter MSI or MSIX.
> > > As explained in Intel's commit message, the key to focus for IMS is "queue memory" not some hw register like MSI or MSI-X.
> > you know the device always needs to know an address and the data to send an
> > MSI, right?
>
> So if virtio is to use IMS then we'll need to add interfaces to program
> IMS, I think. As part of that patch - it's reasonable to assume - we will
> also need to add a way to retrieve IMS so it can be migrated.
>
> However, what this example demonstrates is that the approach taken
> by this proposal to migrate control path structures - namely, by
> defining a structure used just for migration - means that we will
> need to come up with a migration interface each time.
> And that is unfortunate.
>
> Compare to the trap and emulate approach for config space and we don't
> need a new interface, we just make each field R/W.
> So I feel this is something to think about, and address.
> Ideas?

Something like we've done for transportq in the past? (by just adding
the get commands):

https://lore.kernel.org/all/29533940-0345-4a84-fcc7-f42d914dc28d@intel.com/T/#mbfe209b96e1dda88ed7aae04d25b12026a7b5364
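A minimal sketch of that idea, under stated assumptions: the `TransportQueue` class, its `set`/`get` methods, and the field names below are hypothetical illustrations, not taken from the linked transport queue proposal. It only shows that once every settable control-path field also has a get command, migration can replay the fields without a structure defined just for migration.

```python
class TransportQueue:
    """Hypothetical transport queue model: every settable field is also gettable."""

    def __init__(self) -> None:
        self._fields: dict[str, int] = {}

    def set(self, name: str, value: int) -> None:
        # Stand-in for a "set" transport command.
        self._fields[name] = value

    def get(self, name: str) -> int:
        # Stand-in for the matching "get" transport command.
        return self._fields[name]

    def field_names(self) -> list[str]:
        return sorted(self._fields)


def migrate(src: TransportQueue, dst: TransportQueue) -> None:
    """Read each field with get on the source and replay it with set on the destination."""
    for name in src.field_names():
        dst.set(name, src.get(name))


src = TransportQueue()
src.set("msix_vector_vq0", 3)
src.set("ims_addr_vq0", 0xFEE00000)  # a field added later needs no new migration code
dst = TransportQueue()
migrate(src, dst)
```

In this sketch a newly added field is migrated by the same loop, which is the property being compared with making each config space field R/W.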

Thanks

>
> --
> MST
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-31 10:14                                                                                     ` Michael S. Tsirkin
  2023-11-01  0:42                                                                                       ` Jason Wang
@ 2023-11-01  1:57                                                                                       ` Zhu, Lingshan
  2023-11-01  2:54                                                                                       ` Parav Pandit
  2 siblings, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-11-01  1:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 10/31/2023 6:14 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
>>> Your answer is not relevant to this discussion at all.
>>> Why?
>>> Because we were discussing the schemes where registers are not used.
>>> One example of that was IMS. It does not matter MSI or MSIX.
>>> As explained in Intel's commit message, the key to focus for IMS is "queue memory" not some hw register like MSI or MSI-X.
>> you know the device always needs to know an address and the data to send an
>> MSI, right?
> So if virtio is to use IMS then we'll need to add interfaces to program
> IMS, I think. As part of that patch - it's reasonable to assume - we will
> also need to add a way to retrieve IMS so it can be migrated.
>
> However, what this example demonstrates is that the approach taken
> by this proposal to migrate control path structures - namely, by
> defining a structure used just for migration - means that we will
> need to come up with a migration interface each time.
> And that is unfortunate.
>
> Compare to the trap and emulate approach for config space and we don't
> need a new interface, we just make each field R/W.
> So I feel this is something to think about, and address.
> Ideas?
I think this is transport-specific. For example, PCI has MSI tables;
the driver can access MSI info through the entries in the table,
and that is R/W for migration.

For SIOV, we have invented the "transport vq" (tvq), which supports set/get of
MSI configurations.
I will continue this tvq task once the live migration solution is merged.
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  0:42                                                                                       ` Jason Wang
@ 2023-11-01  1:57                                                                                         ` Zhu, Lingshan
  0 siblings, 0 replies; 341+ messages in thread
From: Zhu, Lingshan @ 2023-11-01  1:57 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/1/2023 8:42 AM, Jason Wang wrote:
> On Tue, Oct 31, 2023 at 6:14 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
>>>> Your answer is not relevant to this discussion at all.
>>>> Why?
>>>> Because we were discussing the schemes where registers are not used.
>>>> One example of that was IMS. It does not matter MSI or MSIX.
>>>> As explained in Intel's commit message, the key to focus for IMS is "queue memory" not some hw register like MSI or MSI-X.
>>> you know the device always needs to know an address and the data to send an
>>> MSI, right?
>> So if virtio is to use IMS then we'll need to add interfaces to program
>> IMS, I think. As part of that patch - it's reasonable to assume - we will
>> also need to add a way to retrieve IMS so it can be migrated.
>>
>> However, what this example demonstrates is that the approach taken
>> by this proposal to migrate control path structures - namely, by
>> defining a structure used just for migration - means that we will
>> need to come up with a migration interface each time.
>> And that is unfortunate.
>>
>> Compare to the trap and emulate approach for config space and we don't
>> need a new interface, we just make each field R/W.
>> So I feel this is something to think about, and address.
>> Ideas?
> Something like we've done for transportq in the past? (by just adding
> the get commands):
>
> https://lore.kernel.org/all/29533940-0345-4a84-fcc7-f42d914dc28d@intel.com/T/#mbfe209b96e1dda88ed7aae04d25b12026a7b5364
Exactly! And I will continue this work once our live migration proposal
is merged.
>
> Thanks
>
>> --
>> MST
>>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-31 10:14                                                                                     ` Michael S. Tsirkin
  2023-11-01  0:42                                                                                       ` Jason Wang
  2023-11-01  1:57                                                                                       ` Zhu, Lingshan
@ 2023-11-01  2:54                                                                                       ` Parav Pandit
  2023-11-01  5:31                                                                                         ` Michael S. Tsirkin
  2 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-01  2:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Zhu, Lingshan
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, October 31, 2023 3:44 PM
> 
> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > > Your answer is not relevant to this discussion at all.
> > > Why?
> > > Because we were discussing the schemes where registers are not used.
> > > One example of that was IMS. It does not matter MSI or MSIX.
> > > As explained in Intel's commit message, the key to focus for IMS is "queue
> memory" not some hw register like MSI or MSI-X.
> > you know the device always needs to know an address and the data to send
> > an MSI, right?
> 
> So if virtio is to use IMS then we'll need to add interfaces to program IMS, I
> think. As part of that patch - it's reasonable to assume - we will also need to add
> a way to retrieve IMS so it can be migrated.
> 
> However, what this example demonstrates is that the approach taken by this
> proposal to migrate control path structures - namely, by defining a structure
> used just for migration - means that we will need to come up with a migration
> interface each time.
> And that is unfortunate.
>
When the device supports a new feature, it supports new functionality.
Hence the live migration side gets updated as well.
However, the live migration driver does not have to understand what is inside the control path structures.
It is just a byte stream.
Only if the hypervisor live migration driver is involved in emulation will it parse them, and that is fine, just like other control structures.
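
A minimal sketch of the byte stream point, assuming hypothetical `read_device_context` and `write_device_context` helpers that stand in for the device context commands (the names and the dict-backed devices are illustrative only, not from the spec). The migration driver moves the context without parsing it.

```python
# Hypothetical model: a "device" is a dict holding an opaque context blob.
def read_device_context(device: dict) -> bytes:
    """Stand-in for a device context read command; returns opaque bytes."""
    return bytes(device["context"])


def write_device_context(device: dict, blob: bytes) -> None:
    """Stand-in for a device context write command; stores opaque bytes."""
    device["context"] = bytes(blob)


def migrate_context(src: dict, dst: dict) -> int:
    """Copy the context from source to destination without interpreting it.

    Returns the number of bytes transferred.
    """
    blob = read_device_context(src)
    write_device_context(dst, blob)
    return len(blob)


# A new feature changes what the device puts in the blob, not this code.
source = {"context": b"\x01\x02vq-state\x00new-feature-state"}
destination = {"context": b""}
transferred = migrate_context(source, destination)
```

Only a driver that also emulates the device would need to look inside the blob.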
 



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  0:33                                                                         ` Jason Wang
@ 2023-11-01  3:07                                                                           ` Parav Pandit
  2023-11-02  4:24                                                                             ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-01  3:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 1, 2023 6:03 AM
> 
> On Tue, Oct 31, 2023 at 1:17 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: virtio-comment@lists.oasis-open.org
> > > <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> > > Sent: Tuesday, October 31, 2023 7:07 AM
> > >
> > > On Mon, Oct 30, 2023 at 12:28 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Monday, October 30, 2023 9:35 AM
> > > > >
> > > > > On 2023/10/26 11:50, Parav Pandit wrote:
> > > > > >> From: virtio-comment@lists.oasis-open.org
> > > > > >> <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> > > > > >> For example, you still haven't succeeded in defining passthrough.
> > > > > > It was defined on 19th Oct in [1].
> > > > > > What part is not clear to you in definition of passthrough device?
> > > > > >
> > > > > > [1]
> > > > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > >
> > > > >
> > > > > Let me copy-paste it again:
> > > > >
> > > > > For example, assuming you are correct, you still fail to explain
> > > > >
> > > > > 1) what is trapped and what's not, or what's the boundary
> > > > The passthrough definition has been given a few times.
> > > > One of them is here,
> > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > I don’t know what you mean by 'explain'. What do you want to be
> explained?
> > > > What is trapped is listed in
> > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > What is not trapped is also listed in
> > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > So what more do you want to explain in there?
> > >
> > > You explained that MSI-X is trapped but not the others. People may know
> why.
> > > or what's the boundary to choose to trap or not.
> > >
> > If a platform can support without trapping, it can be avoided as well and can
> be added in the future.
> 
> Who is going to do that synchronization?
Let's first bring that hypervisor software design before discussing solutions to phantom problems.
All necessary modules will be involved in synchronization, depending on how it's done in the future.

> 
> >
> > > >
> > > > > 2) if the hypervisor is not developed with those assumptions,
> > > > > things can work
> > > > What to explain in #2. :)
> > > > Things can expand when such hypervisor is born.
> > >
> > > So the point is still, to make your proposal to be useful in more use cases.
> > >
> > When a use case arise, device context can be expanded.
> 
> It's not device context.
>
I don’t see why not. It is stored in the device.
Remapping part will be hypervisor specific, so it may be stored in platform specific migration data.
 
> > No point in making things no one implements or not present in hypervisor.
> > The infrastructure is extendible so spec is covered for it.
> 
> It would be problematic if you stick to claim "passthrough" but not.

I don’t know what this means. I am not debating passthrough/non-passthrough.
What is inside the device, will be part of device-context.
What is part of the platform content, will be part of platform context.
Since this is generic to all types of PCI devices, I don’t see a need to over-solve it now in virtio.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-01  0:33                                                                   ` Jason Wang
@ 2023-11-01  3:31                                                                     ` Parav Pandit
  2023-11-02  4:25                                                                       ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-01  3:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 1, 2023 6:04 AM
> 
> On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, October 31, 2023 7:05 AM
> > >
> > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: virtio-comment@lists.oasis-open.org
> > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> > > > >
> > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > >
> > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > >
> > > > > > > > > How do you know that?
> > > > > > > > Because for passthrough, the hypervisor is not involved in
> > > > > > > > dealing with VQ at
> > > > > > > all.
> > > > > > >
> > > > > > > Ok, so if I understand correctly, you are saying your design
> > > > > > > can't work for the case of PASID assignment.
> > > > > > >
> > > > > > No. PASID assignment will happen from the guest for its own
> > > > > > use and device
> > > > > migration will just work fine because device context will capture this.
> > > > >
> > > > > It's not about device context. We're discussing "passthrough", no?
> > > > >
> > > > Not sure, we are discussing same.
> > > > A member device is passthrough to the guest, dealing with its own
> > > > PASIDs and
> > > virtio interface for some VQ assignment to PASID.
> > > > So VQ context captured by the hypervisor, will have some PASID
> > > > attached to
> > > this VQ.
> > > > Device context will be updated.
> > > >
> > > > > You want all virtio stuff to be "passthrough", but assigning a
> > > > > PASID to a specific virtqueue in the guest must be trapped.
> > > > >
> > > > No. PASID assignment to a specific virtqueue in the guest must go
> > > > directly
> > > from guest to device.
> > >
> > > This works like setting CR3, you can't simply let it go from guest to host.
> > >
> > > Host IOMMU driver needs to know the PASID to program the IO page
> > > tables correctly.
> > >
> > This will be done by the IOMMU.
> >
> > > > When guest iommu may need to communicate anything for this PASID,
> > > > it will
> > > come through its proper IOMMU channel/hypercall.
> > >
> > > Let's say using PASID X for queue 0, this knowledge is beyond the
> > > IOMMU scope but belongs to virtio. Or please explain how it can work
> > > when it goes directly from guest to device.
> > >
> > We are yet to ever see spec for PASID to VQ assignment.
> 
> It has one.
> 
> > OK, for theory's sake, assume it is there.
> >
> > Virtio driver will assign the PASID directly from guest driver to device using a
> create_vq(pasid=X) command.
> > Same process is somehow attached the PASID by the guest OS.
> > The whole PASID range is known to the hypervisor when the device is handed
> over to the guest VM.
> 
> How can it know?
> 
> > So PASID mapping is setup by the hypervisor IOMMU at this point.
> 
> You disallow the PASID to be virtualized here. What's more, such a PASID
> passthrough has security implications.
>
No. The virtio spec is not disallowing it. At least, this series certainly is not the one doing so.
My main point is that the virtio device interface will not be the source of a hypercall to program the IOMMU in the hypervisor.
It is something to be done by the IOMMU side.

> Again, we are talking about different things, I've tried to show you that there are
> cases that passthrough can't work but if you think the only way for migration is
> to use passthrough in every case, you will probably fail.
> 
I didn't say the only way for migration is passthrough.
Passthrough is clearly one way.
Other ways may be possible.

> >
> > > > Virtio device is not the conduit for this exchange.
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > There are works ongoing to make vPASID work for the
> > > > > > > > > guest like
> > > vSVA.
> > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > Passthrough do not run like SVA.
> > > > > > >
> > > > > > > Great, you find another limitation of "passthrough" by yourself.
> > > > > > >
> > > > > > No. it is not the limitation it is just the way it does not
> > > > > > need complex SVA to
> > > > > split the device for unrelated usage.
> > > > >
> > > > > How can you limit the user in the guest to not use vSVA?
> > > > >
> > > > He he, I am not limiting, again misunderstanding or wrong attribution.
> > > > I explained that hypervisor for passthrough does not need SVA.
> > > > Guest can do anything it wants from the guest OS with the member
> device.
> > >
> > > Ok, so the point still stands, see above.
> >
> > I don’t think so. The guest owns its PASID space
> 
> Again, vPASID to PASID can't be done hardware unless I miss some recent
> features of IOMMUs.
> 
CPU vendors have different ways of mapping vPASID to pPASID.
It is still an early space for virtio.

> > and directly communicates like any other device attribute.
> >
> > >
> > > >
> > > > > >
> > > > > > > > Each passthrough device has PASID from its own space fully
> > > > > > > > managed by the
> > > > > > > guest.
> > > > > > > > > Some cpu required vPASID and SIOV is not going this way anymore.
> > > > > > >
> > > > > > > Then how to migrate? Invent a full set of something else
> > > > > > > through another giant series like this to migrate to the SIOV thing?
> > > > > > > That's a mess for
> > > > > sure.
> > > > > > >
> > > > > > SIOV will for sure reuse most or all parts of this work, almost entirely
> as_is.
> > > > > > vPASID is cpu/platform specific things not part of the SIOV devices.
> > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > If at all it is done, it will be done from the guest
> > > > > > > > > > by the driver using virtio
> > > > > > > > > interface.
> > > > > > > > >
> > > > > > > > > Then you need to trap. Such things couldn't be passed
> > > > > > > > > through to guests
> > > > > > > directly.
> > > > > > > > >
> > > > > > > > Only PASID capability is trapped. PASID allocation and
> > > > > > > > usage is directly from
> > > > > > > guest.
> > > > > > >
> > > > > > > > How can you achieve this? Assigning a PASID to a device is
> > > > > > > completely
> > > > > > > device(virtio) specific. How can you use a general layer
> > > > > > > without the knowledge of virtio to trap that?
> > > > > > When one wants to map vPASID to pPASID a platform needs to be
> > > involved.
> > > > >
> > > > > I'm not talking about how to map vPASID to pPASID, it's out of
> > > > > the scope of virtio. I'm talking about assigning a vPASID to a
> > > > > specific virtqueue or other virtio function in the guest.
> > > > >
> > > > That can be done in the guest. The key is guest wont know that it
> > > > is dealing
> > > with vPASID.
> > > > It will follow the same principle from your paper of equivalency,
> > > > where virtio
> > > software layer will assign PASID to VQ and communicate to device.
> > > >
> > > > Anyway, all of this just digression from current series.
> > >
> > > It's not, as you mention that only MSI-X is trapped, I give you another one.
> > >
> > PASID access from the guest to be done fully by the guest IOMMU.
> > Not by virtio devices.
> >
> > > >
> > > > > You need a virtio specific queue or capability to assign a PASID
> > > > > to a specific virtqueue, and that can't be done without trapping
> > > > > > and without virtio specific knowledge.
> > > > >
> > > > > I disagree. PASID assignment to a virtqueue in future from guest
> > > > virtio driver to
> > > device is uniform method.
> > > > Whether its PF assigning PASID to VQ of self, Or VF driver in the
> > > > guest assigning PASID to VQ.
> > > >
> > > > All same.
> > > > Only IOMMU layer hypercalls will know how to deal with PASID
> > > > assignment at
> > > platform layer to setup the domain etc table.
> > > >
> > > > And this is way beyond our device migration discussion.
> > > > By any means, if you were implying that somehow vq to PASID
> > > > assignment
> > > _may_ need trap+emulation, hence whole device migration to depend on
> > > some
> > > > trap+emulation, then surely I do not agree to it.
> > >
> > > See above.
> > >
> > Yeah, I disagree to such implying.
> >
> > > >
> > > > PASID equivalent in mlx5 world is ODP_MR+PD isolating the guest
> > > > process and
> > > all of that just works on efficiency and equivalence principle
> > > already for a decade now without any trap+emulation.
> > > >
> > > > > > When virtio passthrough device is in guest, it has all its PASID
> accessible.
> > > > > >
> > > > > > All these is large deviation from current discussion of this
> > > > > > series, so I will keep
> > > > > it short.
> > > > > >
> > > > > > >
> > > > > > > > Regardless it is not relevant to passthrough mode as PASID
> > > > > > > > is yet another
> > > > > > > resource.
> > > > > > > > And for some cpu if it is trapped, it is generic layer,
> > > > > > > > that does not require virtio
> > > > > > > involvement.
> > > > > > > > So virtio interface asking to trap something because
> > > > > > > > generic facility has done
> > > > > > > in not the approach.
> > > > > > >
> > > > > > > This misses the point of PASID. How to use PASID is totally
> > > > > > > device
> > > specific.
> > > > > > Sure, and how to virtualize vPASID/pPASID is platform specific
> > > > > > as single PASID
> > > > > can be used by multiple devices and process.
> > > > >
> > > > > See above, I think we're talking about different things.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > > Capabilities of #2 is generic across all pci devices,
> > > > > > > > > > so it will be handled by the
> > > > > > > > > HV.
> > > > > > > > > > ATS/PRI cap is also generic manner handled by the HV
> > > > > > > > > > and PCI
> > > device.
> > > > > > > > >
> > > > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU.
> > > > > > > > > You can simply do ATS/PRI passthrough but with an emulated
> vIOMMU.
> > > > > > > > And that is not the reason for virtio device to build
> > > > > > > > trap+emulation for
> > > > > > > passthrough member devices.
> > > > > > >
> > > > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > > > PRI requests arrive on the PF for the VF.
> > > > >
> > > > > Shouldn't it arrive at platform IOMMU first? The path should be
> > > > > PRI
> > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest IOMMU.
> > > > >
> > > > > Above sequence seems right.
> > > >
> > > > > And things will be more complicated when (v)PASID is used. So
> > > > > you can't simply let PRI go directly to the guest with the current
> architecture.
> > > > >
> > > > In current architecture of the pci VF, PRI does not go directly to the guest.
> > > > (and that is not reason to trap and emulate other things).
> > >
> > > Ok, so beyond MSI-X we need to trap PRI, and we will probably trap
> > > other things in the future like PASID assignment.
> > PRI etc all belong to generic PCI 4K config space region.
> 
> It's not about the capability, it's about the whole process of PRI request
> handling. We've agreed that the PRI request needs to be trapped by the
> hypervisor and then delivered to the vIOMMU.
>
 
> > Trap+emulation done in generic manner without involving virtio or other
> device types.
> >
> > >
> > > >
> > > > > >
> > > > > > > how can you pass
> > > > > > > through a hardware PRI request to a guest directly without
> > > > > > > trapping it
> > > then?
> > > > > > > What's more, PCIE allows the PRI to be done in a vendor
> > > > > > > (virtio) specific way, so you want to break this rule? Or
> > > > > > > you want to blacklist ATS/PRI
> > > > > for virtio?
> > > > > > >
> > > > > > I was aware of only pci-sig way of PRI.
> > > > > > Do you have a reference to the ECN that enables vendor
> > > > > > specific way of PRI? I
> > > > > would like to read it.
> > > > >
> > > > > I mean it doesn't forbid us to build a virtio specific interface
> > > > > for I/O page fault report and recovery.
> > > > >
> > > > So PRI of PCI does not allow. It is ODP kind of technique you meant above.
> > > > Yes one can build.
> > > > Ok. unrelated to device migration, so I will park this good discussion for
> later.
> > >
> > > That's fine.
> > >
> > > >
> > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > >
> > > > > Probably.
> > > > >
> > > > > > PRI will directly go to the guest driver, and guest would
> > > > > > interact with IOMMU
> > > > > to service the paging request through IOMMU APIs.
> > > > >
> > > > > With PASID, it can't go directly.
> > > > >
> > > > When the request consist of PASID in it, it can.
> > > > But again these PCI-SIG extensions of PASID are not related to
> > > > device
> > > > migration, so I am deferring it.
> > > >
> > > > > > For PRI in vendor specific way needs a separate discussion. It
> > > > > > is not related to
> > > > > live migration.
> > > > >
> > > > > PRI itself is not related. But the point is, you can't simply
> > > > > pass through ATS/PRI now.
> > > > >
> > > > Ah ok. the whole 4K PCI config space where ATS/PRI capabilities
> > > > are located
> > > are trapped+emulated by hypervisor.
> > > > So?
> > > > > So do we start emulating virtio interfaces too for passthrough?
> > > > No.
> > > > Can one still continue to trap+emulate?
> > > > Sure why not?
> > >
> > > Then let's not limit your proposal to be used by "passthrough" only?
> > One can possibly build some variant of the existing virtio member device
> using same owner and member scheme.
> 
> It's not about the member/owner, it's about e.g whether the hypervisor can
> trap and emulate.
> 
> I've pointed out that what you invent here is actually a partial new transport, for
> example, a hypervisor can trap and use things like device context in PF to bypass
> the registers in VF. This is the idea of transport commands/q.
>
I will not mix in transport commands, which are mainly useful for actual device operation for SIOV, and only for backward compatibility, and optionally at that.
One may of course still choose to have the virtio common and device config in MMIO, at lower scale.

Anyway, mixing the migration context with an SIOV-specific thing is not correct, as the device context is read and written as incremental values.

> > If for that is some admin commands are missing, may be one can add them.
> 
> I would then build the device context commands on top of the transport
> commands/q, then it would be complete.
> 
> > No need to step on toes of use cases as they are different...
> >
> > > I've shown you that
> > >
> > > 1) you can't easily say you can pass through all the virtio
> > > facilities
> > > 2) how ambiguous for terminology like "passthrough"
> > >
> > It is not, it is well defined in v3, v2.
> > One can continue to argue and keep defining the variant and still call it data
> path acceleration and then claim it as passthrough ...
> > But I won't debate this anymore as its just non-technical aspects of least
> interest.
> 
> You use this terminology in the spec which is all about technical, and you think
> how to define it is a matter of non-technical. This is self-contradictory. If you fail,
> it probably means it's ambiguous.
> Let's don't use that terminology.
>
What it means is described in the theory of operation.
 
> > We have technical tasks and more improved specs to update going forward.
> 
> It's a burden to do the synchronization.
We have discussed this.
In the current proposal the member device is not bifurcated, so it implements the necessary pieces.
Feature != burden.

> 
> > Working on extension for device specific contexts to enrich it.
> 
> Again, making the proposal to be general is much more beneficial.

Yes, it is general, and as with any other device type, each has its own extensions.
The infrastructure is covered in v3.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  2:54                                                                                       ` Parav Pandit
@ 2023-11-01  5:31                                                                                         ` Michael S. Tsirkin
  2023-11-01  5:42                                                                                           ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-01  5:31 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Tuesday, October 31, 2023 3:44 PM
> > 
> > On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > > > Your answer is not relevant to this discussion at all.
> > > > Why?
> > > > Because we were discussing the schemes where registers are not used.
> > > > One example of that was IMS. It does not matter MSI or MSIX.
> > > > As explained in Intel's commit message, the key to focus for IMS is "queue
> > memory" not some hw register like MSI or MSI-X.
> > > you know the device always need to know a address and the data to send
> > > a MSI, right?
> > 
> > So if virtio is to use IMS then we'll need to add interfaces to program IMS, I
> > think. As part of that patch - it's reasonable to assume - we will also need to add
> > a way to retrieve IMS so it can be migrated.
> > 
> > However, what this example demonstrates is that the approach taken by this
> > proposal to migrate control path structures - namely, by defining a structure
> > used just for migration - means that we will need to come up with a migration
> > interface each time.
> > And that is unfortunate.
> >
> When the device supports a new feature it has supported new functionality.
> Hence the live migration side also got updated.
> However, the live migration driver does not have to understand what is inside the control path structures.
> It is just byte stream.
> Only if the hypervisor live migration drive involved in emulating, it will parse and that is fine as like other control structures.

The point is that any new field needs to be added in two places now and
that is not great at all.

We need a stronger compatibility story here I think.

One way to show how it's designed to work would be to split
the patches. For example, add queue notify data and queue reset
separately.

Another is to add MSIX table migration option for when MSIX table
is passed through to guest.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  5:31                                                                                         ` Michael S. Tsirkin
@ 2023-11-01  5:42                                                                                           ` Parav Pandit
  2023-11-01  6:37                                                                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-01  5:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 1, 2023 11:01 AM
> 
> On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Tuesday, October 31, 2023 3:44 PM
> > >
> > > On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > > > > Your answer is not relevant to this discussion at all.
> > > > > Why?
> > > > > Because we were discussing the schemes where registers are not used.
> > > > > One example of that was IMS. It does not matter MSI or MSIX.
> > > > > As explained in Intel's commit message, the key to focus for IMS
> > > > > is "queue
> > > memory" not some hw register like MSI or MSI-X.
> > > > you know the device always need to know a address and the data to
> > > > send a MSI, right?
> > >
> > > So if virtio is to use IMS then we'll need to add interfaces to
> > > program IMS, I think. As part of that patch - it's reasonable to
> > > assume - we will also need to add a way to retrieve IMS so it can be
> migrated.
> > >
> > > However, what this example demonstrates is that the approach taken
> > > by this proposal to migrate control path structures - namely, by
> > > defining a structure used just for migration - means that we will
> > > need to come up with a migration interface each time.
> > > And that is unfortunate.
> > >
> > When the device supports a new feature it has supported new functionality.
> > Hence the live migration side also got updated.
> > However, the live migration driver does not have to understand what is inside
> the control path structures.
> > It is just byte stream.
> > Only if the hypervisor live migration drive involved in emulating, it will parse
> and that is fine as like other control structures.
> 
> The point is that any new field needs to be added in two places now and that is
> not great at all.
> 
Most control structs are well defined. So only the type field is added on the migrating driver side.
This is a very low-overhead field, handled in a generic way for all device types and for all common types.
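
To make the "just a byte stream" argument concrete, here is a minimal sketch of a migration driver walking a device context by type and length only, without looking inside any payload. The header layout (`struct ctx_field_hdr`) and the function name are assumptions made up for illustration; they are not the format defined in this patch series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header, NOT the layout from the patch series. */
struct ctx_field_hdr {
    uint16_t type;      /* identifies the control structure that follows */
    uint16_t reserved;
    uint32_t length;    /* number of payload bytes after this header */
};

/*
 * Walk a device context stream and count its records.
 * Payloads are skipped as opaque bytes, so an unknown type costs nothing.
 * Returns the record count, or -1 if the stream is truncated.
 */
static int ctx_count_fields(const uint8_t *buf, size_t len)
{
    size_t off = 0;
    int count = 0;

    while (off < len) {
        struct ctx_field_hdr hdr;

        if (len - off < sizeof(hdr))
            return -1;                  /* header cut short */
        memcpy(&hdr, buf + off, sizeof(hdr));
        off += sizeof(hdr);

        if (hdr.length > len - off)
            return -1;                  /* payload cut short */
        off += hdr.length;              /* skip payload without parsing it */
        count++;
    }
    return count;
}
```

The sketch reads the header in host byte order for brevity; a real implementation would have to follow whatever endianness and layout the specification ends up defining. Note that it only illustrates the generic walk; it says nothing about who has to define the payload of a new type, which is the point under dispute.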

> We need a stronger compatibility story here I think.
> 
> One way to show how it's designed to work would be to split the patches. For
> example, add queue notify data and queue reset separately.
I didn't follow the suggestion. Can you explain splitting patches and its relation to the structure?

> 
> Another is to add MSIX table migration option for when MSIX table is passed
> through to guest.
Yes, this will be added in the future when there is an actual hypervisor for it.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  5:42                                                                                           ` Parav Pandit
@ 2023-11-01  6:37                                                                                             ` Michael S. Tsirkin
  2023-11-01  6:39                                                                                               ` Zhu, Lingshan
  2023-11-01  6:47                                                                                               ` Parav Pandit
  0 siblings, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-01  6:37 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Wednesday, November 1, 2023 11:01 AM
> > 
> > On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Tuesday, October 31, 2023 3:44 PM
> > > >
> > > > On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > > > > > Your answer is not relevant to this discussion at all.
> > > > > > Why?
> > > > > > Because we were discussing the schemes where registers are not used.
> > > > > > One example of that was IMS. It does not matter MSI or MSIX.
> > > > > > As explained in Intel's commit message, the key to focus for IMS
> > > > > > is "queue
> > > > memory" not some hw register like MSI or MSI-X.
> > > > > you know the device always need to know a address and the data to
> > > > > send a MSI, right?
> > > >
> > > > So if virtio is to use IMS then we'll need to add interfaces to
> > > > program IMS, I think. As part of that patch - it's reasonable to
> > > > assume - we will also need to add a way to retrieve IMS so it can be
> > migrated.
> > > >
> > > > However, what this example demonstrates is that the approach taken
> > > > by this proposal to migrate control path structures - namely, by
> > > > defining a structure used just for migration - means that we will
> > > > need to come up with a migration interface each time.
> > > > And that is unfortunate.
> > > >
> > > When the device supports a new feature it has supported new functionality.
> > > Hence the live migration side also got updated.
> > > However, the live migration driver does not have to understand what is inside
> > the control path structures.
> > > It is just byte stream.
> > > Only if the hypervisor live migration drive involved in emulating, it will parse
> > and that is fine as like other control structures.
> > 
> > The point is that any new field needs to be added in two places now and that is
> > not great at all.
> > 
> Most control structs are well defined. So only its type field is added to migrating driver side.
> This is very low overhead field and handled in generic way for all device types and for all common types.

Weird, not what I see.  E.g. you seem to have a structure duplicating queue
fields. Each new field will have to be added there in addition to the
transport.

> > We need a stronger compatibility story here I think.
> > 
> > One way to show how it's designed to work would be to split the patches. For
> > example, add queue notify data and queue reset separately.
> I didn't follow the suggestion. Can you explain splitting patches and its relation to the structure?
> 
> > 
> > Another is to add MSIX table migration option for when MSIX table is passed
> > through to guest.
> Yes, this will be added in future when there is actual hypervisor for it.

You are tying the architecture to an extremely implementation specific
detail.

Hypervisors *already* have to migrate the MSIX table. Just in a hypervisor
specific way. queue vector is an index into this table. So the index
is migrated through the device but the table itself has to be
trapped and emulated by hypervisor? Give me a break.
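
For readers following the MSI-X point, a minimal sketch of the relationship being described: the per-queue vector is only an index, and the address/data pair actually used to raise the interrupt lives in the MSI-X table entry it selects. The entry layout and the `VIRTIO_MSI_NO_VECTOR` value follow the PCI and virtio specifications; the function name `queue_vector_lookup` is an assumption for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_MSI_NO_VECTOR 0xffff

/* One MSI-X table entry as laid out in BAR memory (16 bytes). */
struct msix_table_entry {
    uint32_t msg_addr_lo;   /* message address, lower 32 bits */
    uint32_t msg_addr_hi;   /* message address, upper 32 bits */
    uint32_t msg_data;      /* message data written to that address */
    uint32_t vector_ctrl;   /* bit 0 is the per-vector mask */
};

/*
 * Resolve a queue's vector index to its table entry.
 * Returns NULL when the queue has no vector or the index is out of range.
 */
static const struct msix_table_entry *
queue_vector_lookup(const struct msix_table_entry *table,
                    uint16_t table_size, uint16_t queue_msix_vector)
{
    if (queue_msix_vector == VIRTIO_MSI_NO_VECTOR ||
        queue_msix_vector >= table_size)
        return NULL;
    return &table[queue_msix_vector];
}
```

Restoring only the index on the destination is not enough to deliver an interrupt: the entry it points at has to be restored as well, by whichever component owns the table.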

-- 
MST


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  6:37                                                                                             ` Michael S. Tsirkin
@ 2023-11-01  6:39                                                                                               ` Zhu, Lingshan
  2023-11-01  6:50                                                                                                 ` Parav Pandit
  2023-11-01  6:47                                                                                               ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-11-01  6:39 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/1/2023 2:37 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
>>> From: Michael S. Tsirkin <mst@redhat.com>
>>> Sent: Wednesday, November 1, 2023 11:01 AM
>>>
>>> On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
>>>>
>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>> Sent: Tuesday, October 31, 2023 3:44 PM
>>>>>
>>>>> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
>>>>>>> Your answer is not relevant to this discussion at all.
>>>>>>> Why?
>>>>>>> Because we were discussing the schemes where registers are not used.
>>>>>>> One example of that was IMS. It does not matter MSI or MSIX.
>>>>>>> As explained in Intel's commit message, the key to focus for IMS
>>>>>>> is "queue
>>>>> memory" not some hw register like MSI or MSI-X.
>>>>>> you know the device always need to know a address and the data to
>>>>>> send a MSI, right?
>>>>> So if virtio is to use IMS then we'll need to add interfaces to
>>>>> program IMS, I think. As part of that patch - it's reasonable to
>>>>> assume - we will also need to add a way to retrieve IMS so it can be
>>> migrated.
>>>>> However, what this example demonstrates is that the approach taken
>>>>> by this proposal to migrate control path structures - namely, by
>>>>> defining a structure used just for migration - means that we will
>>>>> need to come up with a migration interface each time.
>>>>> And that is unfortunate.
>>>>>
>>>> When the device supports a new feature it has supported new functionality.
>>>> Hence the live migration side also got updated.
>>>> However, the live migration driver does not have to understand what is inside
>>> the control path structures.
>>>> It is just byte stream.
>>>> Only if the hypervisor live migration drive involved in emulating, it will parse
>>> and that is fine as like other control structures.
>>>
>>> The point is that any new field needs to be added in two places now and that is
>>> not great at all.
>>>
>> Most control structs are well defined. So only its type field is added to migrating driver side.
>> This is very low overhead field and handled in generic way for all device types and for all common types.
> Weird, not what I see.  E.g. you seem to have a structure duplicating queue
> fields. Each new field will have to be added there in addition to the
> transport.
>
>>> We need a stronger compatibility story here I think.
>>>
>>> One way to show how it's designed to work would be to split the patches. For
>>> example, add queue notify data and queue reset separately.
>> I didn't follow the suggestion. Can you explain splitting patches and its relation to the structure?
>>
>>> Another is to add MSIX table migration option for when MSIX table is passed
>>> through to guest.
>> Yes, this will be added in future when there is actual hypervisor for it.
> You are tying the architecture to an extremely implementation specific
> detail.
>
> Hypervisors *already* have to migrate the MSIX table. Just in a hypervisor
> specific way. queue vector is an index into this table. So the index
> is migrated through the device but the table itself has to be
> trapped and emulated by hypervisor? Give me a break.
I agree, the MSI table could be R/W anyway.
>


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  6:37                                                                                             ` Michael S. Tsirkin
  2023-11-01  6:39                                                                                               ` Zhu, Lingshan
@ 2023-11-01  6:47                                                                                               ` Parav Pandit
  2023-11-01  8:28                                                                                                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-01  6:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 1, 2023 12:07 PM
> 
> On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Wednesday, November 1, 2023 11:01 AM
> > >
> > > On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Tuesday, October 31, 2023 3:44 PM
> > > > >
> > > > > On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > > > > > > Your answer is not relevant to this discussion at all.
> > > > > > > Why?
> > > > > > > Because we were discussing the schemes where registers are not
> used.
> > > > > > > One example of that was IMS. It does not matter MSI or MSIX.
> > > > > > > As explained in Intel's commit message, the key to focus for
> > > > > > > IMS is "queue
> > > > > memory" not some hw register like MSI or MSI-X.
> > > > > > you know the device always need to know a address and the data
> > > > > > to send a MSI, right?
> > > > >
> > > > > So if virtio is to use IMS then we'll need to add interfaces to
> > > > > program IMS, I think. As part of that patch - it's reasonable to
> > > > > assume - we will also need to add a way to retrieve IMS so it
> > > > > can be
> > > migrated.
> > > > >
> > > > > However, what this example demonstrates is that the approach
> > > > > taken by this proposal to migrate control path structures -
> > > > > namely, by defining a structure used just for migration - means
> > > > > that we will need to come up with a migration interface each time.
> > > > > And that is unfortunate.
> > > > >
> > > > When the device supports a new feature it has supported new
> functionality.
> > > > Hence the live migration side also got updated.
> > > > However, the live migration driver does not have to understand
> > > > what is inside
> > > the control path structures.
> > > > It is just byte stream.
> > > > Only if the hypervisor live migration drive involved in emulating,
> > > > it will parse
> > > and that is fine as like other control structures.
> > >
> > > The point is that any new field needs to be added in two places now
> > > and that is not great at all.
> > >
> > Most control structs are well defined. So only its type field is added to
> migrating driver side.
> > This is very low overhead field and handled in generic way for all device types
> and for all common types.
> 
> Weird, not what I see.  E.g. you seem to have a structure duplicating queue
> fields. Each new field will have to be added there in addition to the transport.
> 
Didn't follow.
Which structure is duplicated?
PCI does not even have a structure for queue configuration.

> > > We need a stronger compatiblity story here I think.
> > >
> > > One way to show how it's designed to work would be to split the
> > > patches. For example, add queue notify data and queue reset separately.
> > I didn't follow the suggestion. Can you explain splitting patches and its relation
> to the structure?
Did I miss your response? Or is it in the MSI-X part below?

> >
> > >
> > > Another is to add MSIX table migration option for when MSIX table is
> > > passed through to guest.
> > Yes, this will be added in future when there is actual hypervisor for it.
> 
> You are tying the architecture to an extremely implementation specific detail.
>
Which part?

Not really. Can you please say which software will use MSI-X table migration?

> Hypervisors *already* have migrate the MSIX table. Just in a hypervisor specific
> way. 
MSI-X table is in the PCI BAR memory.

> queue vector is an index into this table. So the index is migrated through
> the device but the table itself has to be trapped and emulated by hypervisor?
Do you have a hypervisor and a platform that does not already do MSI-X table emulation, for which you are asking to add this?
I don't know of any.
I am asking to add it when there is a _real_ user of it. Why can it not be added when the _real_ user arrives?
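
For readers following the MSI-X point above, here is a minimal C sketch of the relationship under discussion: the MSI-X table is an array of 16-byte entries in BAR memory, and the per-queue vector value is an index into that array. The entry layout follows the PCI MSI-X definition, and 0xffff is the value virtio uses for VIRTIO_MSI_NO_VECTOR. The struct and helper names are illustrative assumptions and are not taken from the patch series.

```c
#include <stddef.h>
#include <stdint.h>

/* One MSI-X table entry as laid out in BAR memory (16 bytes). */
struct msix_entry {
    uint32_t msg_addr_lo;   /* message address, lower 32 bits */
    uint32_t msg_addr_hi;   /* message address, upper 32 bits */
    uint32_t msg_data;      /* message data written to that address */
    uint32_t vector_ctrl;   /* bit 0 is the per-vector mask bit */
};

#define NO_VECTOR 0xffffu   /* same value as VIRTIO_MSI_NO_VECTOR */

/*
 * Hypothetical helper: resolve a queue's vector index to its table entry.
 * Returns NULL when the queue has no vector or the index is out of range.
 */
const struct msix_entry *queue_vector_entry(const struct msix_entry *table,
                                            size_t table_size,
                                            uint16_t queue_msix_vector)
{
    if (queue_msix_vector == NO_VECTOR || queue_msix_vector >= table_size)
        return NULL;
    return &table[queue_msix_vector];
}
```

The queue's vector field carries only the index; the address and data pair lives in the table entry that the index selects.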

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  6:39                                                                                               ` Zhu, Lingshan
@ 2023-11-01  6:50                                                                                                 ` Parav Pandit
  2023-11-01  6:56                                                                                                   ` Zhu, Lingshan
  2023-11-01  8:36                                                                                                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-01  6:50 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, November 1, 2023 12:09 PM
> 
> On 11/1/2023 2:37 PM, Michael S. Tsirkin wrote:
> > On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
> >>> From: Michael S. Tsirkin <mst@redhat.com>
> >>> Sent: Wednesday, November 1, 2023 11:01 AM
> >>>
> >>> On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> >>>>
> >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>> Sent: Tuesday, October 31, 2023 3:44 PM
> >>>>>
> >>>>> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> >>>>>>> Your answer is not relevant to this discussion at all.
> >>>>>>> Why?
> >>>>>>> Because we were discussing the schemes where registers are not used.
> >>>>>>> One example of that was IMS. It does not matter MSI or MSIX.
> >>>>>>> As explained in Intel's commit message, the key to focus for IMS
> >>>>>>> is "queue
> >>>>> memory" not some hw register like MSI or MSI-X.
> >>>>>> you know the device always need to know a address and the data to
> >>>>>> send a MSI, right?
> >>>>> So if virtio is to use IMS then we'll need to add interfaces to
> >>>>> program IMS, I think. As part of that patch - it's reasonable to
> >>>>> assume - we will also need to add a way to retrieve IMS so it can
> >>>>> be
> >>> migrated.
> >>>>> However, what this example demonstrates is that the approach taken
> >>>>> by this proposal to migrate control path structures - namely, by
> >>>>> defining a structure used just for migration - means that we will
> >>>>> need to come up with a migration interface each time.
> >>>>> And that is unfortunate.
> >>>>>
> >>>> When the device supports a new feature it has supported new
> functionality.
> >>>> Hence the live migration side also got updated.
> >>>> However, the live migration driver does not have to understand what
> >>>> is inside
> >>> the control path structures.
> >>>> It is just byte stream.
> >>>> Only if the hypervisor live migration drive involved in emulating,
> >>>> it will parse
> >>> and that is fine as like other control structures.
> >>>
> >>> The point is that any new field needs to be added in two places now
> >>> and that is not great at all.
> >>>
> >> Most control structs are well defined. So only its type field is added to
> migrating driver side.
> >> This is very low overhead field and handled in generic way for all device
> types and for all common types.
> > Weird, not what I see.  E.g. you seem to have a structure duplicating
> > queue fields. Each new field will have to be added there in addition
> > to the transport.
> >
> >>> We need a stronger compatiblity story here I think.
> >>>
> >>> One way to show how it's designed to work would be to split the
> >>> patches. For example, add queue notify data and queue reset separately.
> >> I didn't follow the suggestion. Can you explain splitting patches and its
> relation to the structure?
> >>
> >>> Another is to add MSIX table migration option for when MSIX table is
> >>> passed through to guest.
> >> Yes, this will be added in future when there is actual hypervisor for it.
> > You are tying the architecture to an extremely implementation specific
> > detail.
> >
> > Hypervisors *already* have migrate the MSIX table. Just in a
> > hypervisor specific way. queue vector is an index into this table. So
> > the index is migrated through the device but the table itself has to
> > be trapped and emulated by hypervisor? Give me a break.
> I agree, the MSI table could be R/W anyway.
Please explain the motivation: why can it not be added when there is _real_ software that will use it?
I asked to add it when it is needed.
Why is it a must right now, if it is?
If there is any software that would like to use it, please explain which one it is and how it will use it.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  6:50                                                                                                 ` Parav Pandit
@ 2023-11-01  6:56                                                                                                   ` Zhu, Lingshan
  2023-11-01  7:03                                                                                                     ` Parav Pandit
  2023-11-01  8:36                                                                                                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-11-01  6:56 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/1/2023 2:50 PM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Wednesday, November 1, 2023 12:09 PM
>>
>> On 11/1/2023 2:37 PM, Michael S. Tsirkin wrote:
>>> On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>> Sent: Wednesday, November 1, 2023 11:01 AM
>>>>>
>>>>> On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Sent: Tuesday, October 31, 2023 3:44 PM
>>>>>>>
>>>>>>> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
>>>>>>>>> Your answer is not relevant to this discussion at all.
>>>>>>>>> Why?
>>>>>>>>> Because we were discussing the schemes where registers are not used.
>>>>>>>>> One example of that was IMS. It does not matter MSI or MSIX.
>>>>>>>>> As explained in Intel's commit message, the key to focus for IMS
>>>>>>>>> is "queue
>>>>>>> memory" not some hw register like MSI or MSI-X.
>>>>>>>> you know the device always need to know a address and the data to
>>>>>>>> send a MSI, right?
>>>>>>> So if virtio is to use IMS then we'll need to add interfaces to
>>>>>>> program IMS, I think. As part of that patch - it's reasonable to
>>>>>>> assume - we will also need to add a way to retrieve IMS so it can
>>>>>>> be
>>>>> migrated.
>>>>>>> However, what this example demonstrates is that the approach taken
>>>>>>> by this proposal to migrate control path structures - namely, by
>>>>>>> defining a structure used just for migration - means that we will
>>>>>>> need to come up with a migration interface each time.
>>>>>>> And that is unfortunate.
>>>>>>>
>>>>>> When the device supports a new feature it has supported new
>> functionality.
>>>>>> Hence the live migration side also got updated.
>>>>>> However, the live migration driver does not have to understand what
>>>>>> is inside
>>>>> the control path structures.
>>>>>> It is just byte stream.
>>>>>> Only if the hypervisor live migration drive involved in emulating,
>>>>>> it will parse
>>>>> and that is fine as like other control structures.
>>>>>
>>>>> The point is that any new field needs to be added in two places now
>>>>> and that is not great at all.
>>>>>
>>>> Most control structs are well defined. So only its type field is added to
>> migrating driver side.
>>>> This is very low overhead field and handled in generic way for all device
>> types and for all common types.
>>> Weird, not what I see.  E.g. you seem to have a structure duplicating
>>> queue fields. Each new field will have to be added there in addition
>>> to the transport.
>>>
>>>>> We need a stronger compatiblity story here I think.
>>>>>
>>>>> One way to show how it's designed to work would be to split the
>>>>> patches. For example, add queue notify data and queue reset separately.
>>>> I didn't follow the suggestion. Can you explain splitting patches and its
>> relation to the structure?
>>>>> Another is to add MSIX table migration option for when MSIX table is
>>>>> passed through to guest.
>>>> Yes, this will be added in future when there is actual hypervisor for it.
>>> You are tying the architecture to an extremely implementation specific
>>> detail.
>>>
>>> Hypervisors *already* have migrate the MSIX table. Just in a
>>> hypervisor specific way. queue vector is an index into this table. So
>>> the index is migrated through the device but the table itself has to
>>> be trapped and emulated by hypervisor? Give me a break.
>> I agree, the MSI table could be R/W anyway.
> Please explain the motivation, why it cannot be added when there is _real_ sw which will use it?
> I asked to add it when it is needed.
> Why it is must right now, if it is.
> If there is any software like to use it, please explain which is it and how will it use it.
As MST has already pointed out, hypervisors already migrate the MSI table.
So I suggest you read the QEMU code to find the answer.

We don't want another very long thread in which nobody develops their
knowledge.

And if this is a vendor-specific issue, then it should not be relevant to
the spec.

<EOM>
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  6:56                                                                                                   ` Zhu, Lingshan
@ 2023-11-01  7:03                                                                                                     ` Parav Pandit
  2023-11-01  7:46                                                                                                       ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-01  7:03 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, November 1, 2023 12:26 PM
> 
> On 11/1/2023 2:50 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Wednesday, November 1, 2023 12:09 PM
> >>
> >> On 11/1/2023 2:37 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
> >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>> Sent: Wednesday, November 1, 2023 11:01 AM
> >>>>>
> >>>>> On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> >>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>> Sent: Tuesday, October 31, 2023 3:44 PM
> >>>>>>>
> >>>>>>> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> >>>>>>>>> Your answer is not relevant to this discussion at all.
> >>>>>>>>> Why?
> >>>>>>>>> Because we were discussing the schemes where registers are not
> used.
> >>>>>>>>> One example of that was IMS. It does not matter MSI or MSIX.
> >>>>>>>>> As explained in Intel's commit message, the key to focus for
> >>>>>>>>> IMS is "queue
> >>>>>>> memory" not some hw register like MSI or MSI-X.
> >>>>>>>> you know the device always need to know a address and the data
> >>>>>>>> to send a MSI, right?
> >>>>>>> So if virtio is to use IMS then we'll need to add interfaces to
> >>>>>>> program IMS, I think. As part of that patch - it's reasonable to
> >>>>>>> assume - we will also need to add a way to retrieve IMS so it
> >>>>>>> can be
> >>>>> migrated.
> >>>>>>> However, what this example demonstrates is that the approach
> >>>>>>> taken by this proposal to migrate control path structures -
> >>>>>>> namely, by defining a structure used just for migration - means
> >>>>>>> that we will need to come up with a migration interface each time.
> >>>>>>> And that is unfortunate.
> >>>>>>>
> >>>>>> When the device supports a new feature it has supported new
> >> functionality.
> >>>>>> Hence the live migration side also got updated.
> >>>>>> However, the live migration driver does not have to understand
> >>>>>> what is inside
> >>>>> the control path structures.
> >>>>>> It is just byte stream.
> >>>>>> Only if the hypervisor live migration drive involved in
> >>>>>> emulating, it will parse
> >>>>> and that is fine as like other control structures.
> >>>>>
> >>>>> The point is that any new field needs to be added in two places
> >>>>> now and that is not great at all.
> >>>>>
> >>>> Most control structs are well defined. So only its type field is
> >>>> added to
> >> migrating driver side.
> >>>> This is very low overhead field and handled in generic way for all
> >>>> device
> >> types and for all common types.
> >>> Weird, not what I see.  E.g. you seem to have a structure
> >>> duplicating queue fields. Each new field will have to be added there
> >>> in addition to the transport.
> >>>
> >>>>> We need a stronger compatiblity story here I think.
> >>>>>
> >>>>> One way to show how it's designed to work would be to split the
> >>>>> patches. For example, add queue notify data and queue reset separately.
> >>>> I didn't follow the suggestion. Can you explain splitting patches
> >>>> and its
> >> relation to the structure?
> >>>>> Another is to add MSIX table migration option for when MSIX table
> >>>>> is passed through to guest.
> >>>> Yes, this will be added in future when there is actual hypervisor for it.
> >>> You are tying the architecture to an extremely implementation
> >>> specific detail.
> >>>
> >>> Hypervisors *already* have migrate the MSIX table. Just in a
> >>> hypervisor specific way. queue vector is an index into this table.
> >>> So the index is migrated through the device but the table itself has
> >>> to be trapped and emulated by hypervisor? Give me a break.
> >> I agree, the MSI table could be R/W anyway.
> > Please explain the motivation, why it cannot be added when there is _real_ sw
> which will use it?
> > I asked to add it when it is needed.
> > Why it is must right now, if it is.
> > If there is any software like to use it, please explain which is it and how will it
> use it.
> as MST has ever pointed out, hypervisors already have migrate MSI table.
> So I suggest you to read QEMU code to find the answer.
Huh, this is not the answer.

Michael asked: "add MSIX table migration option for when MSIX table is passed through to guest."
Currently it is not passed through. When a hypervisor adds that in the future, it can be added. What prevents this addition in the future?

I asked a very simple question: explain the use case and the hypervisor that wants to transfer the MSI-X table by using the device context.
You didn’t answer it...
I assume there is no user software for it.

If there is one, please share it and it should be added.

> 
> We don't want another long long thread but nobody develop their knowledge
> there.
> 
Exactly. So please explain the use case: which software will use it? I don’t see any hypervisor using it today _from_ the device context.

> And if this is a vendor specific issue, then should not be relevant to the spec.
What vendor-specific issue are you talking about? Are you somehow _implying_ that not transferring the MSI-X table is a limitation of the proposer's vendor?
Hell no.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  7:03                                                                                                     ` Parav Pandit
@ 2023-11-01  7:46                                                                                                       ` Zhu, Lingshan
  2023-11-01  7:54                                                                                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-11-01  7:46 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/1/2023 3:03 PM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Wednesday, November 1, 2023 12:26 PM
>>
>> On 11/1/2023 2:50 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Wednesday, November 1, 2023 12:09 PM
>>>>
>>>> On 11/1/2023 2:37 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Sent: Wednesday, November 1, 2023 11:01 AM
>>>>>>>
>>>>>>> On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>> Sent: Tuesday, October 31, 2023 3:44 PM
>>>>>>>>>
>>>>>>>>> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
>>>>>>>>>>> Your answer is not relevant to this discussion at all.
>>>>>>>>>>> Why?
>>>>>>>>>>> Because we were discussing the schemes where registers are not
>> used.
>>>>>>>>>>> One example of that was IMS. It does not matter MSI or MSIX.
>>>>>>>>>>> As explained in Intel's commit message, the key to focus for
>>>>>>>>>>> IMS is "queue
>>>>>>>>> memory" not some hw register like MSI or MSI-X.
>>>>>>>>>> you know the device always need to know a address and the data
>>>>>>>>>> to send a MSI, right?
>>>>>>>>> So if virtio is to use IMS then we'll need to add interfaces to
>>>>>>>>> program IMS, I think. As part of that patch - it's reasonable to
>>>>>>>>> assume - we will also need to add a way to retrieve IMS so it
>>>>>>>>> can be
>>>>>>> migrated.
>>>>>>>>> However, what this example demonstrates is that the approach
>>>>>>>>> taken by this proposal to migrate control path structures -
>>>>>>>>> namely, by defining a structure used just for migration - means
>>>>>>>>> that we will need to come up with a migration interface each time.
>>>>>>>>> And that is unfortunate.
>>>>>>>>>
>>>>>>>> When the device supports a new feature it has supported new
>>>> functionality.
>>>>>>>> Hence the live migration side also got updated.
>>>>>>>> However, the live migration driver does not have to understand
>>>>>>>> what is inside
>>>>>>> the control path structures.
>>>>>>>> It is just byte stream.
>>>>>>>> Only if the hypervisor live migration drive involved in
>>>>>>>> emulating, it will parse
>>>>>>> and that is fine as like other control structures.
>>>>>>>
>>>>>>> The point is that any new field needs to be added in two places
>>>>>>> now and that is not great at all.
>>>>>>>
>>>>>> Most control structs are well defined. So only its type field is
>>>>>> added to
>>>> migrating driver side.
>>>>>> This is very low overhead field and handled in generic way for all
>>>>>> device
>>>> types and for all common types.
>>>>> Weird, not what I see.  E.g. you seem to have a structure
>>>>> duplicating queue fields. Each new field will have to be added there
>>>>> in addition to the transport.
>>>>>
>>>>>>> We need a stronger compatiblity story here I think.
>>>>>>>
>>>>>>> One way to show how it's designed to work would be to split the
>>>>>>> patches. For example, add queue notify data and queue reset separately.
>>>>>> I didn't follow the suggestion. Can you explain splitting patches
>>>>>> and its
>>>> relation to the structure?
>>>>>>> Another is to add MSIX table migration option for when MSIX table
>>>>>>> is passed through to guest.
>>>>>> Yes, this will be added in future when there is actual hypervisor for it.
>>>>> You are tying the architecture to an extremely implementation
>>>>> specific detail.
>>>>>
>>>>> Hypervisors *already* have migrate the MSIX table. Just in a
>>>>> hypervisor specific way. queue vector is an index into this table.
>>>>> So the index is migrated through the device but the table itself has
>>>>> to be trapped and emulated by hypervisor? Give me a break.
>>>> I agree, the MSI table could be R/W anyway.
>>> Please explain the motivation, why it cannot be added when there is _real_ sw
>> which will use it?
>>> I asked to add it when it is needed.
>>> Why it is must right now, if it is.
>>> If there is any software like to use it, please explain which is it and how will it
>> use it.
>> as MST has ever pointed out, hypervisors already have migrate MSI table.
>> So I suggest you to read QEMU code to find the answer.
> Huh, this is not the answer.
>
> Michael asked - " add MSIX table migration option for when MSIX table is passed through to guest."
> Currently it is not. When in future hypervisor adds it, it can be added. What prevents this addition in future?
>
> I asked very simple question to explain the use case and hypervisor who wants to transfer MSIX table by using device context?
> You don’t answer it...
> I assume there is no user software of it.
>
> If there is one, please share and it should be added.
I can give you some hints:
when the VM freezes in the "stop_window" of live migration, the
hypervisor owns the device, and it can access the
MSI table of the device. So I don't see MSI configuration blocking live
migration.
>
>> We don't want another long long thread but nobody develop their knowledge
>> there.
>>
> Exactly. So explain use case of which software will use it? I don’t see any hypervisor using it today _from_ the device context.
see above
>
>> And if this is a vendor specific issue, then should not be relevant to the spec.
> What vendor specific issue are you talking about? By some means are you _implying_ not transferring msix table is proposers vendors limitation?
> Hell no.
The hypervisor transfers the MSI configuration to the destination as explained above, and this
routine works; QEMU is a reference.
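
As a rough illustration of the hint above, the following C sketch copies an MSI-X table out of mapped table memory while the VM is stopped and writes it back on the other side. It is a sketch under assumptions: the function names are hypothetical, the table is modeled as plain 32-bit words, and nothing here is taken from QEMU or from the patch series. A hypervisor that emulates the table may well migrate its emulated copy instead.

```c
#include <stddef.h>
#include <stdint.h>

#define MSIX_ENTRY_DWORDS 4  /* addr lo, addr hi, data, vector control */

/* Hypothetical: copy `entries` MSI-X entries out of mapped table memory. */
void msix_table_save(const volatile uint32_t *table, size_t entries,
                     uint32_t *saved)
{
    for (size_t i = 0; i < entries * MSIX_ENTRY_DWORDS; i++)
        saved[i] = table[i];   /* one 32-bit read per dword */
}

/* Hypothetical: write previously saved entries back into table memory. */
void msix_table_load(volatile uint32_t *table, size_t entries,
                     const uint32_t *saved)
{
    for (size_t i = 0; i < entries * MSIX_ENTRY_DWORDS; i++)
        table[i] = saved[i];   /* one 32-bit write per dword */
}
```

The saved buffer is what would travel with the rest of the migration stream in this sketch; whether it travels through a hypervisor-specific path or through the device context is the question being argued in this thread.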




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  7:46                                                                                                       ` Zhu, Lingshan
@ 2023-11-01  7:54                                                                                                         ` Parav Pandit
  2023-11-01  8:55                                                                                                           ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-01  7:54 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, November 1, 2023 1:17 PM
> 
> On 11/1/2023 3:03 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Wednesday, November 1, 2023 12:26 PM
> >>
> >> On 11/1/2023 2:50 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Wednesday, November 1, 2023 12:09 PM
> >>>>
> >>>> On 11/1/2023 2:37 PM, Michael S. Tsirkin wrote:
> >>>>> On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
> >>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>> Sent: Wednesday, November 1, 2023 11:01 AM
> >>>>>>>
> >>>>>>> On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> >>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>>>>> Sent: Tuesday, October 31, 2023 3:44 PM
> >>>>>>>>>
> >>>>>>>>> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> >>>>>>>>>>> Your answer is not relevant to this discussion at all.
> >>>>>>>>>>> Why?
> >>>>>>>>>>> Because we were discussing the schemes where registers are
> >>>>>>>>>>> not
> >> used.
> >>>>>>>>>>> One example of that was IMS. It does not matter MSI or MSIX.
> >>>>>>>>>>> As explained in Intel's commit message, the key to focus for
> >>>>>>>>>>> IMS is "queue
> >>>>>>>>> memory" not some hw register like MSI or MSI-X.
> >>>>>>>>>> you know the device always need to know a address and the
> >>>>>>>>>> data to send a MSI, right?
> >>>>>>>>> So if virtio is to use IMS then we'll need to add interfaces
> >>>>>>>>> to program IMS, I think. As part of that patch - it's
> >>>>>>>>> reasonable to assume - we will also need to add a way to
> >>>>>>>>> retrieve IMS so it can be
> >>>>>>> migrated.
> >>>>>>>>> However, what this example demonstrates is that the approach
> >>>>>>>>> taken by this proposal to migrate control path structures -
> >>>>>>>>> namely, by defining a structure used just for migration -
> >>>>>>>>> means that we will need to come up with a migration interface
> each time.
> >>>>>>>>> And that is unfortunate.
> >>>>>>>>>
> >>>>>>>> When the device supports a new feature it has supported new
> >>>> functionality.
> >>>>>>>> Hence the live migration side also got updated.
> >>>>>>>> However, the live migration driver does not have to understand
> >>>>>>>> what is inside
> >>>>>>> the control path structures.
> >>>>>>>> It is just byte stream.
> >>>>>>>> Only if the hypervisor live migration drive involved in
> >>>>>>>> emulating, it will parse
> >>>>>>> and that is fine as like other control structures.
> >>>>>>>
> >>>>>>> The point is that any new field needs to be added in two places
> >>>>>>> now and that is not great at all.
> >>>>>>>
> >>>>>> Most control structs are well defined. So only its type field is
> >>>>>> added to
> >>>> migrating driver side.
> >>>>>> This is very low overhead field and handled in generic way for
> >>>>>> all device
> >>>> types and for all common types.
> >>>>> Weird, not what I see.  E.g. you seem to have a structure
> >>>>> duplicating queue fields. Each new field will have to be added
> >>>>> there in addition to the transport.
> >>>>>
> >>>>>>> We need a stronger compatiblity story here I think.
> >>>>>>>
> >>>>>>> One way to show how it's designed to work would be to split the
> >>>>>>> patches. For example, add queue notify data and queue reset
> separately.
> >>>>>> I didn't follow the suggestion. Can you explain splitting patches
> >>>>>> and its
> >>>> relation to the structure?
> >>>>>>> Another is to add MSIX table migration option for when MSIX
> >>>>>>> table is passed through to guest.
> >>>>>> Yes, this will be added in future when there is actual hypervisor for it.
> >>>>> You are tying the architecture to an extremely implementation
> >>>>> specific detail.
> >>>>>
> >>>>> Hypervisors *already* have migrate the MSIX table. Just in a
> >>>>> hypervisor specific way. queue vector is an index into this table.
> >>>>> So the index is migrated through the device but the table itself
> >>>>> has to be trapped and emulated by hypervisor? Give me a break.
> >>>> I agree, the MSI table could be R/W anyway.
> >>> Please explain the motivation, why it cannot be added when there is
> >>> _real_ sw
> >> which will use it?
> >>> I asked to add it when it is needed.
> >>> Why it is must right now, if it is.
> >>> If there is any software like to use it, please explain which is it
> >>> and how will it
> >> use it.
> >> as MST has ever pointed out, hypervisors already have migrate MSI table.
> >> So I suggest you to read QEMU code to find the answer.
> > Huh, this is not the answer.
> >
> > Michael asked - " add MSIX table migration option for when MSIX table is
> passed through to guest."
> > Currently it is not. When in future hypervisor adds it, it can be added. What
> prevents this addition in future?
> >
> > I asked very simple question to explain the use case and hypervisor who
> wants to transfer MSIX table by using device context?
> > You don’t answer it...
> > I assume there is no user software of it.
> >
> > If there is one, please share and it should be added.
> I can give you some hints:
> when the VM freeze in the "stop_window" of live migration, the hypervisor
> owns the device, and it can access the MSI table of the device. So I don't see
> MSI configurations blocking live migration.
> >
> >> We don't want another long long thread but nobody develop their
> >> knowledge there.
> >>
> > Exactly. So explain use case of which software will use it? I don’t see any
> hypervisor using it today _from_ the device context.
> see above
> >
> >> And if this is a vendor specific issue, then should not be relevant to the spec.
> > What vendor specific issue are you talking about? By some means are you
> _implying_ not transferring msix table is proposers vendors limitation?
> > Hell no.
> Hypervisor transfer MSI to the destination as explained above, and this routine
> works, an ref is QEMU
I have a hard time following your suggestion.
So do you want the MSI-X table as part of the device context or not?

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  6:47                                                                                               ` Parav Pandit
@ 2023-11-01  8:28                                                                                                 ` Michael S. Tsirkin
  2023-11-01  8:49                                                                                                   ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-01  8:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 01, 2023 at 06:47:57AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Wednesday, November 1, 2023 12:07 PM
> > 
> > On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Wednesday, November 1, 2023 11:01 AM
> > > >
> > > > On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Tuesday, October 31, 2023 3:44 PM
> > > > > >
> > > > > > On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > > > > > > > Your answer is not relevant to this discussion at all.
> > > > > > > > Why?
> > > > > > > > Because we were discussing the schemes where registers are not
> > used.
> > > > > > > > One example of that was IMS. It does not matter MSI or MSIX.
> > > > > > > > As explained in Intel's commit message, the key to focus for
> > > > > > > > IMS is "queue
> > > > > > memory" not some hw register like MSI or MSI-X.
> > > > > > > you know the device always need to know a address and the data
> > > > > > > to send a MSI, right?
> > > > > >
> > > > > > So if virtio is to use IMS then we'll need to add interfaces to
> > > > > > program IMS, I think. As part of that patch - it's reasonable to
> > > > > > assume - we will also need to add a way to retrieve IMS so it
> > > > > > can be
> > > > migrated.
> > > > > >
> > > > > > However, what this example demonstrates is that the approach
> > > > > > taken by this proposal to migrate control path structures -
> > > > > > namely, by defining a structure used just for migration - means
> > > > > > that we will need to come up with a migration interface each time.
> > > > > > And that is unfortunate.
> > > > > >
> > > > > When the device supports a new feature it has supported new
> > functionality.
> > > > > Hence the live migration side also got updated.
> > > > > However, the live migration driver does not have to understand
> > > > > what is inside
> > > > the control path structures.
> > > > > It is just byte stream.
> > > > > Only if the hypervisor live migration drive involved in emulating,
> > > > > it will parse
> > > > and that is fine as like other control structures.
> > > >
> > > > The point is that any new field needs to be added in two places now
> > > > and that is not great at all.
> > > >
> > > Most control structs are well defined. So only its type field is added to
> > migrating driver side.
> > > This is very low overhead field and handled in generic way for all device types
> > and for all common types.
> > 
> > Weird, not what I see.  E.g. you seem to have a structure duplicating queue
> > fields. Each new field will have to be added there in addition to the transport.
> > 
> Didn't follow.
> Which structure is duplicated?
> PCI does not even have a structure for q configuration.

So this:

+struct virtio_dev_ctx_pci_vq_cfg {
+        le16 vq_index;
+        le16 queue_size;
+        le16 queue_msix_vector;
+        le64 queue_desc;
+        le64 queue_driver;
+        le64 queue_device;
+};


duplicates a bunch of fields from this:

struct virtio_pci_common_cfg {
        /* About the whole device. */
        __le32 device_feature_select;   /* read-write */
        __le32 device_feature;          /* read-only */
        __le32 guest_feature_select;    /* read-write */
        __le32 guest_feature;           /* read-write */
        __le16 msix_config;             /* read-write */
        __le16 num_queues;              /* read-only */
        __u8 device_status;             /* read-write */
        __u8 config_generation;         /* read-only */

        /* About a specific virtqueue. */
        __le16 queue_select;            /* read-write */
        __le16 queue_size;              /* read-write, power of 2. */
        __le16 queue_msix_vector;       /* read-write */
        __le16 queue_enable;            /* read-write */
        __le16 queue_notify_off;        /* read-only */
        __le32 queue_desc_lo;           /* read-write */
        __le32 queue_desc_hi;           /* read-write */
        __le32 queue_avail_lo;          /* read-write */
        __le32 queue_avail_hi;          /* read-write */
        __le32 queue_used_lo;           /* read-write */
        __le32 queue_used_hi;           /* read-write */
};


Except it's incomplete and I suspect that's actually a bug.


Here's an idea: have a record per field. Use transport offsets
as tags.
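
To make the idea concrete, here is a rough sketch of what per-field
records could look like. It is only an illustration: the record layout
(le16 tag, le16 len, little-endian value) and the names put_record,
get_record and common_cfg_layout are made up for this example and are
not part of the patches. The tag is simply the offset of the field in
the common configuration structure above. Per-queue fields would also
need a queue index somewhere, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout of the PCI common configuration structure quoted above,
 * written with fixed-width types so offsetof() yields the transport offsets. */
struct common_cfg_layout {
    uint32_t device_feature_select;
    uint32_t device_feature;
    uint32_t guest_feature_select;
    uint32_t guest_feature;
    uint16_t msix_config;
    uint16_t num_queues;
    uint8_t  device_status;
    uint8_t  config_generation;
    uint16_t queue_select;
    uint16_t queue_size;
    uint16_t queue_msix_vector;
    uint16_t queue_enable;
    uint16_t queue_notify_off;
    uint32_t queue_desc_lo;
    uint32_t queue_desc_hi;
    uint32_t queue_avail_lo;
    uint32_t queue_avail_hi;
    uint32_t queue_used_lo;
    uint32_t queue_used_hi;
};

/* Hypothetical record: le16 tag (transport offset), le16 len, then
 * len bytes of little-endian value. Not part of the proposed patches. */
#define REC_HDR_LEN 4

/* Append one record at buf[pos]; returns the new write position,
 * or 0 if the record does not fit. */
static size_t put_record(uint8_t *buf, size_t cap, size_t pos,
                         uint16_t tag, uint16_t len, uint64_t value)
{
    assert(len == 1 || len == 2 || len == 4 || len == 8);
    if (pos > cap || cap - pos < (size_t)REC_HDR_LEN + len)
        return 0;
    buf[pos + 0] = (uint8_t)(tag & 0xff);
    buf[pos + 1] = (uint8_t)(tag >> 8);
    buf[pos + 2] = (uint8_t)(len & 0xff);
    buf[pos + 3] = (uint8_t)(len >> 8);
    for (uint16_t i = 0; i < len; i++)
        buf[pos + REC_HDR_LEN + i] = (uint8_t)(value >> (8 * i));
    return pos + REC_HDR_LEN + len;
}

/* Look up a record by tag, skipping records with other (possibly
 * unknown) tags. Returns 0 and stores the value on success, -1 otherwise. */
static int get_record(const uint8_t *buf, size_t size,
                      uint16_t tag, uint64_t *value)
{
    size_t pos = 0;

    while (size - pos >= REC_HDR_LEN) {
        uint16_t t = (uint16_t)(buf[pos] | (buf[pos + 1] << 8));
        uint16_t len = (uint16_t)(buf[pos + 2] | (buf[pos + 3] << 8));

        if (size - pos - REC_HDR_LEN < len)
            return -1;              /* truncated record */
        if (t == tag) {
            if (len > 8)
                return -1;          /* value does not fit in 64 bits */
            *value = 0;
            for (uint16_t i = 0; i < len; i++)
                *value |= (uint64_t)buf[pos + REC_HDR_LEN + i] << (8 * i);
            return 0;
        }
        pos += (size_t)REC_HDR_LEN + len;
    }
    return -1;
}
```

With a format like this, a new transport field gets its tag for free
(its offset), and a reader that does not know a tag can skip the record
by its length instead of needing a new structure definition.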




> > > > We need a stronger compatiblity story here I think.
> > > >
> > > > One way to show how it's designed to work would be to split the
> > > > patches. For example, add queue notify data and queue reset separately.
> > > I didn't follow the suggestion. Can you explain splitting patches and its relation
> > to the structure?
> Did I miss your response? Or it is in below msix?

exactly.

> > >
> > > >
> > > > Another is to add MSIX table migration option for when MSIX table is
> > > > passed through to guest.
> > > Yes, this will be added in future when there is actual hypervisor for it.
> > 
> > You are tying the architecture to an extremely implementation specific detail.
> >
> Which part?
>  
> Not really. can you please which software will use MSI-X table migration?

Like, all and any? All hypervisors migrate the msi-x table.

> > Hypervisors *already* have migrate the MSIX table. Just in a hypervisor specific
> > way. 
> MSI-X table is in the PCI BAR memory.

Exactly. And that means hypervisor should not read it from the device
directly - e.g. with an encrypted device it won't be able to.

> > queue vector is an index into this table. So the index is migrated through
> > the device but the table itself has to be trapped and emulated by hypervisor?
> Do you have a hypervisor and a platform that has not done MSI-X table emulation for which you are asking to add?
> I don't know any.
> I am asking to add it when there is a _real_ user of it. Why it cannot be added when the _real_ user arrive?

Real meaning actual hardware and software implementing it?  By this
definition there's no real user for migration in the spec at all - all
of it can be done by device specific means. What's your point? By now we
have a reasonably good idea what hypervisors need to make migration
portable. Let's either put all of it in the spec or not bother at all.

In other words, we need a bright line. I suggest a simple one: memory is
migrated by device, config space by hypervisor.  If not, suggest another
one - but it needs to be a reasonable rule based on hardware, not on
whatever software found expedient to use.


-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  6:50                                                                                                 ` Parav Pandit
  2023-11-01  6:56                                                                                                   ` Zhu, Lingshan
@ 2023-11-01  8:36                                                                                                   ` Michael S. Tsirkin
  2023-11-01 10:24                                                                                                     ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-01  8:36 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 01, 2023 at 06:50:02AM +0000, Parav Pandit wrote:
> 
> 
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Wednesday, November 1, 2023 12:09 PM
> > 
> > On 11/1/2023 2:37 PM, Michael S. Tsirkin wrote:
> > > On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
> > >>> From: Michael S. Tsirkin <mst@redhat.com>
> > >>> Sent: Wednesday, November 1, 2023 11:01 AM
> > >>>
> > >>> On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> > >>>>
> > >>>>> From: Michael S. Tsirkin <mst@redhat.com>
> > >>>>> Sent: Tuesday, October 31, 2023 3:44 PM
> > >>>>>
> > >>>>> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > >>>>>>> Your answer is not relevant to this discussion at all.
> > >>>>>>> Why?
> > >>>>>>> Because we were discussing the schemes where registers are not used.
> > >>>>>>> One example of that was IMS. It does not matter MSI or MSIX.
> > >>>>>>> As explained in Intel's commit message, the key to focus for IMS
> > >>>>>>> is "queue
> > >>>>> memory" not some hw register like MSI or MSI-X.
> > >>>>>> you know the device always need to know a address and the data to
> > >>>>>> send a MSI, right?
> > >>>>> So if virtio is to use IMS then we'll need to add interfaces to
> > >>>>> program IMS, I think. As part of that patch - it's reasonable to
> > >>>>> assume - we will also need to add a way to retrieve IMS so it can
> > >>>>> be
> > >>> migrated.
> > >>>>> However, what this example demonstrates is that the approach taken
> > >>>>> by this proposal to migrate control path structures - namely, by
> > >>>>> defining a structure used just for migration - means that we will
> > >>>>> need to come up with a migration interface each time.
> > >>>>> And that is unfortunate.
> > >>>>>
> > >>>> When the device supports a new feature it has supported new
> > functionality.
> > >>>> Hence the live migration side also got updated.
> > >>>> However, the live migration driver does not have to understand what
> > >>>> is inside
> > >>> the control path structures.
> > >>>> It is just byte stream.
> > >>>> Only if the hypervisor live migration drive involved in emulating,
> > >>>> it will parse
> > >>> and that is fine as like other control structures.
> > >>>
> > >>> The point is that any new field needs to be added in two places now
> > >>> and that is not great at all.
> > >>>
> > >> Most control structs are well defined. So only its type field is added to
> > migrating driver side.
> > >> This is very low overhead field and handled in generic way for all device
> > types and for all common types.
> > > Weird, not what I see.  E.g. you seem to have a structure duplicating
> > > queue fields. Each new field will have to be added there in addition
> > > to the transport.
> > >
> > >>> We need a stronger compatiblity story here I think.
> > >>>
> > >>> One way to show how it's designed to work would be to split the
> > >>> patches. For example, add queue notify data and queue reset separately.
> > >> I didn't follow the suggestion. Can you explain splitting patches and its
> > relation to the structure?
> > >>
> > >>> Another is to add MSIX table migration option for when MSIX table is
> > >>> passed through to guest.
> > >> Yes, this will be added in future when there is actual hypervisor for it.
> > > You are tying the architecture to an extremely implementation specific
> > > detail.
> > >
> > > Hypervisors *already* have migrate the MSIX table. Just in a
> > > hypervisor specific way. queue vector is an index into this table. So
> > > the index is migrated through the device but the table itself has to
> > > be trapped and emulated by hypervisor? Give me a break.
> > I agree, the MSI table could be R/W anyway.
> Please explain the motivation, why it cannot be added when there is _real_ sw which will use it?
> I asked to add it when it is needed.
> Why it is must right now, if it is.
> If there is any software like to use it, please explain which is it and how will it use it.

My experience shows we need simple, composable and self-contained blocks.
If we are trying to make the hypervisor avoid accessing device memory
(which seems to be one of the key design points behind what you are
building) then we can't have the queue MSI-X vector index migrated in
one way and the vector itself in a completely different way.

For example, one of the advantages of using admin commands
is that they are atomic - so we get a consistent snapshot of the device state.
But this goes out the window if we are referencing a table
that is not part of that same command output.
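
A rough sketch of the consistency property I mean, under assumptions:
struct dev_snapshot, its bounds and snapshot_is_self_contained are
invented for illustration and are not proposed spec text. The only
borrowed facts are the 16-byte MSI-X table entry layout from PCI and
the VIRTIO_MSI_NO_VECTOR value. The point is that every queue vector
index can be resolved against a table captured by the same command.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_MSI_NO_VECTOR 0xffff  /* "no vector" value from the virtio spec */
#define MAX_VECTORS 8                /* arbitrary bounds for this sketch */
#define MAX_QUEUES  8

/* One MSI-X table entry as laid out by PCI: address, data, vector control. */
struct msix_entry {
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
    uint32_t vector_ctrl;
};

/* Hypothetical snapshot returned by a single command: the table and the
 * indices into it are captured together. */
struct dev_snapshot {
    uint16_t num_vectors;
    struct msix_entry table[MAX_VECTORS];
    uint16_t num_queues;
    uint16_t queue_msix_vector[MAX_QUEUES];
};

/* True if every queue vector index is either "no vector" or refers to
 * an entry that is present in this same snapshot. */
static bool snapshot_is_self_contained(const struct dev_snapshot *s)
{
    if (s->num_vectors > MAX_VECTORS || s->num_queues > MAX_QUEUES)
        return false;
    for (uint16_t q = 0; q < s->num_queues; q++) {
        uint16_t v = s->queue_msix_vector[q];

        if (v != VIRTIO_MSI_NO_VECTOR && v >= s->num_vectors)
            return false;
    }
    return true;
}
```

If the table is fetched by some other path at some other time, no check
of this kind can be made on the command output alone.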

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  8:28                                                                                                 ` Michael S. Tsirkin
@ 2023-11-01  8:49                                                                                                   ` Parav Pandit
  2023-11-01  9:06                                                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-01  8:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 1, 2023 1:59 PM
> 
> On Wed, Nov 01, 2023 at 06:47:57AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Wednesday, November 1, 2023 12:07 PM
> > >
> > > On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Wednesday, November 1, 2023 11:01 AM
> > > > >
> > > > > On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Tuesday, October 31, 2023 3:44 PM
> > > > > > >
> > > > > > > On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
> > > > > > > > > Your answer is not relevant to this discussion at all.
> > > > > > > > > Why?
> > > > > > > > > Because we were discussing the schemes where registers
> > > > > > > > > are not
> > > used.
> > > > > > > > > One example of that was IMS. It does not matter MSI or MSIX.
> > > > > > > > > As explained in Intel's commit message, the key to focus
> > > > > > > > > for IMS is "queue
> > > > > > > memory" not some hw register like MSI or MSI-X.
> > > > > > > > you know the device always need to know a address and the
> > > > > > > > data to send a MSI, right?
> > > > > > >
> > > > > > > So if virtio is to use IMS then we'll need to add interfaces
> > > > > > > to program IMS, I think. As part of that patch - it's
> > > > > > > reasonable to assume - we will also need to add a way to
> > > > > > > retrieve IMS so it can be
> > > > > migrated.
> > > > > > >
> > > > > > > However, what this example demonstrates is that the approach
> > > > > > > taken by this proposal to migrate control path structures -
> > > > > > > namely, by defining a structure used just for migration -
> > > > > > > means that we will need to come up with a migration interface each
> time.
> > > > > > > And that is unfortunate.
> > > > > > >
> > > > > > When the device supports a new feature it has supported new
> > > functionality.
> > > > > > Hence the live migration side also got updated.
> > > > > > However, the live migration driver does not have to understand
> > > > > > what is inside
> > > > > the control path structures.
> > > > > > It is just byte stream.
> > > > > > Only if the hypervisor live migration drive involved in
> > > > > > emulating, it will parse
> > > > > and that is fine as like other control structures.
> > > > >
> > > > > The point is that any new field needs to be added in two places
> > > > > now and that is not great at all.
> > > > >
> > > > Most control structs are well defined. So only its type field is
> > > > added to
> > > migrating driver side.
> > > > This is very low overhead field and handled in generic way for all
> > > > device types
> > > and for all common types.
> > >
> > > Weird, not what I see.  E.g. you seem to have a structure
> > > duplicating queue fields. Each new field will have to be added there in
> addition to the transport.
> > >
> > Didn't follow.
> > Which structure is duplicated?
> > PCI does not even have a structure for q configuration.
> 
> So this:
> 
> +struct virtio_dev_ctx_pci_vq_cfg {
> +        le16 vq_index;
> +        le16 queue_size;
> +        le16 queue_msix_vector;
> +        le64 queue_desc;
> +        le64 queue_driver;
> +        le64 queue_device;
> +};
> 
> 
> duplicates a bunch of fields from this:
> 
Not really. The structure above holds each VQ's current configuration, which is not directly visible in the config space.
The structure below is already captured as part of VIRTIO_DEV_CTX_PCI_COMMON_CFG.

> struct virtio_pci_common_cfg {
>         /* About the whole device. */
>         __le32 device_feature_select;   /* read-write */
>         __le32 device_feature;          /* read-only */
>         __le32 guest_feature_select;    /* read-write */
>         __le32 guest_feature;           /* read-write */
>         __le16 msix_config;             /* read-write */
>         __le16 num_queues;              /* read-only */
>         __u8 device_status;             /* read-write */
>         __u8 config_generation;         /* read-only */
> 
>         /* About a specific virtqueue. */
>         __le16 queue_select;            /* read-write */
>         __le16 queue_size;              /* read-write, power of 2. */
>         __le16 queue_msix_vector;       /* read-write */
>         __le16 queue_enable;            /* read-write */
>         __le16 queue_notify_off;        /* read-only */
>         __le32 queue_desc_lo;           /* read-write */
>         __le32 queue_desc_hi;           /* read-write */
>         __le32 queue_avail_lo;          /* read-write */
>         __le32 queue_avail_hi;          /* read-write */
>         __le32 queue_used_lo;           /* read-write */
>         __le32 queue_used_hi;           /* read-write */
> };
> 
> 
> Except it's incomplete and I suspect that's actually a bug.
> 
It is not a bug.
Some information is duplicated. struct virtio_pci_common_cfg above is a snapshot of the registers as they are being updated,
while struct virtio_dev_ctx_pci_vq_cfg captures what is not visible in that config snapshot,
i.e. the queues which are already configured.
In that rare instance some fields are duplicated, but this is an exception and not something to worry too much about.
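
A small sketch of the distinction, under stated assumptions: the common config window only exposes the queue currently chosen by queue_select, so one snapshot of it cannot describe every queue. struct vq_state and capture_vq_cfgs below are hypothetical device-internal names invented for this example, and treating "enabled" as "already configured" is an assumption of the sketch, not spec text.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the fields of struct virtio_dev_ctx_pci_vq_cfg quoted above. */
struct vq_cfg {
    uint16_t vq_index;
    uint16_t queue_size;
    uint16_t queue_msix_vector;
    uint64_t queue_desc;
    uint64_t queue_driver;
    uint64_t queue_device;
};

/* Hypothetical device-internal state kept for each queue. */
struct vq_state {
    uint16_t size;
    uint16_t msix_vector;
    uint16_t enabled;
    uint64_t desc;
    uint64_t driver;
    uint64_t device;
};

/* Emit one record per configured (here: enabled) queue, independent of
 * which queue the driver last selected through queue_select.
 * Returns the number of records written to out. */
static size_t capture_vq_cfgs(const struct vq_state *vqs, uint16_t num_queues,
                              struct vq_cfg *out, size_t cap)
{
    size_t n = 0;

    for (uint16_t i = 0; i < num_queues && n < cap; i++) {
        if (!vqs[i].enabled)
            continue;
        out[n++] = (struct vq_cfg){
            .vq_index = i,
            .queue_size = vqs[i].size,
            .queue_msix_vector = vqs[i].msix_vector,
            .queue_desc = vqs[i].desc,
            .queue_driver = vqs[i].driver,
            .queue_device = vqs[i].device,
        };
    }
    return n;
}
```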

> 
> Here's an idea: have a record per field. Use transport offsets as tags.
> 
There are just too many of them.
They are logically grouped in their native structures, which are already defined.

> 
> 
> 
> > > > > We need a stronger compatiblity story here I think.
> > > > >
> > > > > One way to show how it's designed to work would be to split the
> > > > > patches. For example, add queue notify data and queue reset separately.
> > > > I didn't follow the suggestion. Can you explain splitting patches
> > > > and its relation
> > > to the structure?
> > Did I miss your response? Or it is in below msix?
> 
> exactly.
> 
> > > >
> > > > >
> > > > > Another is to add MSIX table migration option for when MSIX
> > > > > table is passed through to guest.
> > > > Yes, this will be added in future when there is actual hypervisor for it.
> > >
> > > You are tying the architecture to an extremely implementation specific
> detail.
> > >
> > Which part?
> >
> > Not really. can you please which software will use MSI-X table migration?
> 
> Like, all and any? All hypervisors migrate the msi-x table.
> 
> > > Hypervisors *already* have migrate the MSIX table. Just in a
> > > hypervisor specific way.
> > MSI-X table is in the PCI BAR memory.
> 
> Exactly. And that means hypervisor should not read it from the device directly -
> e.g. with an encrypted device it won't be able to.
> 
Yep.

> > > queue vector is an index into this table. So the index is migrated
> > > through the device but the table itself has to be trapped and emulated by
> hypervisor?
> > Do you have a hypervisor and a platform that has not done MSI-X table
> emulation for which you are asking to add?
> > I don't know any.
> > I am asking to add it when there is a _real_ user of it. Why it cannot be added
> when the _real_ user arrive?
> 
> Real meaning actual hardware and software implementing it?  By this definition
> there's no real user for migration in the spec at all - all of it can be done by
> device specific means. 
This is also an option if vendor specific commands are allowed.

> What's your point? By now we have a reasonably good
> idea what hypervisors need to make migration portable. Let's either put all of it
> in the spec or not bother at all.
> 
Maybe I was not clear.
I am saying let's add the MSI-X table incrementally, when a user will find it useful.
For example, let's say we put it in the spec in the 1.4 version, Nov 23. Will devices implement it? Most likely yes.
Will existing software use it in 2024-25? Most likely no, because most platforms have complexity in this area, as CPUs have certain values hard-coded.

So we are asking the device to implement something that is not going to be used in the foreseeable future.

Hence the request: when a hypervisor/CPU vendor asks for it, it will be possible to add it to the device.

> In other words, we need a bright line. I suggest a simple one: memory is
> migrated by device, config space by hypervisor.  If not, suggest another one -
> but it needs a reasonable rule based on a hardware not whatever software
> found expedient to use.
> 
I really like this bright line.
The MSI-X table, like the rest of the memory area holding common config and device config, will become part of this memory area.

I wish we would also draw a bright line between near term and long term, to be practical.

Adding the MSI-X table to the spec is not hard, now or in the future, frankly.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  7:54                                                                                                         ` Parav Pandit
@ 2023-11-01  8:55                                                                                                           ` Zhu, Lingshan
  2023-11-01  9:07                                                                                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-11-01  8:55 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/1/2023 3:54 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Wednesday, November 1, 2023 1:17 PM
>>
>> On 11/1/2023 3:03 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Wednesday, November 1, 2023 12:26 PM
>>>>
>>>> On 11/1/2023 2:50 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Wednesday, November 1, 2023 12:09 PM
>>>>>>
>>>>>> On 11/1/2023 2:37 PM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Nov 01, 2023 at 05:42:56AM +0000, Parav Pandit wrote:
>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>> Sent: Wednesday, November 1, 2023 11:01 AM
>>>>>>>>>
>>>>>>>>> On Wed, Nov 01, 2023 at 02:54:47AM +0000, Parav Pandit wrote:
>>>>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>>>> Sent: Tuesday, October 31, 2023 3:44 PM
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 31, 2023 at 05:42:29PM +0800, Zhu, Lingshan wrote:
>>>>>>>>>>>>> Your answer is not relevant to this discussion at all.
>>>>>>>>>>>>> Why?
>>>>>>>>>>>>> Because we were discussing the schemes where registers are
>>>>>>>>>>>>> not
>>>> used.
>>>>>>>>>>>>> One example of that was IMS. It does not matter MSI or MSIX.
>>>>>>>>>>>>> As explained in Intel's commit message, the key to focus for
>>>>>>>>>>>>> IMS is "queue
>>>>>>>>>>> memory" not some hw register like MSI or MSI-X.
>>>>>>>>>>>> you know the device always need to know a address and the
>>>>>>>>>>>> data to send a MSI, right?
>>>>>>>>>>> So if virtio is to use IMS then we'll need to add interfaces
>>>>>>>>>>> to program IMS, I think. As part of that patch - it's
>>>>>>>>>>> reasonable to assume - we will also need to add a way to
>>>>>>>>>>> retrieve IMS so it can be
>>>>>>>>> migrated.
>>>>>>>>>>> However, what this example demonstrates is that the approach
>>>>>>>>>>> taken by this proposal to migrate control path structures -
>>>>>>>>>>> namely, by defining a structure used just for migration -
>>>>>>>>>>> means that we will need to come up with a migration interface
>> each time.
>>>>>>>>>>> And that is unfortunate.
>>>>>>>>>>>
>>>>>>>>>> When the device supports a new feature it has supported new
>>>>>> functionality.
>>>>>>>>>> Hence the live migration side also got updated.
>>>>>>>>>> However, the live migration driver does not have to understand
>>>>>>>>>> what is inside
>>>>>>>>> the control path structures.
>>>>>>>>>> It is just byte stream.
>>>>>>>>>> Only if the hypervisor live migration drive involved in
>>>>>>>>>> emulating, it will parse
>>>>>>>>> and that is fine as like other control structures.
>>>>>>>>>
>>>>>>>>> The point is that any new field needs to be added in two places
>>>>>>>>> now and that is not great at all.
>>>>>>>>>
>>>>>>>> Most control structs are well defined. So only its type field is
>>>>>>>> added to
>>>>>> migrating driver side.
>>>>>>>> This is very low overhead field and handled in generic way for
>>>>>>>> all device
>>>>>> types and for all common types.
>>>>>>> Weird, not what I see.  E.g. you seem to have a structure
>>>>>>> duplicating queue fields. Each new field will have to be added
>>>>>>> there in addition to the transport.
>>>>>>>
>>>>>>>>> We need a stronger compatiblity story here I think.
>>>>>>>>>
>>>>>>>>> One way to show how it's designed to work would be to split the
>>>>>>>>> patches. For example, add queue notify data and queue reset
>> separately.
>>>>>>>> I didn't follow the suggestion. Can you explain splitting patches
>>>>>>>> and its
>>>>>> relation to the structure?
>>>>>>>>> Another is to add MSIX table migration option for when MSIX
>>>>>>>>> table is passed through to guest.
>>>>>>>> Yes, this will be added in future when there is actual hypervisor for it.
>>>>>>> You are tying the architecture to an extremely implementation
>>>>>>> specific detail.
>>>>>>>
>>>>>>> Hypervisors *already* have migrate the MSIX table. Just in a
>>>>>>> hypervisor specific way. queue vector is an index into this table.
>>>>>>> So the index is migrated through the device but the table itself
>>>>>>> has to be trapped and emulated by hypervisor? Give me a break.
>>>>>> I agree, the MSI table could be R/W anyway.
>>>>> Please explain the motivation, why it cannot be added when there is
>>>>> _real_ sw
>>>> which will use it?
>>>>> I asked to add it when it is needed.
>>>>> Why it is must right now, if it is.
>>>>> If there is any software like to use it, please explain which is it
>>>>> and how will it
>>>> use it.
>>>> as MST has ever pointed out, hypervisors already have migrate MSI table.
>>>> So I suggest you to read QEMU code to find the answer.
>>> Huh, this is not the answer.
>>>
>>> Michael asked - " add MSIX table migration option for when MSIX table is
>> passed through to guest."
>>> Currently it is not. When in future hypervisor adds it, it can be added. What
>> prevents this addition in future?
>>> I asked very simple question to explain the use case and hypervisor who
>> wants to transfer MSIX table by using device context?
>>> You don’t answer it...
>>> I assume there is no user software of it.
>>>
>>> If there is one, please share and it should be added.
>> I can give you some hints:
>> when the VM freeze in the "stop_window" of live migration, the hypervisor
>> owns the device, and it can access the MSI table of the device. So I don't see
>> MSI configurations blocking live migration.
>>>> We don't want another long long thread but nobody develop their
>>>> knowledge there.
>>>>
>>> Exactly. So explain use case of which software will use it? I don’t see any
>> hypervisor using it today _from_ the device context.
>> see above
>>>> And if this is a vendor specific issue, then should not be relevant to the spec.
>>> What vendor specific issue are you talking about? By some means are you
>> _implying_ not transferring msix table is proposers vendors limitation?
>>> Hell no.
>> Hypervisor transfer MSI to the destination as explained above, and this routine
>> works, an ref is QEMU
> I have hard time following your suggestion.
> So do you want MSIX table as part of device context or not?
MSI is part of the device context; I suggest you read the QEMU code for
the answers.

You will also find that this part is already done.

So I don't know why this topic is worth discussing.

If you have questions, I still suggest you read the QEMU code.



This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  8:49                                                                                                   ` Parav Pandit
@ 2023-11-01  9:06                                                                                                     ` Michael S. Tsirkin
  2023-11-01 10:01                                                                                                       ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-01  9:06 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 01, 2023 at 08:49:40AM +0000, Parav Pandit wrote:
> > So this:
> > 
> > +struct virtio_dev_ctx_pci_vq_cfg {
> > +        le16 vq_index;
> > +        le16 queue_size;
> > +        le16 queue_msix_vector;
> > +        le64 queue_desc;
> > +        le64 queue_driver;
> > +        le64 queue_device;
> > +};
> > 
> > 
> > duplicates a bunch of fields from this:
> > 
> Not really. Above is current VQ's configuration not visible in the config space directly.
> Below is already captured as part of VIRTIO_DEV_CTX_PCI_COMMON_CFG.

I really wanted to help you understand what I mean when I say
that the spec effort is duplicated.


> > struct virtio_pci_common_cfg {
> >         /* About the whole device. */
> >         __le32 device_feature_select;   /* read-write */
> >         __le32 device_feature;          /* read-only */
> >         __le32 guest_feature_select;    /* read-write */
> >         __le32 guest_feature;           /* read-write */
> >         __le16 msix_config;             /* read-write */
> >         __le16 num_queues;              /* read-only */
> >         __u8 device_status;             /* read-write */
> >         __u8 config_generation;         /* read-only */
> > 
> >         /* About a specific virtqueue. */
> >         __le16 queue_select;            /* read-write */
> >         __le16 queue_size;              /* read-write, power of 2. */
> >         __le16 queue_msix_vector;       /* read-write */
> >         __le16 queue_enable;            /* read-write */
> >         __le16 queue_notify_off;        /* read-only */
> >         __le32 queue_desc_lo;           /* read-write */
> >         __le32 queue_desc_hi;           /* read-write */
> >         __le32 queue_avail_lo;          /* read-write */
> >         __le32 queue_avail_hi;          /* read-write */
> >         __le32 queue_used_lo;           /* read-write */
> >         __le32 queue_used_hi;           /* read-write */
> > };
> > 
> > 
> > Except it's incomplete and I suspect that's actually a bug.
> > 
> It is not a bug.

For example, queue_enable is not there.

> There is some information duplicated. Above struct virtio_pci_common_cfg is the snapshot of registers being updated.
> While struct virtio_dev_ctx_pci_vq_cfg is capturing what is not visible in above config snapshot.
> i.e. queues which are already configured.
> For at that rare instance there is some duplication of some fields. but this is the exception part not be too much worried about.

Seems more of a rule than an exception - I don't see any fields that are
not also in config space.

> > 
> > Here's an idea: have a record per field. Use transport offsets as tags.
> > 
> There are just too many of them.
> They are logically clubbed in their native structures which are already defined.

Too many for what? If you don't like what I suggested find some way to
avoid duplicating everything please.
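
The "record per field, transport offsets as tags" idea can be sketched as below. This is only an illustration under stated assumptions: the record layout (2-byte tag, 2-byte length, then the value, all little-endian) and the names CFG_TAG and put_u16_record are hypothetical and come from neither the patch nor the spec. The struct is the common cfg layout quoted above, with fixed-width types standing in for __le16/__le32.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as quoted in this thread; uintN_t stands in for __leN. */
struct virtio_pci_common_cfg {
    uint32_t device_feature_select;
    uint32_t device_feature;
    uint32_t guest_feature_select;
    uint32_t guest_feature;
    uint16_t msix_config;
    uint16_t num_queues;
    uint8_t  device_status;
    uint8_t  config_generation;
    uint16_t queue_select;
    uint16_t queue_size;
    uint16_t queue_msix_vector;
    uint16_t queue_enable;
    uint16_t queue_notify_off;
    uint32_t queue_desc_lo;
    uint32_t queue_desc_hi;
    uint32_t queue_avail_lo;
    uint32_t queue_avail_hi;
    uint32_t queue_used_lo;
    uint32_t queue_used_hi;
};

/* Hypothetical: the tag of a field is its offset in the transport struct. */
#define CFG_TAG(field) ((uint16_t)offsetof(struct virtio_pci_common_cfg, field))

/*
 * Append one 16-bit field as a record: tag, length, value (little-endian).
 * Returns the new number of bytes used, or 0 if the record does not fit.
 */
static size_t put_u16_record(uint8_t *buf, size_t cap, size_t used,
                             uint16_t tag, uint16_t value)
{
    const size_t need = 2 + 2 + 2;  /* tag + length + value */

    if (used > cap || cap - used < need)
        return 0;
    buf[used + 0] = (uint8_t)(tag & 0xff);
    buf[used + 1] = (uint8_t)(tag >> 8);
    buf[used + 2] = 2;              /* value length in bytes, low byte */
    buf[used + 3] = 0;              /* value length, high byte */
    buf[used + 4] = (uint8_t)(value & 0xff);
    buf[used + 5] = (uint8_t)(value >> 8);
    return used + need;
}
```

With this shape, a field such as queue_enable needs no second definition: its tag is CFG_TAG(queue_enable), and a per-queue record would additionally have to carry the vq index, which this sketch leaves out.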


> > 
> > 
> > 
> > > > > > We need a stronger compatiblity story here I think.
> > > > > >
> > > > > > One way to show how it's designed to work would be to split the
> > > > > > patches. For example, add queue notify data and queue reset separately.
> > > > > I didn't follow the suggestion. Can you explain splitting patches
> > > > > and its relation
> > > > to the structure?
> > > Did I miss your response? Or it is in below msix?
> > 
> > exactly.
> > 
> > > > >
> > > > > >
> > > > > > Another is to add MSIX table migration option for when MSIX
> > > > > > table is passed through to guest.
> > > > > Yes, this will be added in future when there is actual hypervisor for it.
> > > >
> > > > You are tying the architecture to an extremely implementation specific
> > detail.
> > > >
> > > Which part?
> > >
> > > Not really. can you please which software will use MSI-X table migration?
> > 
> > Like, all and any? All hypervisors migrate the msi-x table.
> > 
> > > > Hypervisors *already* have migrate the MSIX table. Just in a
> > > > hypervisor specific way.
> > > MSI-X table is in the PCI BAR memory.
> > 
> > Exactly. And that means hypervisor should not read it from the device directly -
> > e.g. with an encrypted device it won't be able to.
> > 
> Yep.

And it follows device needs to include 

> > > > queue vector is an index into this table. So the index is migrated
> > > > through the device but the table itself has to be trapped and emulated by
> > hypervisor?
> > > Do you have a hypervisor and a platform that has not done MSI-X table
> > emulation for which you are asking to add?
> > > I don't know any.
> > > I am asking to add it when there is a _real_ user of it. Why it cannot be added
> > when the _real_ user arrive?
> > 
> > Real meaning actual hardware and software implementing it?  By this definition
> > there's no real user for migration in the spec at all - all of it can be done by
> > device specific means. 
> This is also an option if vendor specific commands are allowed.

Once you do that you can just throw all this spec effort out the window,
and do vdpa. I thought the point was to allow control plane in standard
hardware, so vendors can compete on best hardware and software is shared?

> > What's your point? By now we have a reasonably good
> > idea what hypervisors need to make migration portable. Let's either put all of it
> > in the spec or not bother at all.
> > 
> Maybe I was not clear.
> I am saying lets put the msix table when a user will find it useful in incremental manner.
> For example, lets say we put in the spec in 1.4 version Nov 23, will device implement it? mostly yes.
> Will existing software use it in 2024-25? mostly no, because most platform has complexity in this area as cpus have hard coded certain values.
> 
> So we are asking device to implement something that is not going to be used by any forcible future.
> 
> Hence, the request is, when the hypervisor/cpu vendor asks for it, it will be possible to add into the device.

If it is there then I expect hypervisors will use it, yes.

> > In other words, we need a bright line. I suggest a simple one: memory is
> > migrated by device, config space by hypervisor.  If not, suggest another one -
> > but it needs a reasonable rule based on a hardware not whatever software
> > found expedient to use.
> > 
> I really liked this bright line.
> Msix table like rest of the memory area of common config and device config will become part of this memory area.
> 
> I wish we also draw a good bright line between for near term vs long term to be practical.
>
> Adding MSI-X table in the spec is not hard now or in future, frankly.

Show me the patch.

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  8:55                                                                                                           ` Zhu, Lingshan
@ 2023-11-01  9:07                                                                                                             ` Michael S. Tsirkin
  2023-11-01  9:42                                                                                                               ` Zhu, Lingshan
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-01  9:07 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 01, 2023 at 04:55:01PM +0800, Zhu, Lingshan wrote:
> > So do you want MSIX table as part of device context or not?
> MSI is a part of device context as I suggest you to read QEMU code for the
> answers.
> 
> You will also found that part is already done.
> 
> So I don't know why this is a topic worthy discussions.
> 
> If you have questions, I still suggest you read QEMU code.

Frankly at this point I don't know what you are trying to say either.

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  9:07                                                                                                             ` Michael S. Tsirkin
@ 2023-11-01  9:42                                                                                                               ` Zhu, Lingshan
  2023-11-01 10:23                                                                                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-11-01  9:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/1/2023 5:07 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 01, 2023 at 04:55:01PM +0800, Zhu, Lingshan wrote:
>>> So do you want MSIX table as part of device context or not?
>> MSI is a part of device context as I suggest you to read QEMU code for the
>> answers.
>>
>> You will also found that part is already done.
>>
>> So I don't know why this is a topic worthy discussions.
>>
>> If you have questions, I still suggest you read QEMU code.
> Frankly at this point I don't know what you are trying to say either.
MSI table is R/W, so the hypervisor can migrate it now
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  9:06                                                                                                     ` Michael S. Tsirkin
@ 2023-11-01 10:01                                                                                                       ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-01 10:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 1, 2023 2:36 PM
> 
> On Wed, Nov 01, 2023 at 08:49:40AM +0000, Parav Pandit wrote:
> > > So this:
> > >
> > > +struct virtio_dev_ctx_pci_vq_cfg {
> > > +        le16 vq_index;
> > > +        le16 queue_size;
> > > +        le16 queue_msix_vector;
> > > +        le64 queue_desc;
> > > +        le64 queue_driver;
> > > +        le64 queue_device;
> > > +};
> > >
> > >
> > > duplicates a bunch of fields from this:
> > >
> > Not really. Above is current VQ's configuration not visible in the config space
> directly.
> > Below is already captured as part of VIRTIO_DEV_CTX_PCI_COMMON_CFG.
> 
> I really wanted to help you understand what I mean when I say that the spec
> effort is duplicated.
> 
Yes, I will try my best to grasp your ideas and improve it in v4.
> 
> > > struct virtio_pci_common_cfg {
> > >         /* About the whole device. */
> > >         __le32 device_feature_select;   /* read-write */
> > >         __le32 device_feature;          /* read-only */
> > >         __le32 guest_feature_select;    /* read-write */
> > >         __le32 guest_feature;           /* read-write */
> > >         __le16 msix_config;             /* read-write */
> > >         __le16 num_queues;              /* read-only */
> > >         __u8 device_status;             /* read-write */
> > >         __u8 config_generation;         /* read-only */
> > >
> > >         /* About a specific virtqueue. */
> > >         __le16 queue_select;            /* read-write */
> > >         __le16 queue_size;              /* read-write, power of 2. */
> > >         __le16 queue_msix_vector;       /* read-write */
> > >         __le16 queue_enable;            /* read-write */
> > >         __le16 queue_notify_off;        /* read-only */
> > >         __le32 queue_desc_lo;           /* read-write */
> > >         __le32 queue_desc_hi;           /* read-write */
> > >         __le32 queue_avail_lo;          /* read-write */
> > >         __le32 queue_avail_hi;          /* read-write */
> > >         __le32 queue_used_lo;           /* read-write */
> > >         __le32 queue_used_hi;           /* read-write */
> > > };
> > >
> > >
> > > Except it's incomplete and I suspect that's actually a bug.
> > >
> > It is not a bug.
> 
> For example. queue_enable is not there.
> 
The queue enable that you rightly mentioned is present in struct virtio_dev_ctx_vq_split_runtime.
But thinking about it more, I agree to move it to struct virtio_dev_ctx_pci_vq_cfg.
This way packed virtqueues can also use the same struct.
I will move it to virtio_dev_ctx_pci_vq_cfg in v4.
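
A minimal sketch of how the struct could look after that move. The position of queue_enable is an assumption; no v4 patch is quoted in this thread. Fixed-width C types stand in for le16/le64, and join_lo_hi is a hypothetical helper showing how the split lo/hi registers of the common config map to the 64-bit fields (queue_avail_* and queue_used_* in the Linux-style listing quoted above correspond to queue_driver and queue_device).

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: the v1 fields quoted in this thread plus queue_enable. */
struct virtio_dev_ctx_pci_vq_cfg {
    uint16_t vq_index;
    uint16_t queue_size;
    uint16_t queue_msix_vector;
    uint16_t queue_enable;      /* assumed placement, not from a posted patch */
    uint64_t queue_desc;
    uint64_t queue_driver;
    uint64_t queue_device;
};

/* Hypothetical helper: combine a lo/hi register pair into one 64-bit address. */
static uint64_t join_lo_hi(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

In this arrangement the four 16-bit fields occupy exactly 8 bytes, so the 64-bit addresses start at an 8-byte offset without padding.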

> > There is some information duplicated. Above struct virtio_pci_common_cfg is
> the snapshot of registers being updated.
> > While struct virtio_dev_ctx_pci_vq_cfg is capturing what is not visible in
> above config snapshot.
> > i.e. queues which are already configured.
> > For at that rare instance there is some duplication of some fields. but this is
> the exception part not be too much worried about.
> 
> Seems more of a rule than an exception - I don't see any fields that are not also
> in config space.
Just to avoid any confusion, I would like to sync on terminology.

We have:
a. virtio PCI config space of 4K.
b. virtio common config area in BAR
c. virtio device config area in BAR

When you say config space above, you mean #b and #c given the above discussion, right?
Same question for the bright line too, where you wrote config space.

> 
> > >
> > > Here's an idea: have a record per field. Use transport offsets as tags.
> > >
> > There are just too many of them.
> > They are logically clubbed in their native structures which are already defined.
> 
> Too many for what? If you don't like what I suggested find some way to avoid
> duplicating everything please.
Too many definitions, one for each individual field, when the holding structure is already defined.
So why split them?

The majority of fields are not duplicated, and the design goal is to not duplicate but rather to reuse the structs.
All the defined structures are migrated as-is without redefinition.
In the current v3, virtio common config and virtio device config are examples of this.

In subsequent work, vq stats, mac table, vlan table, rss config and flow filter entries will be migrated as their default structures.

Between the management plane and the guest driver there will be some duplication, as not all structs may be drafted in the migration context format.

> > >
> > > > > > > We need a stronger compatiblity story here I think.
> > > > > > >
> > > > > > > One way to show how it's designed to work would be to split
> > > > > > > the patches. For example, add queue notify data and queue reset
> separately.
> > > > > > I didn't follow the suggestion. Can you explain splitting
> > > > > > patches and its relation
> > > > > to the structure?
> > > > Did I miss your response? Or it is in below msix?
> > >
> > > exactly.
> > >
> > > > > >
> > > > > > >
> > > > > > > Another is to add MSIX table migration option for when MSIX
> > > > > > > table is passed through to guest.
> > > > > > Yes, this will be added in future when there is actual hypervisor for it.
> > > > >
> > > > > You are tying the architecture to an extremely implementation
> > > > > specific
> > > detail.
> > > > >
> > > > Which part?
> > > >
> > > > Not really. can you please which software will use MSI-X table migration?
> > >
> > > Like, all and any? All hypervisors migrate the msi-x table.
> > >
> > > > > Hypervisors *already* have migrate the MSIX table. Just in a
> > > > > hypervisor specific way.
> > > > MSI-X table is in the PCI BAR memory.
> > >
> > > Exactly. And that means hypervisor should not read it from the
> > > device directly - e.g. with an encrypted device it won't be able to.
> > >
> > Yep.
> 
> And it follows device needs to include
> 
> > > > > queue vector is an index into this table. So the index is
> > > > > migrated through the device but the table itself has to be
> > > > > trapped and emulated by
> > > hypervisor?
> > > > Do you have a hypervisor and a platform that has not done MSI-X
> > > > table
> > > emulation for which you are asking to add?
> > > > I don't know any.
> > > > I am asking to add it when there is a _real_ user of it. Why it
> > > > cannot be added
> > > when the _real_ user arrive?
> > >
> > > Real meaning actual hardware and software implementing it?  By this
> > > definition there's no real user for migration in the spec at all -
> > > all of it can be done by device specific means.
> > This is also an option if vendor specific commands are allowed.
> 
> Once you do that you can just throw all this spec effort out the window, and do
> vdpa. I thought the point was to allow control plane in standard hardware, so
> vendors can compete on best hardware and software is shared?
> 
I think the device context definition can still be kept vendor specific, so that a cloud operator who has the virtio devices can work that way.
The whole infrastructure and commands are still well defined.
It is a reasonable approach.

But I personally like this approach of a well defined and extensible device context, as many others liked it offline too.

> > > What's your point? By now we have a reasonably good idea what
> > > hypervisors need to make migration portable. Let's either put all of
> > > it in the spec or not bother at all.
> > >
> > Maybe I was not clear.
> > I am saying lets put the msix table when a user will find it useful in
> incremental manner.
> > For example, lets say we put in the spec in 1.4 version Nov 23, will device
> implement it? mostly yes.
> > Will existing software use it in 2024-25? mostly no, because most platform
> has complexity in this area as cpus have hard coded certain values.
> >
> > So we are asking device to implement something that is not going to be used
> by any forcible future.
> >
> > Hence, the request is, when the hypervisor/cpu vendor asks for it, it will be
> possible to add into the device.
> 
> If it is there then I expect hypervisors will use it, yes.
> 
That would be really good if they use it.
A few years back I encountered a cpu limitation on the encoded information and its close link to the remapping table content.

Do you know of any efforts in this area to support this?
In 2021 when I read about it, it was almost a dead-end discussion for x86_64.
I don't have the link handy anymore.

> > > In other words, we need a bright line. I suggest a simple one:
> > > memory is migrated by device, config space by hypervisor.  If not,
> > > suggest another one - but it needs a reasonable rule based on a
> > > hardware not whatever software found expedient to use.
> > >
> > I really liked this bright line.
> > Msix table like rest of the memory area of common config and device config
> will become part of this memory area.
> >
> > I wish we also draw a good bright line between for near term vs long term to
> be practical.
> >
> > Adding MSI-X table in the spec is not hard now or in future, frankly.
> 
> Show me the patch.
Frankly I don't think it is a reasonable ask at this point to write such a patch.

A parallel line of reasoning would be for one to show a hypervisor patch for at least one cpu/platform which can do msix bypass and remapping programming.
Would one supply it?

My question was just a thought exercise to stay practical and to have a good bright line for what we ask.
I am not really asking you to write one.

I agree with your suggestion that it is good for platform/cpu vendors to think about it.

One can add the MSI-X table like the patch below.
The same can be done for the PBA structure.

diff --git a/device-context.tex b/device-context.tex
index ab19fc9..72f626e 100644
--- a/device-context.tex
+++ b/device-context.tex
@@ -52,7 +52,9 @@ \section{Device Context}\label{sec:Basic Facilities of a Virtio Device / Device
 \hline
 0x105 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of virtqueue descriptors owned by device  \\
 \hline
-0x106 - 0xFFF & - & Generic device agnostic range reserved for future \\
+0x106 & VIRTIO_DEV_CTX_PCI_MSIX_TABLE & Provides MSI-X table \\
+\hline
+0x107 - 0xFFF & - & Generic device agnostic range reserved for future \\
 \hline
 \hline
 0x1000 & VIRTIO_DEV_CTX_DEV_CFG & Provides device specific configuration \\
@@ -197,6 +199,12 @@ \subsubsection{Virtqueue Split Mode Device owned Descriptors Context}
 One or multiple entries of \field{struct virtio_dev_ctx_vq_split_dev_descs} may exist, each such
 entry corresponds to a virtqueue identified by the \field{vq_index}.

+\subsubsection{PCI Device MSIX Table Context}
+
+For the field VIRTIO_DEV_CTX_PCI_MSIX_TABLE, \field{type} is set to 0x106.
+The \field{value} is in format of MSI-X table structure as defined in PCIe spec.
+The \field{length} is the length of whole MSI-X table structure in bytes.
+
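
For reference, the length carried by such a field follows from the PCIe MSI-X definitions: each table entry is 16 bytes, and the Table Size field (bits 10:0 of the MSI-X Message Control register) is encoded as N-1. The sketch below computes the byte length from a Message Control value; the struct and function names are made up for illustration and are not part of the patch.

```c
#include <stdint.h>

/* One MSI-X table entry as laid out in BAR memory (16 bytes). */
struct msix_table_entry {
    uint32_t msg_addr_lo;   /* Message Address */
    uint32_t msg_addr_hi;   /* Message Upper Address */
    uint32_t msg_data;      /* Message Data */
    uint32_t vector_ctrl;   /* Vector Control, bit 0 is the per-vector mask */
};

/* Table Size lives in bits 10:0 of Message Control, encoded as N-1. */
#define MSIX_MSG_CTRL_TABLE_SIZE_MASK 0x07ffu

/* Hypothetical helper: byte length of the whole MSI-X table. */
static uint32_t msix_table_len_bytes(uint16_t msg_ctrl)
{
    uint32_t entries = (uint32_t)(msg_ctrl & MSIX_MSG_CTRL_TABLE_SIZE_MASK) + 1u;

    return entries * (uint32_t)sizeof(struct msix_table_entry);
}
```

A device exposing 8 vectors would therefore report a 128 byte value, and the maximum of 2048 vectors gives 32 KiB.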

^ permalink raw reply related	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  9:42                                                                                                               ` Zhu, Lingshan
@ 2023-11-01 10:23                                                                                                                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-01 10:23 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 01, 2023 at 05:42:43PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/1/2023 5:07 PM, Michael S. Tsirkin wrote:
> > On Wed, Nov 01, 2023 at 04:55:01PM +0800, Zhu, Lingshan wrote:
> > > > So do you want MSIX table as part of device context or not?
> > > MSI is a part of device context as I suggest you to read QEMU code for the
> > > answers.
> > > 
> > > You will also found that part is already done.
> > > 
> > > So I don't know why this is a topic worthy discussions.
> > > 
> > > If you have questions, I still suggest you read QEMU code.
> > Frankly at this point I don't know what you are trying to say either.
> MSI table is R/W, so the hypervisor can migrate it now

Yes. And the same applies to most of the state that is part of device
context here.

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  8:36                                                                                                   ` Michael S. Tsirkin
@ 2023-11-01 10:24                                                                                                     ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-01 10:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, Jason Wang, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 1, 2023 2:06 PM

> My experience shows we need simple composable and self-contained blocks.
> If we are trying to make hypervisor avoid accessing device memory (which
> seems to be one of key design points behind what you are building) then we
> can't have queue msix vector index migrated in one way and the vector itself in
> a completely different way.
> 
Makes sense to include it as long as one can use it.

> For example, one of the advantages of using admin commands is that they are
> atomic - so we get a consistent snapshot of the device state.
> But, this goes out of the window if we are referencing a table that is not part of
> that same command output.

To make the device active at the destination, one needs both fields restored consistently.
Only after that can one make the mode change from stop->active.
Usually, the vq indexes into the table without reading the table content, right?
I can see that some device implementations may try to read this table when writing the device context,
but it would be really hard for the device to keep synchronizing the VQ with the msix entry, as in theory a vector can be masked.
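
The consistency requirement before the stop->active change can be sketched as a simple check on the destination: every restored queue vector must either be "no vector" or index an entry that exists in the restored MSI-X table. VIRTIO_MSI_NO_VECTOR (0xffff) is the value defined by the virtio spec; the function name and its calling convention are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_MSI_NO_VECTOR 0xffffu

/*
 * Hypothetical check run after the device context is written and before
 * the mode changes from stop to active. Returns true when every queue
 * vector refers to an existing MSI-X table entry or to no vector at all.
 */
static bool vq_vectors_consistent(const uint16_t *vq_vectors, size_t num_vqs,
                                  uint16_t msix_table_entries)
{
    for (size_t i = 0; i < num_vqs; i++) {
        uint16_t v = vq_vectors[i];

        if (v == VIRTIO_MSI_NO_VECTOR)
            continue;               /* queue does not use an interrupt */
        if (v >= msix_table_entries)
            return false;           /* index points outside the table */
    }
    return true;
}
```

The check only looks at indices; it does not read the table content, which matches the point that a vq indexes into the table without reading it.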



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-01  3:07                                                                           ` Parav Pandit
@ 2023-11-02  4:24                                                                             ` Jason Wang
  2023-11-02  6:10                                                                               ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-02  4:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 1, 2023 at 11:07 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 1, 2023 6:03 AM
> >
> > On Tue, Oct 31, 2023 at 1:17 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: virtio-comment@lists.oasis-open.org
> > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Jason Wang
> > > > Sent: Tuesday, October 31, 2023 7:07 AM
> > > >
> > > > On Mon, Oct 30, 2023 at 12:28 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, October 30, 2023 9:35 AM
> > > > > >
> > > > > > 在 2023/10/26 11:50, Parav Pandit 写道:
> > > > > > >> From: virtio-comment@lists.oasis-open.org
> > > > > > >> <virtio-comment@lists.oasis- open.org> On Behalf Of Jason
> > > > > > >> Wang For example, you still haven't succeeded in defining
> > passthrough.
> > > > > > > It was defined on 19th Oct in [1].
> > > > > > > What part is not clear to you in definition of passthrough device?
> > > > > > >
> > > > > > > [1]
> > > > > > > https://lore.kernel.org/virtio-
> > > > > > comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A
> > > > > > > 57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > >
> > > > > >
> > > > > > Let me copy-paste it again:
> > > > > >
> > > > > > For example, assuming you are correct, you still fail to explain
> > > > > >
> > > > > > 1) what is trapped and what's not, or what's the boundary
> > > > > Passthrough definition was replied few times.
> > > > > One of them is here,
> > > > > https://lore.kernel.org/virtio-
> > > > comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A
> > > > > 57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > I don’t know what you mean by 'explain'. What do you want to be
> > explained?
> > > > > What is trapped is listed in
> > > > > https://lore.kernel.org/virtio-
> > > > comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A
> > > > > 57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > What is not trapped is also listed in
> > > > > https://lore.kernel.org/virtio-
> > > > comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A
> > > > > 57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > So what more do you want to explain in there?
> > > >
> > > > You explained that MSI-X is trapped but not the others. People may know
> > why.
> > > > or what's the boundary to choose to trap or not.
> > > >
> > > If a platform can support without trapping, it can be avoided as well and can
> > be added in the future.
> >
> > Who is going to do that synchronization?
> Lets first bring that hypervisor sw design before discussing phantom problem solving.
> All necessary modules will be involved in synchronization depending on how its done in future.

It's not the job of the virtio spec to mandate any particular
hypervisor design. But it looks to me like you want to do that.

>
> >
> > >
> > > > >
> > > > > > 2) if the hypervisor is not developed with those assumptions,
> > > > > > things can work
> > > > > What to explain in #2. :)
> > > > > Things can expand when such hypervisor is born.
> > > >
> > > > So the point is still, to make your proposal to be useful in more use cases.
> > > >
> > > When a use case arise, device context can be expanded.
> >
> > It's not device context.
> >
> I don’t see why not. It is stored in the device.
> Remapping part will be hypervisor specific, so it may be stored in platform specific migration data.

The point is, device context should work for all types of hypervisors.
You can't claim it can only work with your "passthrough" model.

>
> > > No point in making things no one implements or not present in hypervisor.
> > > The infrastructure is extendible so spec is covered for it.
> >
> > It would be problematic if you stick to claim "passthrough" but not.
>
> I don’t know what this means. I am not debating passthrough/non-passthrough.
> What is inside the device, will be part of device-context.
> What is part of the platform content, will be part of platform context.
> Since this is generic to all types of PCI devices, I don’t see a need to over-solve it now in virtio.

Ok, so you agree it can work even if the hypervisor wants to trap?

Thanks



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-01  3:31                                                                     ` Parav Pandit
@ 2023-11-02  4:25                                                                       ` Jason Wang
  2023-11-02  6:10                                                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-02  4:25 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 1, 2023 6:04 AM
> >
> > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > >
> > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> > > > > >
> > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > >
> > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > > >
> > > > > > > > > > How do you know that?
> > > > > > > > > Because for passthrough, the hypervisor is not involved in
> > > > > > > > > dealing with VQ at
> > > > > > > > all.
> > > > > > > >
> > > > > > > > Ok, so if I understand correctly, you are saying your design
> > > > > > > > can't work for the case of PASID assignment.
> > > > > > > >
> > > > > > > No. PASID assignment will happen from the guest for its own
> > > > > > > use and device
> > > > > > migration will just work fine because device context will capture this.
> > > > > >
> > > > > > It's not about device context. We're discussing "passthrough", no?
> > > > > >
> > > > > Not sure, we are discussing same.
> > > > > A member device is passthrough to the guest, dealing with its own
> > > > > PASIDs and
> > > > virtio interface for some VQ assignment to PASID.
> > > > > So VQ context captured by the hypervisor, will have some PASID
> > > > > attached to
> > > > this VQ.
> > > > > Device context will be updated.
> > > > >
> > > > > > You want all virtio stuff to be "passthrough", but assigning a
> > > > > > PASID to a specific virtqueue in the guest must be trapped.
> > > > > >
> > > > > No. PASID assignment to a specific virtqueue in the guest must go
> > > > > directly
> > > > from guest to device.
> > > >
> > > > This works like setting CR3, you can't simply let it go from guest to host.
> > > >
> > > > Host IOMMU driver needs to know the PASID to program the IO page
> > > > tables correctly.
> > > >
> > > This will be done by the IOMMU.
> > >
> > > > > When guest iommu may need to communicate anything for this PASID,
> > > > > it will
> > > > come through its proper IOMMU channel/hypercall.
> > > >
> > > > Let's say using PASID X for queue 0, this knowledge is beyond the
> > > > IOMMU scope but belongs to virtio. Or please explain how it can work
> > > > when it goes directly from guest to device.
> > > >
> > > We are yet to ever see spec for PASID to VQ assignment.
> >
> > It has one.
> >
> > > For ok for theory sake it is there.
> > >
> > > Virtio driver will assign the PASID directly from guest driver to device using a
> > create_vq(pasid=X) command.
> > > Same process is somehow attached the PASID by the guest OS.
> > > The whole PASID range is known to the hypervisor when the device is handed
> > over to the guest VM.
> >
> > How can it know?
> >
> > > So PASID mapping is setup by the hypervisor IOMMU at this point.
> >
> > You disallow the PASID to be virtualized here. What's more, such a PASID
> > passthrough has security implications.
> >
> No. virtio spec is not disallowing. At least for sure, this series is not the one.
> My main point is, virtio device interface will not be the source of hypercall to program IOMMU in the hypervisor.
> It is something to be done by IOMMU side.

So unless vPASID can be used by the hardware, you need to trap the
mapping from a PASID to a virtqueue. Then you need virtio-specific
knowledge.

>
> > Again, we are talking about different things, I've tried to show you that there are
> > cases that passthrough can't work but if you think the only way for migration is
> > to use passthrough in every case, you will probably fail.
> >
> I didn't say only way for migration is passthrough.
> Passthrough is clearly one way.
> Other ways may be possible.
>
> > >
> > > > > Virtio device is not the conduit for this exchange.
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > There are works ongoing to make vPASID work for the
> > > > > > > > > > guest like
> > > > vSVA.
> > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > Passthrough do not run like SVA.
> > > > > > > >
> > > > > > > > Great, you find another limitation of "passthrough" by yourself.
> > > > > > > >
> > > > > > > No. it is not the limitation it is just the way it does not
> > > > > > > need complex SVA to
> > > > > > split the device for unrelated usage.
> > > > > >
> > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > >
> > > > > He he, I am not limiting, again misunderstanding or wrong attribution.
> > > > > I explained that hypervisor for passthrough does not need SVA.
> > > > > Guest can do anything it wants from the guest OS with the member
> > device.
> > > >
> > > > Ok, so the point stills, see above.
> > >
> > > I don’t think so. The guest owns its PASID space
> >
> > Again, vPASID to PASID can't be done hardware unless I miss some recent
> > features of IOMMUs.
> >
> Cpu vendors have different way of doing vPASID to pPASID.

At least for the current version of major IOMMU vendors, such
translation (aka PASID remapping) is not implemented in the hardware
so it needs to be trapped first.
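
To illustrate why such a trap needs virtio-specific knowledge, here is a
minimal sketch. It assumes the hypothetical create_vq(pasid=X) command
used as an example earlier in this thread (no such command is defined in
the spec) and a hypothetical per-VM remapping table kept by the
hypervisor. PasidRemapper and trap_create_vq are made-up names used only
for illustration.

```python
# Illustrative only: none of these names come from the virtio spec.
from dataclasses import dataclass, field


@dataclass
class PasidRemapper:
    """Per-VM vPASID -> pPASID table kept by the hypervisor (assumed design)."""
    next_ppasid: int = 1
    table: dict = field(default_factory=dict)

    def translate(self, vpasid: int) -> int:
        # Allocate a physical PASID the first time a guest PASID is seen.
        if vpasid not in self.table:
            self.table[vpasid] = self.next_ppasid
            self.next_ppasid += 1
        return self.table[vpasid]


def trap_create_vq(remapper, device_vqs, vq_index, vpasid):
    """Handle a trapped, hypothetical create_vq(pasid=X) guest command.

    The handler must understand that this particular virtio command binds
    a PASID to a virtqueue; a generic PCI/IOMMU layer would not know which
    field of the command to rewrite.
    """
    ppasid = remapper.translate(vpasid)
    # Record what the physical device would actually be programmed with.
    device_vqs[vq_index] = ppasid
    return ppasid
```

The guest only ever sees the vPASID; the rewrite happens in the trap
handler, which is the step that cannot be done without parsing the
virtio command.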

> It is still an early space for virtio.
>
> > > and directly communicates like any other device attribute.
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > > > Each passthrough device has PASID from its own space fully
> > > > > > > > > managed by the
> > > > > > > > guest.
> > > > > > > > > Some cpu required vPASID and SIOV is not going this way anmore.
> > > > > > > >
> > > > > > > > Then how to migrate? Invent a full set of something else
> > > > > > > > through another giant series like this to migrate to the SIOV thing?
> > > > > > > > That's a mess for
> > > > > > sure.
> > > > > > > >
> > > > > > > SIOV will for sure reuse most or all parts of this work, almost entirely
> > as_is.
> > > > > > > vPASID is cpu/platform specific things not part of the SIOV devices.
> > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > If at all it is done, it will be done from the guest
> > > > > > > > > > > by the driver using virtio
> > > > > > > > > > interface.
> > > > > > > > > >
> > > > > > > > > > Then you need to trap. Such things couldn't be passed
> > > > > > > > > > through to guests
> > > > > > > > directly.
> > > > > > > > > >
> > > > > > > > > Only PASID capability is trapped. PASID allocation and
> > > > > > > > > usage is directly from
> > > > > > > > guest.
> > > > > > > >
> > > > > > > > How can you achieve this? Assigning a PAISD to a device is
> > > > > > > > completely
> > > > > > > > device(virtio) specific. How can you use a general layer
> > > > > > > > without the knowledge of virtio to trap that?
> > > > > > > When one wants to map vPASID to pPASID a platform needs to be
> > > > involved.
> > > > > >
> > > > > > I'm not talking about how to map vPASID to pPASID, it's out of
> > > > > > the scope of virtio. I'm talking about assigning a vPASID to a
> > > > > > specific virtqueue or other virtio function in the guest.
> > > > > >
> > > > > That can be done in the guest. The key is guest wont know that it
> > > > > is dealing
> > > > with vPASID.
> > > > > It will follow the same principle from your paper of equivalency,
> > > > > where virtio
> > > > software layer will assign PASID to VQ and communicate to device.
> > > > >
> > > > > Anyway, all of this just digression from current series.
> > > >
> > > > It's not, as you mention that only MSI-X is trapped, I give you another one.
> > > >
> > > PASID access from the guest to be done fully by the guest IOMMU.
> > > Not by virtio devices.
> > >
> > > > >
> > > > > > You need a virtio specific queue or capability to assign a PASID
> > > > > > to a specific virtqueue, and that can't be done without trapping
> > > > > > and without virito specific knowledge.
> > > > > >
> > > > > I disagree. PASID assignment to a virqueue in future from guest
> > > > > virtio driver to
> > > > device is uniform method.
> > > > > Whether its PF assigning PASID to VQ of self, Or VF driver in the
> > > > > guest assigning PASID to VQ.
> > > > >
> > > > > All same.
> > > > > Only IOMMU layer hypercalls will know how to deal with PASID
> > > > > assignment at
> > > > platform layer to setup the domain etc table.
> > > > >
> > > > > And this is way beyond our device migration discussion.
> > > > > By any means, if you were implying that somehow vq to PASID
> > > > > assignment
> > > > _may_ need trap+emulation, hence whole device migration to depend on
> > > > some
> > > > trap+emulation, than surely, than I do not agree to it.
> > > >
> > > > See above.
> > > >
> > > Yeah, I disagree to such implying.
> > >
> > > > >
> > > > > PASID equivalent in mlx5 world is ODP_MR+PD isolating the guest
> > > > > process and
> > > > all of that just works on efficiency and equivalence principle
> > > > already for a decade now without any trap+emulation.
> > > > >
> > > > > > > When virtio passthrough device is in guest, it has all its PASID
> > accessible.
> > > > > > >
> > > > > > > All these is large deviation from current discussion of this
> > > > > > > series, so I will keep
> > > > > > it short.
> > > > > > >
> > > > > > > >
> > > > > > > > > Regardless it is not relevant to passthrough mode as PASID
> > > > > > > > > is yet another
> > > > > > > > resource.
> > > > > > > > > And for some cpu if it is trapped, it is generic layer,
> > > > > > > > > that does not require virtio
> > > > > > > > involvement.
> > > > > > > > > So virtio interface asking to trap something because
> > > > > > > > > generic facility has done
> > > > > > > > in not the approach.
> > > > > > > >
> > > > > > > > This misses the point of PASID. How to use PASID is totally
> > > > > > > > device
> > > > specific.
> > > > > > > Sure, and how to virtualize vPASID/pPASID is platform specific
> > > > > > > as single PASID
> > > > > > can be used by multiple devices and process.
> > > > > >
> > > > > > See above, I think we're talking about different things.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > > Capabilities of #2 is generic across all pci devices,
> > > > > > > > > > > so it will be handled by the
> > > > > > > > > > HV.
> > > > > > > > > > > ATS/PRI cap is also generic manner handled by the HV
> > > > > > > > > > > and PCI
> > > > device.
> > > > > > > > > >
> > > > > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU.
> > > > > > > > > > You can simply do ATS/PRI passthrough but with an emulated
> > vIOMMU.
> > > > > > > > > And that is not the reason for virtio device to build
> > > > > > > > > trap+emulation for
> > > > > > > > passthrough member devices.
> > > > > > > >
> > > > > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > > > > PRI requests arrive on the PF for the VF.
> > > > > >
> > > > > > Shouldn't it arrive at platform IOMMU first? The path should be
> > > > > > PRI
> > > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest IOMMU.
> > > > > >
> > > > > Above sequence seems write.
> > > > >
> > > > > > And things will be more complicated when (v)PASID is used. So
> > > > > > you can't simply let PRI go directly to the guest with the current
> > architecture.
> > > > > >
> > > > > In current architecture of the pci VF, PRI does not go directly to the guest.
> > > > > (and that is not reason to trap and emulate other things).
> > > >
> > > > Ok, so beyond MSI-X we need to trap PRI, and we will probably trap
> > > > other things in the future like PASID assignment.
> > > PRI etc all belong to generic PCI 4K config space region.
> >
> > It's not about the capability, it's about the whole process of PRI request
> > handling. We've agreed that the PRI request needs to be trapped by the
> > hypervisor and then delivered to the vIOMMU.
> >
>
> > > Trap+emulation done in generic manner without involving virtio or other
> > device types.
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > > how can you pass
> > > > > > > > through a hardware PRI request to a guest directly without
> > > > > > > > trapping it
> > > > then?
> > > > > > > > What's more, PCIE allows the PRI to be done in a vendor
> > > > > > > > (virtio) specific way, so you want to break this rule? Or
> > > > > > > > you want to blacklist ATS/PRI
> > > > > > for virtio?
> > > > > > > >
> > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > Do you have a reference to the ECN that enables vendor
> > > > > > > specific way of PRI? I
> > > > > > would like to read it.
> > > > > >
> > > > > > I mean it doesn't forbid us to build a virtio specific interface
> > > > > > for I/O page fault report and recovery.
> > > > > >
> > > > > So PRI of PCI does not allow. It is ODP kind of technique you meant above.
> > > > > Yes one can build.
> > > > > Ok. unrelated to device migration, so I will park this good discussion for
> > later.
> > > >
> > > > That's fine.
> > > >
> > > > >
> > > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > > >
> > > > > > Probably.
> > > > > >
> > > > > > > PRI will directly go to the guest driver, and guest would
> > > > > > > interact with IOMMU
> > > > > > to service the paging request through IOMMU APIs.
> > > > > >
> > > > > > With PASID, it can't go directly.
> > > > > >
> > > > > When the request consist of PASID in it, it can.
> > > > > But again these PCI-SIG extensions of PASID are not related to
> > > > > device
> > > > migration, so I am differing it.
> > > > >
> > > > > > > For PRI in vendor specific way needs a separate discussion. It
> > > > > > > is not related to
> > > > > > live migration.
> > > > > >
> > > > > > PRI itself is not related. But the point is, you can't simply
> > > > > > pass through ATS/PRI now.
> > > > > >
> > > > > Ah ok. the whole 4K PCI config space where ATS/PRI capabilities
> > > > > are located
> > > > are trapped+emulated by hypervisor.
> > > > > So?
> > > > > So do we start emulating virito interfaces too for passthrough?
> > > > > No.
> > > > > Can one still continue to trap+emulate?
> > > > > Sure why not?
> > > >
> > > > Then let's not limit your proposal to be used by "passthrough" only?
> > > One can possibly build some variant of the existing virtio member device
> > using same owner and member scheme.
> >
> > It's not about the member/owner, it's about e.g whether the hypervisor can
> > trap and emulate.
> >
> > I've pointed out that what you invent here is actually a partial new transport, for
> > example, a hypervisor can trap and use things like device context in PF to bypass
> > the registers in VF. This is the idea of transport commands/q.
> >
> I will not mix transport commands which are mainly useful for actual device operation for SIOV only for backward compatibility that too optionally.
> One may still choose to have virtio common and device config in MMIO ofcourse at lower scale.
>
> Anyway, mixing migration context with actual SIOV specific thing is not correct as device context is read/write incremental values.

SIOV is transport-level stuff; the transport virtqueue is designed in
a way that is general enough to cover it. Let's not shift concepts.

One thing that you ignore is that a hypervisor can use what you
invented as a transport for the VF, no?

>
> > > If for that is some admin commands are missing, may be one can add them.
> >
> > I would then build the device context commands on top of the transport
> > commands/q, then it would be complete.
> >
> > > No need to step on toes of use cases as they are different...
> > >
> > > > I've shown you that
> > > >
> > > > 1) you can't easily say you can pass through all the virtio
> > > > facilities
> > > > 2) how ambiguous for terminology like "passthrough"
> > > >
> > > It is not, it is well defined in v3, v2.
> > > One can continue to argue and keep defining the variant and still call it data
> > path acceleration and then claim it as passthrough ...
> > > But I won't debate this anymore as its just non-technical aspects of least
> > interest.
> >
> > You use this terminology in the spec which is all about technical, and you think
> > how to define it is a matter of non-technical. This is self-contradictory. If you fail,
> > it probably means it's ambiguous.
> > Let's don't use that terminology.
> >
> What it means is described in theory of operation.
>
> > > We have technical tasks and more improved specs to update going forward.
> >
> > It's a burden to do the synchronization.
> We have discussed this.
> In current proposed the member device is not bifurcated,

It is. Some of the functions are carried via the PCI interface and
some via the owner. You end up with two drivers driving the
device.

Thanks


> so it implements the necessary pieces.
> Feature != burden.
>
> >
> > > Working on extension for device specific contexts to enrich it.
> >
> > Again, making the proposal to be general is much more beneficial.
>
> Yes, it is general and like any other device-type, each has their extensions.
> Infrastructure covers in v3.



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-02  4:25                                                                       ` Jason Wang
@ 2023-11-02  6:10                                                                         ` Parav Pandit
  2023-11-06  6:34                                                                           ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-02  6:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 2, 2023 9:56 AM
> 
> On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Wednesday, November 1, 2023 6:04 AM
> > >
> > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > >
> > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of Jason
> > > > > > > Wang
> > > > > > >
> > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > >
> > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > > > >
> > > > > > > > > > > How do you know that?
> > > > > > > > > > Because for passthrough, the hypervisor is not
> > > > > > > > > > involved in dealing with VQ at
> > > > > > > > > all.
> > > > > > > > >
> > > > > > > > > Ok, so if I understand correctly, you are saying your
> > > > > > > > > design can't work for the case of PASID assignment.
> > > > > > > > >
> > > > > > > > No. PASID assignment will happen from the guest for its
> > > > > > > > own use and device
> > > > > > > migration will just work fine because device context will capture this.
> > > > > > >
> > > > > > > It's not about device context. We're discussing "passthrough", no?
> > > > > > >
> > > > > > Not sure, we are discussing same.
> > > > > > A member device is passthrough to the guest, dealing with its
> > > > > > own PASIDs and
> > > > > virtio interface for some VQ assignment to PASID.
> > > > > > So VQ context captured by the hypervisor, will have some PASID
> > > > > > attached to
> > > > > this VQ.
> > > > > > Device context will be updated.
> > > > > >
> > > > > > > You want all virtio stuff to be "passthrough", but assigning
> > > > > > > a PASID to a specific virtqueue in the guest must be trapped.
> > > > > > >
> > > > > > No. PASID assignment to a specific virtqueue in the guest must
> > > > > > go directly
> > > > > from guest to device.
> > > > >
> > > > > This works like setting CR3, you can't simply let it go from guest to host.
> > > > >
> > > > > Host IOMMU driver needs to know the PASID to program the IO page
> > > > > tables correctly.
> > > > >
> > > > This will be done by the IOMMU.
> > > >
> > > > > > When guest iommu may need to communicate anything for this
> > > > > > PASID, it will
> > > > > come through its proper IOMMU channel/hypercall.
> > > > >
> > > > > Let's say using PASID X for queue 0, this knowledge is beyond
> > > > > the IOMMU scope but belongs to virtio. Or please explain how it
> > > > > can work when it goes directly from guest to device.
> > > > >
> > > > We are yet to ever see spec for PASID to VQ assignment.
> > >
> > > It has one.
> > >
> > > > For ok for theory sake it is there.
> > > >
> > > > Virtio driver will assign the PASID directly from guest driver to
> > > > device using a
> > > create_vq(pasid=X) command.
> > > > Same process is somehow attached the PASID by the guest OS.
> > > > The whole PASID range is known to the hypervisor when the device
> > > > is handed
> > > over to the guest VM.
> > >
> > > How can it know?
> > >
> > > > So PASID mapping is setup by the hypervisor IOMMU at this point.
> > >
> > > You disallow the PASID to be virtualized here. What's more, such a
> > > PASID passthrough has security implications.
> > >
> > No. virtio spec is not disallowing. At least for sure, this series is not the one.
> > My main point is, virtio device interface will not be the source of hypercall to
> program IOMMU in the hypervisor.
> > It is something to be done by IOMMU side.
> 
> So unless vPASID can be used by the hardware you need to trap the mapping
> from a PASID to a virtqueue. Then you need virtio specific knowledge.
> 
Hardware vPASID is unlikely to be used by PCI endpoint devices, at least in the near term.
It requires a vPASID-to-pPASID table either in the device or in the IOMMU.

> >
> > > Again, we are talking about different things, I've tried to show you
> > > that there are cases that passthrough can't work but if you think
> > > the only way for migration is to use passthrough in every case, you will
> probably fail.
> > >
> > I didn't say only way for migration is passthrough.
> > Passthrough is clearly one way.
> > Other ways may be possible.
> >
> > > >
> > > > > > Virtio device is not the conduit for this exchange.
> > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > There are works ongoing to make vPASID work for the
> > > > > > > > > > > guest like
> > > > > vSVA.
> > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > >
> > > > > > > > > Great, you find another limitation of "passthrough" by yourself.
> > > > > > > > >
> > > > > > > > No. it is not the limitation it is just the way it does
> > > > > > > > not need complex SVA to
> > > > > > > split the device for unrelated usage.
> > > > > > >
> > > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > > >
> > > > > > He he, I am not limiting, again misunderstanding or wrong attribution.
> > > > > > I explained that hypervisor for passthrough does not need SVA.
> > > > > > Guest can do anything it wants from the guest OS with the
> > > > > > member
> > > device.
> > > > >
> > > > > Ok, so the point stills, see above.
> > > >
> > > > I don’t think so. The guest owns its PASID space
> > >
> > > Again, vPASID to PASID can't be done hardware unless I miss some
> > > recent features of IOMMUs.
> > >
> > Cpu vendors have different way of doing vPASID to pPASID.
> 
> At least for the current version of major IOMMU vendors, such translation (aka
> PASID remapping) is not implemented in the hardware so it needs to be trapped
> first.
> 
Right. So it is really far in the future, at least a few years away.

> > It is still an early space for virtio.
> >
> > > > and directly communicates like any other device attribute.
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > Each passthrough device has PASID from its own space
> > > > > > > > > > fully managed by the
> > > > > > > > > guest.
> > > > > > > > > > Some cpu required vPASID and SIOV is not going this way
> anmore.
> > > > > > > > >
> > > > > > > > > Then how to migrate? Invent a full set of something else
> > > > > > > > > through another giant series like this to migrate to the SIOV
> thing?
> > > > > > > > > That's a mess for
> > > > > > > sure.
> > > > > > > > >
> > > > > > > > SIOV will for sure reuse most or all parts of this work,
> > > > > > > > almost entirely
> > > as_is.
> > > > > > > > vPASID is cpu/platform specific things not part of the SIOV devices.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > If at all it is done, it will be done from the
> > > > > > > > > > > > guest by the driver using virtio
> > > > > > > > > > > interface.
> > > > > > > > > > >
> > > > > > > > > > > Then you need to trap. Such things couldn't be
> > > > > > > > > > > passed through to guests
> > > > > > > > > directly.
> > > > > > > > > > >
> > > > > > > > > > Only PASID capability is trapped. PASID allocation and
> > > > > > > > > > usage is directly from
> > > > > > > > > guest.
> > > > > > > > >
> > > > > > > > > How can you achieve this? Assigning a PAISD to a device
> > > > > > > > > is completely
> > > > > > > > > device(virtio) specific. How can you use a general layer
> > > > > > > > > without the knowledge of virtio to trap that?
> > > > > > > > When one wants to map vPASID to pPASID a platform needs to
> > > > > > > > be
> > > > > involved.
> > > > > > >
> > > > > > > I'm not talking about how to map vPASID to pPASID, it's out
> > > > > > > of the scope of virtio. I'm talking about assigning a vPASID
> > > > > > > to a specific virtqueue or other virtio function in the guest.
> > > > > > >
> > > > > > That can be done in the guest. The key is guest wont know that
> > > > > > it is dealing
> > > > > with vPASID.
> > > > > > It will follow the same principle from your paper of
> > > > > > equivalency, where virtio
> > > > > software layer will assign PASID to VQ and communicate to device.
> > > > > >
> > > > > > Anyway, all of this just digression from current series.
> > > > >
> > > > > It's not, as you mention that only MSI-X is trapped, I give you another
> one.
> > > > >
> > > > PASID access from the guest to be done fully by the guest IOMMU.
> > > > Not by virtio devices.
> > > >
> > > > > >
> > > > > > > You need a virtio specific queue or capability to assign a
> > > > > > > PASID to a specific virtqueue, and that can't be done
> > > > > > > without trapping and without virito specific knowledge.
> > > > > > >
> > > > > > I disagree. PASID assignment to a virqueue in future from
> > > > > > guest virtio driver to
> > > > > device is uniform method.
> > > > > > Whether its PF assigning PASID to VQ of self, Or VF driver in
> > > > > > the guest assigning PASID to VQ.
> > > > > >
> > > > > > All same.
> > > > > > Only IOMMU layer hypercalls will know how to deal with PASID
> > > > > > assignment at
> > > > > platform layer to setup the domain etc table.
> > > > > >
> > > > > > And this is way beyond our device migration discussion.
> > > > > > By any means, if you were implying that somehow vq to PASID
> > > > > > assignment
> > > > > _may_ need trap+emulation, hence whole device migration to
> > > > > depend on some
> > > > > trap+emulation, than surely, than I do not agree to it.
> > > > >
> > > > > See above.
> > > > >
> > > > Yeah, I disagree to such implying.
> > > >
> > > > > >
> > > > > > PASID equivalent in mlx5 world is ODP_MR+PD isolating the
> > > > > > guest process and
> > > > > all of that just works on efficiency and equivalence principle
> > > > > already for a decade now without any trap+emulation.
> > > > > >
> > > > > > > > When virtio passthrough device is in guest, it has all its
> > > > > > > > PASID
> > > accessible.
> > > > > > > >
> > > > > > > > All these is large deviation from current discussion of
> > > > > > > > this series, so I will keep
> > > > > > > it short.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > Regardless it is not relevant to passthrough mode as
> > > > > > > > > > PASID is yet another
> > > > > > > > > resource.
> > > > > > > > > > And for some cpu if it is trapped, it is generic
> > > > > > > > > > layer, that does not require virtio
> > > > > > > > > involvement.
> > > > > > > > > > So virtio interface asking to trap something because
> > > > > > > > > > generic facility has done
> > > > > > > > > in not the approach.
> > > > > > > > >
> > > > > > > > > This misses the point of PASID. How to use PASID is
> > > > > > > > > totally device
> > > > > specific.
> > > > > > > > Sure, and how to virtualize vPASID/pPASID is platform
> > > > > > > > specific as single PASID
> > > > > > > can be used by multiple devices and process.
> > > > > > >
> > > > > > > See above, I think we're talking about different things.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > Capabilities of #2 is generic across all pci
> > > > > > > > > > > > devices, so it will be handled by the
> > > > > > > > > > > HV.
> > > > > > > > > > > > ATS/PRI cap is also generic manner handled by the
> > > > > > > > > > > > HV and PCI
> > > > > device.
> > > > > > > > > > >
> > > > > > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU.
> > > > > > > > > > > You can simply do ATS/PRI passthrough but with an
> > > > > > > > > > > emulated
> > > vIOMMU.
> > > > > > > > > > And that is not the reason for virtio device to build
> > > > > > > > > > trap+emulation for
> > > > > > > > > passthrough member devices.
> > > > > > > > >
> > > > > > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > >
> > > > > > > Shouldn't it arrive at platform IOMMU first? The path should
> > > > > > > be PRI
> > > > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest
> IOMMU.
> > > > > > >
> > > > > > Above sequence seems write.
> > > > > >
> > > > > > > And things will be more complicated when (v)PASID is used.
> > > > > > > So you can't simply let PRI go directly to the guest with
> > > > > > > the current
> > > architecture.
> > > > > > >
> > > > > > In current architecture of the pci VF, PRI does not go directly to the
> guest.
> > > > > > (and that is not reason to trap and emulate other things).
> > > > >
> > > > > Ok, so beyond MSI-X we need to trap PRI, and we will probably
> > > > > trap other things in the future like PASID assignment.
> > > > PRI etc all belong to generic PCI 4K config space region.
> > >
> > > It's not about the capability, it's about the whole process of PRI
> > > request handling. We've agreed that the PRI request needs to be
> > > trapped by the hypervisor and then delivered to the vIOMMU.
> > >
> >
> > > > Trap+emulation done in generic manner without involving virtio or
> > > > Trap+other
> > > device types.
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > how can you pass
> > > > > > > > > through a hardware PRI request to a guest directly
> > > > > > > > > without trapping it
> > > > > then?
> > > > > > > > > What's more, PCIE allows the PRI to be done in a vendor
> > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > for virtio?
> > > > > > > > >
> > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > Do you have a reference to the ECN that enables vendor
> > > > > > > > specific way of PRI? I
> > > > > > > would like to read it.
> > > > > > >
> > > > > > > I mean it doesn't forbid us to build a virtio specific
> > > > > > > interface for I/O page fault report and recovery.
> > > > > > >
> > > > > > So PRI of PCI does not allow. It is ODP kind of technique you meant
> above.
> > > > > > Yes one can build.
> > > > > > Ok. unrelated to device migration, so I will park this good
> > > > > > discussion for
> > > later.
> > > > >
> > > > > That's fine.
> > > > >
> > > > > >
> > > > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > > > >
> > > > > > > Probably.
> > > > > > >
> > > > > > > > PRI will directly go to the guest driver, and guest would
> > > > > > > > interact with IOMMU
> > > > > > > to service the paging request through IOMMU APIs.
> > > > > > >
> > > > > > > With PASID, it can't go directly.
> > > > > > >
> > > > > > When the request consist of PASID in it, it can.
> > > > > > But again these PCI-SIG extensions of PASID are not related to
> > > > > > device
> > > > > migration, so I am differing it.
> > > > > >
> > > > > > > > For PRI in vendor specific way needs a separate
> > > > > > > > discussion. It is not related to
> > > > > > > live migration.
> > > > > > >
> > > > > > > PRI itself is not related. But the point is, you can't
> > > > > > > simply pass through ATS/PRI now.
> > > > > > >
> > > > > > Ah ok. the whole 4K PCI config space where ATS/PRI
> > > > > > capabilities are located
> > > > > are trapped+emulated by hypervisor.
> > > > > > So?
> > > > > > So do we start emulating virito interfaces too for passthrough?
> > > > > > No.
> > > > > > Can one still continue to trap+emulate?
> > > > > > Sure why not?
> > > > >
> > > > > Then let's not limit your proposal to be used by "passthrough" only?
> > > > One can possibly build some variant of the existing virtio member
> > > > device
> > > using same owner and member scheme.
> > >
> > > It's not about the member/owner, it's about e.g whether the
> > > hypervisor can trap and emulate.
> > >
> > > I've pointed out that what you invent here is actually a partial new
> > > transport, for example, a hypervisor can trap and use things like
> > > device context in PF to bypass the registers in VF. This is the idea of
> transport commands/q.
> > >
> > I will not mix transport commands which are mainly useful for actual device
> operation for SIOV only for backward compatibility that too optionally.
> > One may still choose to have virtio common and device config in MMIO
> ofcourse at lower scale.
> >
> > Anyway, mixing migration context with actual SIOV specific thing is not correct
> as device context is read/write incremental values.
> 
> SIOV is transport level stuff, the transport virtqueue is designed in a way that is
> general enough to cover it. Let's not shift concepts.
> 
Such a TVQ is only for backward-compatible vPCI composition.
For ground-up work, such a TVQ must not be done through the owner device.
Each SIOV device should have its own channel to communicate directly with the device.

> One thing that you ignore is that, hypervisor can use what you invented as a
> transport for VF, no?
> 
No. By design, it is not a good idea to overload management commands with actual run-time guest commands.
The device context reads and writes are largely for incremental updates.

The VF driver has its own direct channel via its own BAR to talk to the device, so there is no need to transport via the PF.
For SIOV, it may be needed for backward-compatible vPCI composition.
It is hard to say whether that can be memory mapped on the BAR of the PF as well.
We have seen one device outside of virtio supporting it.
For scale, anyway, one needs to use the device's own cvq for complex configuration.

> >
> > > > If for that is some admin commands are missing, may be one can add
> them.
> > >
> > > I would then build the device context commands on top of the
> > > transport commands/q, then it would be complete.
> > >
> > > > No need to step on toes of use cases as they are different...
> > > >
> > > > > I've shown you that
> > > > >
> > > > > 1) you can't easily say you can pass through all the virtio
> > > > > facilities
> > > > > 2) how ambiguous for terminology like "passthrough"
> > > > >
> > > > It is not, it is well defined in v3, v2.
> > > > One can continue to argue and keep defining the variant and still
> > > > call it data
> > > path acceleration and then claim it as passthrough ...
> > > > But I won't debate this anymore as its just non-technical aspects
> > > > of least
> > > interest.
> > >
> > > You use this terminology in the spec which is all about technical,
> > > and you think how to define it is a matter of non-technical. This is
> > > self-contradictory. If you fail, it probably means it's ambiguous.
> > > Let's don't use that terminology.
> > >
> > What it means is described in theory of operation.
> >
> > > > We have technical tasks and more improved specs to update going
> forward.
> > >
> > > It's a burden to do the synchronization.
> > We have discussed this.
> > In current proposed the member device is not bifurcated,
> 
> It is. Part of the functions were carried via the PCI interface, some are carried
> via owner. You end up with two drivers to drive the devices.
> 
Nop.
All admin work of device migration is carried out via the owner device.
All guest triggered work is carried out using VF itself.

> Thanks
> 
> 
> > so it implements the necessary pieces.
> > Feature != burden.
> >
> > >
> > > > Working on extension for device specific contexts to enrich it.
> > >
> > > Again, making the proposal to be general is much more beneficial.
> >
> > Yes, it is general and like any other device-type, each has their extensions.
> > Infrastructure covers in v3.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-02  4:24                                                                             ` Jason Wang
@ 2023-11-02  6:10                                                                               ` Parav Pandit
  2023-11-02 14:01                                                                                 ` Michael S. Tsirkin
  2023-11-06  6:35                                                                                 ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-02  6:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 2, 2023 9:54 AM
> 
> On Wed, Nov 1, 2023 at 11:07 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Wednesday, November 1, 2023 6:03 AM
> > >
> > > On Tue, Oct 31, 2023 at 1:17 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: virtio-comment@lists.oasis-open.org
> > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Jason Wang
> > > > > Sent: Tuesday, October 31, 2023 7:07 AM
> > > > >
> > > > > On Mon, Oct 30, 2023 at 12:28 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Monday, October 30, 2023 9:35 AM
> > > > > > >
> > > > > > > > On 2023/10/26 11:50, Parav Pandit wrote:
> > > > > > > >> From: virtio-comment@lists.oasis-open.org
> > > > > > > >> <virtio-comment@lists.oasis- open.org> On Behalf Of Jason
> > > > > > > >> Wang For example, you still haven't succeeded in defining
> > > passthrough.
> > > > > > > > It was defined on 19th Oct in [1].
> > > > > > > > What part is not clear to you in definition of passthrough device?
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://lore.kernel.org/virtio-
> > > > > > > comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A
> > > > > > > > 57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > > >
> > > > > > >
> > > > > > > Let me copy-paste it again:
> > > > > > >
> > > > > > > For example, assuming you are correct, you still fail to
> > > > > > > explain
> > > > > > >
> > > > > > > 1) what is trapped and what's not, or what's the boundary
> > > > > > Passthrough definition was replied few times.
> > > > > > One of them is here,
> > > > > > https://lore.kernel.org/virtio-
> > > > > comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A
> > > > > > 57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > > I don’t know what you mean by 'explain'. What do you want to
> > > > > > be
> > > explained?
> > > > > > What is trapped is listed in
> > > > > > https://lore.kernel.org/virtio-
> > > > > comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A
> > > > > > 57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > > What is not trapped is also listed in
> > > > > > https://lore.kernel.org/virtio-
> > > > > comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A
> > > > > > 57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > > So what more do you want to explain in there?
> > > > >
> > > > > You explained that MSI-X is trapped but not the others. People
> > > > > may know
> > > why.
> > > > > or what's the boundary to choose to trap or not.
> > > > >
> > > > If a platform can support without trapping, it can be avoided as
> > > > well and can
> > > be added in the future.
> > >
> > > Who is going to do that synchronization?
> > Lets first bring that hypervisor sw design before discussing phantom problem
> solving.
> > All necessary modules will be involved in synchronization depending on how
> its done in future.
> 
> It's not the charge of the virtio spec to mandate any type of hypervisor design.
> But it looks to me you want to do that.
You keep attributing something to the proposal that it does not do in order to disregard it, which is incorrect.
The virtio spec does not mandate it.
Why?
Because virtio is so late in the feature development cycle that it has to fit into the existing hypervisor designs and proven UAPIs to support the feature.

So, as with RSS, flow filters, statistics, provisioning, and more, it is adding support for UAPIs that have already been present for a while across multiple devices.

So characterizing it as mandating is simply wrong.
It is addressing the existing use case.

One can always build a new hypervisor and demand new features from virtio.
That is perfectly fine.

Your expectation is that the device migration framework should work for an undefined hypervisor, which is just silly.

> > > > > >
> > > > > > > 2) if the hypervisor is not developed with those
> > > > > > > assumptions, things can work
> > > > > > What to explain in #2. :)
> > > > > > Things can expand when such hypervisor is born.
> > > > >
> > > > > So the point is still, to make your proposal to be useful in more use
> cases.
> > > > >
> > > > When a use case arise, device context can be expanded.
> > >
> > > It's not device context.
> > >
> > I don’t see why not. It is stored in the device.
> > Remapping part will be hypervisor specific, so it may be stored in platform
> specific migration data.
> 
> The point is, device context should work for all type of hypervisors.
> You can't claim it can only work with your "passthrough" model.
> 
Which other type do you specifically have in mind?
The current proposal should work for:
1. the passthrough model
2. maybe the vdpa model
To me, the model seems to work for both the passthrough and vdpa cases.

If something is missing for #2, either the device context can be updated or new commands can be added.

> >
> > > > No point in making things no one implements or not present in hypervisor.
> > > > The infrastructure is extendible so spec is covered for it.
> > >
> > > It would be problematic if you stick to claim "passthrough" but not.
> >
> > I don’t know what this means. I am not debating passthrough/non-
> passthrough.
> > What is inside the device, will be part of device-context.
> > What is part of the platform content, will be part of platform context.
> > Since this is generic to all types of PCI devices, I don’t see a need to over-solve
> it now in virtio.
> 
> Ok, so you agree it can work even if hypervisor want to trap?

Yes, I believe it can work.
If something is missing, we should discuss how to enhance it.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-02  6:10                                                                               ` Parav Pandit
@ 2023-11-02 14:01                                                                                 ` Michael S. Tsirkin
  2023-11-06  6:35                                                                                 ` Jason Wang
  1 sibling, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-02 14:01 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Nov 02, 2023 at 06:10:25AM +0000, Parav Pandit wrote:
> > > > > When a use case arise, device context can be expanded.

So I just suggest you scan the comments for suggestions and
just add them all. That will be much faster than all these mega
threads. One of the reasons this discussion keeps getting
stuck is that, for whatever is not in your proposal, you immediately
start asking for proof that it's actually needed or dismissing it
as not necessary. Spec design is fundamentally a speculative
endeavor: we are trying to predict what will make a good
implementation. So what we do is think up theoretical
use-cases and try to see whether the design can address them, and if
it can, then hopefully the completely unexpected that will
actually be needed in real life will be possible.

-- 
MST

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/

^ permalink raw reply	[flat|nested] 341+ messages in thread

* [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-10-08 11:25 ` [virtio-comment] [PATCH v1 3/8] device-context: Define the device context fields for device migration Parav Pandit
  2023-10-08 11:41   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-02 14:21   ` Michael S. Tsirkin
  2023-11-02 14:40     ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-02 14:21 UTC (permalink / raw)
  To: Parav Pandit; +Cc: virtio-comment, cohuck, sburla, shahafs, maorg, yishaih

On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
> +\begin{lstlisting}
> +struct virtio_dev_ctx_pci_vq_cfg {
> +        le16 vq_index;
> +        le16 queue_size;
> +        le16 queue_msix_vector;
> +        le64 queue_desc;
> +        le64 queue_driver;
> +        le64 queue_device;
> +};
> +\end{lstlisting}
> +
> +One or multiple entries of PCI Virtqueue Configuration Context may exist, each such
> +entry corresponds to a unique virtqueue identified by the \field{vq_index}.

So consider this example. In practice it is quite possible that
driver is in the process of specifying e.g. queue_desc, and it
set queue_desc_hi but not queue_desc_lo. Then what is queue_desc?
Just a combination of a legal value of queue_desc_hi and
illegal one of queue_desc_lo? This makes no sense.
queue_desc is fundamentally undefined until queue is enabled.
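
The partial-write case described above can be sketched in C. This is only
an illustration, not text from the patch: the struct desc_regs model, the
helper names and the example address are all assumptions. It shows that a
64-bit queue_desc assembled from the two halves is neither the old nor the
new address when only queue_desc_hi has been written.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the per-queue registers the PCI transport exposes. */
struct desc_regs {
    uint32_t queue_desc_lo;
    uint32_t queue_desc_hi;
    uint16_t queue_enable;
};

/* Combine the two halves the way a single 64-bit context field would. */
static uint64_t combined_queue_desc(const struct desc_regs *r)
{
    assert(r != 0);
    return ((uint64_t)r->queue_desc_hi << 32) | r->queue_desc_lo;
}

/*
 * Register state captured after the driver has written only the high
 * half of new_addr. The low half and queue_enable still hold their
 * reset value of 0 in this model.
 */
static struct desc_regs snapshot_after_hi_write_only(uint64_t new_addr)
{
    struct desc_regs r = { 0, 0, 0 };

    r.queue_desc_hi = (uint32_t)(new_addr >> 32);
    return r;
}
```

With an intended address of 0x123456000, the combined value in this model
is 0x100000000, which the driver never asked for as a whole.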

This is why I suggest that we have records that match
the transport. Offset in structure can then be used as a tag, and
so we do not need to come up with new definitions for each single thing.

And, this is only an instance of the general principle: do not
have two definitions of the same thing. In fact I'd argue our
transport structures are an example of a bad design and the
cost is that less used ones like mmio and ccw sometimes lag behind
on features.

-- 
MST

^ permalink raw reply	[flat|nested] 341+ messages in thread

* [virtio-comment] RE: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-02 14:21   ` Michael S. Tsirkin
@ 2023-11-02 14:40     ` Parav Pandit
  2023-11-02 14:53       ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-02 14:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 2, 2023 7:52 PM
> 
> On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
> > +\begin{lstlisting}
> > +struct virtio_dev_ctx_pci_vq_cfg {
> > +        le16 vq_index;
> > +        le16 queue_size;
> > +        le16 queue_msix_vector;
> > +        le64 queue_desc;
> > +        le64 queue_driver;
> > +        le64 queue_device;
> > +};
> > +\end{lstlisting}
> > +
> > +One or multiple entries of PCI Virtqueue Configuration Context may
> > +exist, each such entry corresponds to a unique virtqueue identified by the
> \field{vq_index}.
> 
> So consider this example. In practice it is quite possible that driver is in the
> process of specifying e.g. queue_desc, and it set queue_desc_hi but not
> queue_desc_lo. Then what is queue_desc?
> Just a combination of a legal value of queue_desc_hi and illegal one of
> queue_desc_lo? This makes no sense.
> queue_desc is fundamentally undefined until queue is enabled.

Sure, the next read of the device context will reflect the updated value.
The destination will not operate on this vq anyway, as the device mode is freeze.
The whole device context is not atomic, so having one field like this be non-atomic, similar to the others, is ok.

In this example, with the partial write, queue_enable should still be 0.

> 
> This is why I suggest that we have records that match the transport. Offset in
> structure can then we used as a tag and so we do not need to come up with
> new definitions for each single thing.
> 
If I understood you correctly, you prefer to transfer the virtio config space as offset-and-value records, with the offset serving as the tag?
If so, how does the tag help if it still transfers a partial value?

> And, this is only an instance of the general principle: do not have two definitions
> of the same thing. In fact I'd argue our transport structures are an example of a
> bad design and the cost is that less used ones like mmio and ccw sometimes lag
> behind on features.

I totally agree on not duplicating it.

For 64 VQs, the content of struct virtio_dev_ctx_pci_vq_cfg sits behind 8 registers.
So for them it has to be contained in its own struct, such as struct virtio_dev_ctx_pci_vq_cfg, right?

And the current struct virtio_pci_common_cfg gives the current snapshot of the register file.

This is why there is some duplication.
To avoid duplicating these registers, we will have to bisect each field of it and omit these 3 address registers.

I am not able to see the gain from that overhead. Do you?

^ permalink raw reply	[flat|nested] 341+ messages in thread

* [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-02 14:40     ` [virtio-comment] " Parav Pandit
@ 2023-11-02 14:53       ` Michael S. Tsirkin
  2023-11-02 15:06         ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-02 14:53 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 02, 2023 at 02:40:57PM +0000, Parav Pandit wrote:
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 2, 2023 7:52 PM
> > 
> > On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
> > > +\begin{lstlisting}
> > > +struct virtio_dev_ctx_pci_vq_cfg {
> > > +        le16 vq_index;
> > > +        le16 queue_size;
> > > +        le16 queue_msix_vector;
> > > +        le64 queue_desc;
> > > +        le64 queue_driver;
> > > +        le64 queue_device;
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +One or multiple entries of PCI Virtqueue Configuration Context may
> > > +exist, each such entry corresponds to a unique virtqueue identified by the
> > \field{vq_index}.
> > 
> > So consider this example. In practice it is quite possible that driver is in the
> > process of specifying e.g. queue_desc, and it set queue_desc_hi but not
> > queue_desc_lo. Then what is queue_desc?
> > Just a combination of a legal value of queue_desc_hi and illegal one of
> > queue_desc_lo? This makes no sense.
> > queue_desc is fundamentally undefined until queue is enabled.
> 
> Sure, in next read of the device context the updated value will reflect.
> The destination will not work on this vq anyway as the device mode is freeze.
> The whole device context is not atomic, so having one field like this as nonatomic similar to others is ok.
> 
> In this example queue_enabled with the partial write should be still set to 0.
> 
> > 
> > This is why I suggest that we have records that match the transport. Offset in
> > structure can then we used as a tag and so we do not need to come up with
> > new definitions for each single thing.
> > 
> If I understood you correctly, you prefer to transfer virtio config space as offset and value as tag?
> If so, how tag helps if it still transfers partial value?

For example:

struct virtio_pci_common_cfg {
        /* About the whole device. */
        __le32 device_feature_select;   /* read-write */
        __le32 device_feature;          /* read-only */
        __le32 guest_feature_select;    /* read-write */
        __le32 guest_feature;           /* read-write */
        __le16 msix_config;             /* read-write */
        __le16 num_queues;              /* read-only */
        __u8 device_status;             /* read-write */
        __u8 config_generation;         /* read-only */

        /* About a specific virtqueue. */
        __le16 queue_select;            /* read-write */

        __le16 queue_size;              /* read-write, power of 2. */
        __le16 queue_msix_vector;       /* read-write */
        __le16 queue_enable;            /* read-write */
        __le16 queue_notify_off;        /* read-only */
        __le32 queue_desc_lo;           /* read-write */
        __le32 queue_desc_hi;           /* read-write */
        __le32 queue_avail_lo;          /* read-write */
        __le32 queue_avail_hi;          /* read-write */
        __le32 queue_used_lo;           /* read-write */
        __le32 queue_used_hi;           /* read-write */
};

We would have:
    tag: 32 (queue_desc_lo), len: 4
    tag: 36 (queue_desc_hi), len: 4


The point is that the values programmed just map 1:1 to
what is exposed in the transport.

This means that the structure is different for different transports btw.
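
As a rough check of the offset-as-tag idea, the tag values can be derived
from the structure itself with offsetof(). The sketch below mirrors the
layout quoted above using fixed-width types; the name struct
pci_common_cfg_layout and the CFG_TAG() macro are made up for this
example, and plain uintN_t stands in for the __leN types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order and widths as struct virtio_pci_common_cfg above. */
struct pci_common_cfg_layout {
    uint32_t device_feature_select;
    uint32_t device_feature;
    uint32_t guest_feature_select;
    uint32_t guest_feature;
    uint16_t msix_config;
    uint16_t num_queues;
    uint8_t  device_status;
    uint8_t  config_generation;
    uint16_t queue_select;
    uint16_t queue_size;
    uint16_t queue_msix_vector;
    uint16_t queue_enable;
    uint16_t queue_notify_off;
    uint32_t queue_desc_lo;
    uint32_t queue_desc_hi;
    uint32_t queue_avail_lo;
    uint32_t queue_avail_hi;
    uint32_t queue_used_lo;
    uint32_t queue_used_hi;
};

/* Hypothetical helper: the tag of a register is its byte offset. */
#define CFG_TAG(field) \
    ((uint16_t)offsetof(struct pci_common_cfg_layout, field))

/* Compile-time checks of the two tags used in the example. */
static_assert(offsetof(struct pci_common_cfg_layout, queue_desc_lo) == 32,
              "queue_desc_lo is at byte offset 32");
static_assert(offsetof(struct pci_common_cfg_layout, queue_desc_hi) == 36,
              "queue_desc_hi is at byte offset 36");
```

Since every field here is naturally aligned, the compiler inserts no
padding and the offsets match the register offsets of the transport.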


    



> > And, this is only an instance of the general principle: do not have two definitions
> > of the same thing. In fact I'd argue our transport structures are an example of a
> > bad design and the cost is that less used ones like mmio and ccw sometimes lag
> > behind on features.
> 
> I totally agree on not duplicating it.
> 
> For 64 VQs their content of struct virtio_dev_ctx_pci_vq_cfg is behind 8 registers.
> So for them there has contained in their own struct such as struct virtio_dev_ctx_pci_vq_cfg, right?
> 
> And current struct virtio_pci_common_cfg is giving the current snap shot of register file.
> 
> This is why there is some duplication.
> To avoid duplication of this registers, we will have to bisect each field of it and omit these 3 address registers.
> 
> Not able to see the gain of that overhead. Do you?

Not bisect. My idea is to just have each register in its own tag+length
field. And the gain is that we can easily add fields without any pain
and without any duplication or special documentation effort.
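
A minimal sketch of what such per-register tag+length records could look
like follows. Nothing here is defined by the series or the spec: the
16-bit tag and length widths, struct ctx_rec_hdr, ctx_put() and ctx_get()
are assumptions for illustration, and the header is stored in host byte
order for brevity where a real format would fix the endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: tag is the register's offset in the
 * transport structure, len is the register width in bytes. */
struct ctx_rec_hdr {
    uint16_t tag;
    uint16_t len;
};

/* Append one record at pos. Returns the new write position, or 0 if
 * the record does not fit in the buffer. */
static size_t ctx_put(uint8_t *buf, size_t cap, size_t pos,
                      uint16_t tag, const void *val, uint16_t len)
{
    struct ctx_rec_hdr hdr = { tag, len };

    assert(buf != NULL && val != NULL);
    if (pos > cap || cap - pos < sizeof(hdr) + len)
        return 0;
    memcpy(buf + pos, &hdr, sizeof(hdr));
    memcpy(buf + pos + sizeof(hdr), val, len);
    return pos + sizeof(hdr) + len;
}

/* Look up a record by tag. Copies its value to out and returns 1 when a
 * record with that tag and length exists, otherwise returns 0. Records
 * with unknown tags are simply skipped. */
static int ctx_get(const uint8_t *buf, size_t used, uint16_t tag,
                   void *out, uint16_t len)
{
    size_t pos = 0;

    while (used - pos >= sizeof(struct ctx_rec_hdr)) {
        struct ctx_rec_hdr hdr;

        memcpy(&hdr, buf + pos, sizeof(hdr));
        pos += sizeof(hdr);
        if (used - pos < hdr.len)
            return 0;               /* truncated record */
        if (hdr.tag == tag && hdr.len == len) {
            memcpy(out, buf + pos, len);
            return 1;
        }
        pos += hdr.len;
    }
    return 0;
}
```

A register that was never written is just absent from the stream, and a
reader that does not know a tag can skip it by its length, which is what
makes adding fields cheap in this scheme.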

-- 
MST




* [virtio-comment] RE: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-02 14:53       ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-02 15:06         ` Parav Pandit
  2023-11-02 17:05           ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-02 15:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 2, 2023 8:24 PM
> 
> On Thu, Nov 02, 2023 at 02:40:57PM +0000, Parav Pandit wrote:
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 2, 2023 7:52 PM
> > >
> > > On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
> > > > +\begin{lstlisting}
> > > > +struct virtio_dev_ctx_pci_vq_cfg {
> > > > +        le16 vq_index;
> > > > +        le16 queue_size;
> > > > +        le16 queue_msix_vector;
> > > > +        le64 queue_desc;
> > > > +        le64 queue_driver;
> > > > +        le64 queue_device;
> > > > +};
> > > > +\end{lstlisting}
> > > > +
> > > > +One or multiple entries of PCI Virtqueue Configuration Context
> > > > +may exist, each such entry corresponds to a unique virtqueue
> > > > +identified by the
> > > \field{vq_index}.
> > >
> > > So consider this example. In practice it is quite possible that
> > > driver is in the process of specifying e.g. queue_desc, and it set
> > > queue_desc_hi but not queue_desc_lo. Then what is queue_desc?
> > > Just a combination of a legal value of queue_desc_hi and illegal one
> > > of queue_desc_lo? This makes no sense.
> > > queue_desc is fundamentally undefined until queue is enabled.
> >
> > Sure, in next read of the device context the updated value will reflect.
> > The destination will not work on this vq anyway as the device mode is freeze.
> > The whole device context is not atomic, so having one field like this as
> nonatomic similar to others is ok.
> >
> > In this example queue_enabled with the partial write should be still set to 0.
> >
> > >
> > > This is why I suggest that we have records that match the transport.
> > > Offset in structure can then be used as a tag and so we do not need
> > > to come up with new definitions for each single thing.
> > >
> > If I understood you correctly, you prefer to transfer virtio config space as offset
> and value as tag?
> > If so, how tag helps if it still transfers partial value?
> 
> For example:
> 
> struct virtio_pci_common_cfg {
>         /* About the whole device. */
>         __le32 device_feature_select;   /* read-write */
>         __le32 device_feature;          /* read-only */
>         __le32 guest_feature_select;    /* read-write */
>         __le32 guest_feature;           /* read-write */
>         __le16 msix_config;             /* read-write */
>         __le16 num_queues;              /* read-only */
>         __u8 device_status;             /* read-write */
>         __u8 config_generation;         /* read-only */
> 
>         /* About a specific virtqueue. */
>         __le16 queue_select;            /* read-write */
> 
>         __le16 queue_size;              /* read-write, power of 2. */
>         __le16 queue_msix_vector;       /* read-write */
>         __le16 queue_enable;            /* read-write */
>         __le16 queue_notify_off;        /* read-only */
>         __le32 queue_desc_lo;           /* read-write */
>         __le32 queue_desc_hi;           /* read-write */
>         __le32 queue_avail_lo;          /* read-write */
>         __le32 queue_avail_hi;          /* read-write */
>         __le32 queue_used_lo;           /* read-write */
>         __le32 queue_used_hi;           /* read-write */
> };
> 
> We would have:
>     tag: 32 (queue_desc_lo), len: 4
>     tag: 36 (queue_desc_hi), len: 4
> 
So tag is offset. Ok.
> 
> The point is that the values programmed just map 1:1 to what is exposed in the
> transport.
> 
For queue addresses, 
> This means that the structure is different for different transports btw.
> 
> 
> 
> 
> 
> 
> > > And, this is only an instance of the general principle: do not have
> > > two definitions of the same thing. In fact I'd argue our transport
> > > structures are an example of a bad design and the cost is that less
> > > used ones like mmio and ccw sometimes lag behind on features.
> >
> > I totally agree on not duplicating it.
> >
> > For 64 VQs their content of struct virtio_dev_ctx_pci_vq_cfg is behind 8
> registers.
> > So for them there has contained in their own struct such as struct
> virtio_dev_ctx_pci_vq_cfg, right?
> >
> > And current struct virtio_pci_common_cfg is giving the current snap shot of
> register file.
> >
> > This is why there is some duplication.
> > To avoid duplication of this registers, we will have to bisect each field of it and
> omit these 3 address registers.
> >
> > Not able to see the gain of that overhead. Do you?
> 
> Not bisect. My idea is to just have each register in its own tag+length field. And
> the gain is that we easily add fields without any pain and any duplication and
> special documentation effort.
We have to transport all the vq fields located behind the common config registers as a struct anyway, don't we?
And if we use tags for the virtio common config space (instead of a struct), why won't there be duplication? This is the part I am still missing.



* [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-02 15:06         ` [virtio-comment] " Parav Pandit
@ 2023-11-02 17:05           ` Michael S. Tsirkin
  0 siblings, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-02 17:05 UTC (permalink / raw)
  To: Parav Pandit
  Cc: virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 02, 2023 at 03:06:49PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 2, 2023 8:24 PM
> > 
> > On Thu, Nov 02, 2023 at 02:40:57PM +0000, Parav Pandit wrote:
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 2, 2023 7:52 PM
> > > >
> > > > On Sun, Oct 08, 2023 at 02:25:50PM +0300, Parav Pandit wrote:
> > > > > +\begin{lstlisting}
> > > > > +struct virtio_dev_ctx_pci_vq_cfg {
> > > > > +        le16 vq_index;
> > > > > +        le16 queue_size;
> > > > > +        le16 queue_msix_vector;
> > > > > +        le64 queue_desc;
> > > > > +        le64 queue_driver;
> > > > > +        le64 queue_device;
> > > > > +};
> > > > > +\end{lstlisting}
> > > > > +
> > > > > +One or multiple entries of PCI Virtqueue Configuration Context
> > > > > +may exist, each such entry corresponds to a unique virtqueue
> > > > > +identified by the
> > > > \field{vq_index}.
> > > >
> > > > So consider this example. In practice it is quite possible that
> > > > driver is in the process of specifying e.g. queue_desc, and it set
> > > > queue_desc_hi but not queue_desc_lo. Then what is queue_desc?
> > > > Just a combination of a legal value of queue_desc_hi and illegal one
> > > > of queue_desc_lo? This makes no sense.
> > > > queue_desc is fundamentally undefined until queue is enabled.
> > >
> > > Sure, in next read of the device context the updated value will reflect.
> > > The destination will not work on this vq anyway as the device mode is freeze.
> > > The whole device context is not atomic, so having one field like this as
> > nonatomic similar to others is ok.
> > >
> > > In this example queue_enabled with the partial write should be still set to 0.
> > >
> > > >
> > > > This is why I suggest that we have records that match the transport.
> > > > Offset in structure can then be used as a tag and so we do not need
> > > > to come up with new definitions for each single thing.
> > > >
> > > If I understood you correctly, you prefer to transfer virtio config space as offset
> > and value as tag?
> > > If so, how tag helps if it still transfers partial value?
> > 
> > For example:
> > 
> > struct virtio_pci_common_cfg {
> >         /* About the whole device. */
> >         __le32 device_feature_select;   /* read-write */
> >         __le32 device_feature;          /* read-only */
> >         __le32 guest_feature_select;    /* read-write */
> >         __le32 guest_feature;           /* read-write */
> >         __le16 msix_config;             /* read-write */
> >         __le16 num_queues;              /* read-only */
> >         __u8 device_status;             /* read-write */
> >         __u8 config_generation;         /* read-only */
> > 
> >         /* About a specific virtqueue. */
> >         __le16 queue_select;            /* read-write */
> > 
> >         __le16 queue_size;              /* read-write, power of 2. */
> >         __le16 queue_msix_vector;       /* read-write */
> >         __le16 queue_enable;            /* read-write */
> >         __le16 queue_notify_off;        /* read-only */
> >         __le32 queue_desc_lo;           /* read-write */
> >         __le32 queue_desc_hi;           /* read-write */
> >         __le32 queue_avail_lo;          /* read-write */
> >         __le32 queue_avail_hi;          /* read-write */
> >         __le32 queue_used_lo;           /* read-write */
> >         __le32 queue_used_hi;           /* read-write */
> > };
> > 
> > We would have:
> >     tag: 32 (queue_desc_lo), len: 4
> >     tag: 36 (queue_desc_hi), len: 4
> > 
> So tag is offset. Ok.
> > 
> > The point is that the values programmed just map 1:1 to what is exposed in the
> > transport.
> > 
> For queue addresses, 
> > This means that the structure is different for different transports btw.
> > 
> > 
> > 
> > 
> > 
> > 
> > > > And, this is only an instance of the general principle: do not have
> > > > two definitions of the same thing. In fact I'd argue our transport
> > > > structures are an example of a bad design and the cost is that less
> > > > used ones like mmio and ccw sometimes lag behind on features.
> > >
> > > I totally agree on not duplicating it.
> > >
> > > For 64 VQs their content of struct virtio_dev_ctx_pci_vq_cfg is behind 8
> > registers.
> > > So for them there has contained in their own struct such as struct
> > virtio_dev_ctx_pci_vq_cfg, right?
> > >
> > > And current struct virtio_pci_common_cfg is giving the current snap shot of
> > register file.
> > >
> > > This is why there is some duplication.
> > > To avoid duplication of this registers, we will have to bisect each field of it and
> > omit these 3 address registers.
> > >
> > > Not able to see the gain of that overhead. Do you?
> > 
> > Not bisect. My idea is to just have each register in its own tag+length field. And
> > the gain is that we easily add fields without any pain and any duplication and
> > special documentation effort.
> We have to transport all the vq fields located behind the common config registers as struct anyway, isn’t it?
> And if we use tag for virtio common config space (instead of struct), why there won't be duplication? This is the part I still miss.

Not sure I understand the question.  The point is that when we add a new
field to common config we don't need to also add it to the migration
format - it has an offset and that automatically defines the format.
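
To make that concrete, here is a rough sketch of such a stream in C. It
assumes each record is a 2-byte little-endian tag (the register's byte
offset), a 2-byte little-endian length, and then the raw value bytes.
ctx_put and ctx_apply are made-up names, and the record layout is an
assumption for illustration only, not something the series defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append one record at buf[pos]: tag, length, then the value bytes.
 * Returns the new write position, or 0 if the record does not fit. */
static size_t ctx_put(uint8_t *buf, size_t cap, size_t pos,
                      uint16_t tag, const void *val, uint16_t len)
{
        if (pos > cap || cap - pos < (size_t)4 + len)
                return 0;
        buf[pos]     = (uint8_t)(tag & 0xff);
        buf[pos + 1] = (uint8_t)(tag >> 8);
        buf[pos + 2] = (uint8_t)(len & 0xff);
        buf[pos + 3] = (uint8_t)(len >> 8);
        memcpy(buf + pos + 4, val, len);
        return pos + 4 + len;
}

/* Replay every record into a register image, using the tag as the
 * destination offset. Returns 0 on success, -1 if the stream is
 * truncated or a record falls outside the register image. */
static int ctx_apply(const uint8_t *buf, size_t used,
                     uint8_t *regs, size_t regs_len)
{
        size_t pos = 0;

        while (pos < used) {
                uint16_t tag, len;

                if (used - pos < 4)
                        return -1;
                tag = (uint16_t)(buf[pos] | (buf[pos + 1] << 8));
                len = (uint16_t)(buf[pos + 2] | (buf[pos + 3] << 8));
                pos += 4;
                if (used - pos < len)
                        return -1;
                if ((size_t)tag + len > regs_len)
                        return -1;
                memcpy(regs + tag, buf + pos, len);
                pos += len;
        }
        return 0;
}
```

The reader only looks at offset and length, so in this sketch a register
added to the common config later needs no new record definition as long as
it falls inside the register image.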

-- 
MST





* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-02  6:10                                                                         ` Parav Pandit
@ 2023-11-06  6:34                                                                           ` Jason Wang
  2023-11-06  7:05                                                                             ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-06  6:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, November 2, 2023 9:56 AM
> >
> > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > >
> > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > >
> > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of Jason
> > > > > > > > Wang
> > > > > > > >
> > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > >
> > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > > > > >
> > > > > > > > > > > > How do you know that?
> > > > > > > > > > > Because for passthrough, the hypervisor is not
> > > > > > > > > > > involved in dealing with VQ at
> > > > > > > > > > all.
> > > > > > > > > >
> > > > > > > > > > Ok, so if I understand correctly, you are saying your
> > > > > > > > > > design can't work for the case of PASID assignment.
> > > > > > > > > >
> > > > > > > > > No. PASID assignment will happen from the guest for its
> > > > > > > > > own use and device
> > > > > > > > migration will just work fine because device context will capture this.
> > > > > > > >
> > > > > > > > It's not about device context. We're discussing "passthrough", no?
> > > > > > > >
> > > > > > > Not sure, we are discussing same.
> > > > > > > A member device is passthrough to the guest, dealing with its
> > > > > > > own PASIDs and
> > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > So VQ context captured by the hypervisor, will have some PASID
> > > > > > > attached to
> > > > > > this VQ.
> > > > > > > Device context will be updated.
> > > > > > >
> > > > > > > > You want all virtio stuff to be "passthrough", but assigning
> > > > > > > > a PASID to a specific virtqueue in the guest must be trapped.
> > > > > > > >
> > > > > > > No. PASID assignment to a specific virtqueue in the guest must
> > > > > > > go directly
> > > > > > from guest to device.
> > > > > >
> > > > > > This works like setting CR3, you can't simply let it go from guest to host.
> > > > > >
> > > > > > Host IOMMU driver needs to know the PASID to program the IO page
> > > > > > tables correctly.
> > > > > >
> > > > > This will be done by the IOMMU.
> > > > >
> > > > > > > When guest iommu may need to communicate anything for this
> > > > > > > PASID, it will
> > > > > > come through its proper IOMMU channel/hypercall.
> > > > > >
> > > > > > Let's say using PASID X for queue 0, this knowledge is beyond
> > > > > > the IOMMU scope but belongs to virtio. Or please explain how it
> > > > > > can work when it goes directly from guest to device.
> > > > > >
> > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > >
> > > > It has one.
> > > >
> > > > > Ok, for theory's sake, say it is there.
> > > > >
> > > > > Virtio driver will assign the PASID directly from guest driver to
> > > > > device using a
> > > > create_vq(pasid=X) command.
> > > > > Same process is somehow attached the PASID by the guest OS.
> > > > > The whole PASID range is known to the hypervisor when the device
> > > > > is handed
> > > > over to the guest VM.
> > > >
> > > > How can it know?
> > > >
> > > > > So PASID mapping is setup by the hypervisor IOMMU at this point.
> > > >
> > > > You disallow the PASID to be virtualized here. What's more, such a
> > > > PASID passthrough has security implications.
> > > >
> > > No. virtio spec is not disallowing. At least for sure, this series is not the one.
> > > My main point is, virtio device interface will not be the source of hypercall to
> > program IOMMU in the hypervisor.
> > > It is something to be done by IOMMU side.
> >
> > So unless vPASID can be used by the hardware you need to trap the mapping
> > from a PASID to a virtqueue. Then you need virtio specific knowledge.
> >
> vPASID by hardware is unlikely to be used by hw PCI EP devices at least in any near term future.
> This requires either vPASID to pPASID table in device or in IOMMU.

So we are on the same page.

Claiming a method that can only work for passthrough or emulation is
not good. We all know virtualization is passthrough + emulation.

>
> > >
> > > > Again, we are talking about different things, I've tried to show you
> > > > that there are cases that passthrough can't work but if you think
> > > > the only way for migration is to use passthrough in every case, you will
> > probably fail.
> > > >
> > > I didn't say only way for migration is passthrough.
> > > Passthrough is clearly one way.
> > > Other ways may be possible.
> > >
> > > > >
> > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > There are works ongoing to make vPASID work for the
> > > > > > > > > > > > guest like
> > > > > > vSVA.
> > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > > >
> > > > > > > > > > Great, you find another limitation of "passthrough" by yourself.
> > > > > > > > > >
> > > > > > > > > No. it is not the limitation it is just the way it does
> > > > > > > > > not need complex SVA to
> > > > > > > > split the device for unrelated usage.
> > > > > > > >
> > > > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > > > >
> > > > > > > He he, I am not limiting, again misunderstanding or wrong attribution.
> > > > > > > I explained that hypervisor for passthrough does not need SVA.
> > > > > > > Guest can do anything it wants from the guest OS with the
> > > > > > > member
> > > > device.
> > > > > >
> > > > > > Ok, so the point stills, see above.
> > > > >
> > > > > I don’t think so. The guest owns its PASID space
> > > >
> > > > Again, vPASID to PASID can't be done hardware unless I miss some
> > > > recent features of IOMMUs.
> > > >
> > > Cpu vendors have different way of doing vPASID to pPASID.
> >
> > At least for the current version of major IOMMU vendors, such translation (aka
> > PASID remapping) is not implemented in the hardware so it needs to be trapped
> > first.
> >
> Right. So it is really far in the future, at least a few years away.
>
> > > It is still an early space for virtio.
> > >
> > > > > and directly communicates like any other device attribute.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > Each passthrough device has PASID from its own space
> > > > > > > > > > > fully managed by the
> > > > > > > > > > guest.
> > > > > > > > > > > Some cpu required vPASID and SIOV is not going this way
> > anymore.
> > > > > > > > > >
> > > > > > > > > > Then how to migrate? Invent a full set of something else
> > > > > > > > > > through another giant series like this to migrate to the SIOV
> > thing?
> > > > > > > > > > That's a mess for
> > > > > > > > sure.
> > > > > > > > > >
> > > > > > > > > SIOV will for sure reuse most or all parts of this work,
> > > > > > > > > almost entirely
> > > > as_is.
> > > > > > > > > vPASID is cpu/platform specific things not part of the SIOV devices.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > If at all it is done, it will be done from the
> > > > > > > > > > > > > guest by the driver using virtio
> > > > > > > > > > > > interface.
> > > > > > > > > > > >
> > > > > > > > > > > > Then you need to trap. Such things couldn't be
> > > > > > > > > > > > passed through to guests
> > > > > > > > > > directly.
> > > > > > > > > > > >
> > > > > > > > > > > Only PASID capability is trapped. PASID allocation and
> > > > > > > > > > > usage is directly from
> > > > > > > > > > guest.
> > > > > > > > > >
> > > > > > > > > > How can you achieve this? Assigning a PASID to a device
> > > > > > > > > > is completely
> > > > > > > > > > device(virtio) specific. How can you use a general layer
> > > > > > > > > > without the knowledge of virtio to trap that?
> > > > > > > > > When one wants to map vPASID to pPASID a platform needs to
> > > > > > > > > be
> > > > > > involved.
> > > > > > > >
> > > > > > > > I'm not talking about how to map vPASID to pPASID, it's out
> > > > > > > > of the scope of virtio. I'm talking about assigning a vPASID
> > > > > > > > to a specific virtqueue or other virtio function in the guest.
> > > > > > > >
> > > > > > > That can be done in the guest. The key is guest wont know that
> > > > > > > it is dealing
> > > > > > with vPASID.
> > > > > > > It will follow the same principle from your paper of
> > > > > > > equivalency, where virtio
> > > > > > software layer will assign PASID to VQ and communicate to device.
> > > > > > >
> > > > > > > Anyway, all of this just digression from current series.
> > > > > >
> > > > > > It's not, as you mention that only MSI-X is trapped, I give you another
> > one.
> > > > > >
> > > > > PASID access from the guest to be done fully by the guest IOMMU.
> > > > > Not by virtio devices.
> > > > >
> > > > > > >
> > > > > > > > You need a virtio specific queue or capability to assign a
> > > > > > > > PASID to a specific virtqueue, and that can't be done
> > > > > > > > without trapping and without virtio specific knowledge.
> > > > > > > >
> > > > > > > I disagree. PASID assignment to a virtqueue in future from
> > > > > > > guest virtio driver to
> > > > > > device is uniform method.
> > > > > > > Whether its PF assigning PASID to VQ of self, Or VF driver in
> > > > > > > the guest assigning PASID to VQ.
> > > > > > >
> > > > > > > All same.
> > > > > > > Only IOMMU layer hypercalls will know how to deal with PASID
> > > > > > > assignment at
> > > > > > platform layer to setup the domain etc table.
> > > > > > >
> > > > > > > And this is way beyond our device migration discussion.
> > > > > > > By any means, if you were implying that somehow vq to PASID
> > > > > > > assignment
> > > > > > _may_ need trap+emulation, hence whole device migration to
> > > > > > depend on some
> > > > > > trap+emulation, then surely I do not agree with it.
> > > > > >
> > > > > > See above.
> > > > > >
> > > > > Yeah, I disagree to such implying.
> > > > >
> > > > > > >
> > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD isolating the
> > > > > > > guest process and
> > > > > > all of that just works on efficiency and equivalence principle
> > > > > > already for a decade now without any trap+emulation.
> > > > > > >
> > > > > > > > > When virtio passthrough device is in guest, it has all its
> > > > > > > > > PASID
> > > > accessible.
> > > > > > > > >
> > > > > > > > > All these is large deviation from current discussion of
> > > > > > > > > this series, so I will keep
> > > > > > > > it short.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Regardless it is not relevant to passthrough mode as
> > > > > > > > > > > PASID is yet another
> > > > > > > > > > resource.
> > > > > > > > > > > And for some cpu if it is trapped, it is generic
> > > > > > > > > > > layer, that does not require virtio
> > > > > > > > > > involvement.
> > > > > > > > > > > So virtio interface asking to trap something because
> > > > > > > > > > > generic facility has done
> > > > > > > > > > in not the approach.
> > > > > > > > > >
> > > > > > > > > > This misses the point of PASID. How to use PASID is
> > > > > > > > > > totally device
> > > > > > specific.
> > > > > > > > > Sure, and how to virtualize vPASID/pPASID is platform
> > > > > > > > > specific as single PASID
> > > > > > > > can be used by multiple devices and process.
> > > > > > > >
> > > > > > > > See above, I think we're talking about different things.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > Capabilities of #2 is generic across all pci
> > > > > > > > > > > > > devices, so it will be handled by the
> > > > > > > > > > > > HV.
> > > > > > > > > > > > > ATS/PRI cap is also generic manner handled by the
> > > > > > > > > > > > > HV and PCI
> > > > > > device.
> > > > > > > > > > > >
> > > > > > > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU.
> > > > > > > > > > > > You can simply do ATS/PRI passthrough but with an
> > > > > > > > > > > > emulated
> > > > vIOMMU.
> > > > > > > > > > > And that is not the reason for virtio device to build
> > > > > > > > > > > trap+emulation for
> > > > > > > > > > passthrough member devices.
> > > > > > > > > >
> > > > > > > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > >
> > > > > > > > Shouldn't it arrive at platform IOMMU first? The path should
> > > > > > > > be PRI
> > > > > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest
> > IOMMU.
> > > > > > > >
> > > > > > > Above sequence seems right.
> > > > > > >
> > > > > > > > And things will be more complicated when (v)PASID is used.
> > > > > > > > So you can't simply let PRI go directly to the guest with
> > > > > > > > the current
> > > > architecture.
> > > > > > > >
> > > > > > > In current architecture of the pci VF, PRI does not go directly to the
> > guest.
> > > > > > > (and that is not reason to trap and emulate other things).
> > > > > >
> > > > > > Ok, so beyond MSI-X we need to trap PRI, and we will probably
> > > > > > trap other things in the future like PASID assignment.
> > > > > PRI etc all belong to generic PCI 4K config space region.
> > > >
> > > > It's not about the capability, it's about the whole process of PRI
> > > > request handling. We've agreed that the PRI request needs to be
> > > > trapped by the hypervisor and then delivered to the vIOMMU.
> > > >
> > >
> > > > > Trap+emulation done in generic manner without involving virtio or
> > > > > Trap+other
> > > > device types.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > how can you pass
> > > > > > > > > > through a hardware PRI request to a guest directly
> > > > > > > > > > without trapping it
> > > > > > then?
> > > > > > > > > > What's more, PCIE allows the PRI to be done in a vendor
> > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > for virtio?
> > > > > > > > > >
> > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > Do you have a reference to the ECN that enables vendor
> > > > > > > > > specific way of PRI? I
> > > > > > > > would like to read it.
> > > > > > > >
> > > > > > > > I mean it doesn't forbid us to build a virtio specific
> > > > > > > > interface for I/O page fault report and recovery.
> > > > > > > >
> > > > > > > So PRI of PCI does not allow. It is ODP kind of technique you meant
> > above.
> > > > > > > Yes one can build.
> > > > > > > Ok. unrelated to device migration, so I will park this good
> > > > > > > discussion for
> > > > later.
> > > > > >
> > > > > > That's fine.
> > > > > >
> > > > > > >
> > > > > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > > > > >
> > > > > > > > Probably.
> > > > > > > >
> > > > > > > > > PRI will directly go to the guest driver, and guest would
> > > > > > > > > interact with IOMMU
> > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > >
> > > > > > > > With PASID, it can't go directly.
> > > > > > > >
> > > > > > > When the request consist of PASID in it, it can.
> > > > > > > But again these PCI-SIG extensions of PASID are not related to
> > > > > > > device
> > > > > > migration, so I am deferring it.
> > > > > > >
> > > > > > > > > For PRI in vendor specific way needs a separate
> > > > > > > > > discussion. It is not related to
> > > > > > > > live migration.
> > > > > > > >
> > > > > > > > PRI itself is not related. But the point is, you can't
> > > > > > > > simply pass through ATS/PRI now.
> > > > > > > >
> > > > > > > Ah ok. the whole 4K PCI config space where ATS/PRI
> > > > > > > capabilities are located
> > > > > > are trapped+emulated by hypervisor.
> > > > > > > So?
> > > > > > > So do we start emulating virtio interfaces too for passthrough?
> > > > > > > No.
> > > > > > > Can one still continue to trap+emulate?
> > > > > > > Sure why not?
> > > > > >
> > > > > > Then let's not limit your proposal to be used by "passthrough" only?
> > > > > One can possibly build some variant of the existing virtio member
> > > > > device
> > > > using same owner and member scheme.
> > > >
> > > > It's not about the member/owner, it's about e.g whether the
> > > > hypervisor can trap and emulate.
> > > >
> > > > I've pointed out that what you invent here is actually a partial new
> > > > transport, for example, a hypervisor can trap and use things like
> > > > device context in PF to bypass the registers in VF. This is the idea of
> > transport commands/q.
> > > >
> > > I will not mix transport commands which are mainly useful for actual device
> > operation for SIOV only for backward compatibility that too optionally.
> > > One may still choose to have virtio common and device config in MMIO
> > of course at lower scale.
> > >
> > > Anyway, mixing migration context with actual SIOV specific thing is not correct
> > as device context is read/write incremental values.
> >
> > SIOV is transport level stuff, the transport virtqueue is designed in a way that is
> > general enough to cover it. Let's not shift concepts.
> >
> Such TVQ is only for backward compatible vPCI composition.
> For ground up work such TVQ must not be done through the owner device.

That's the idea actually.

> Each SIOV device to have its own channel to communicate directly to the device.
>
> > One thing that you ignore is that, hypervisor can use what you invented as a
> > transport for VF, no?
> >
> No. by design,

It works like this: the hypervisor traps the virtio config accesses,
forwards them to the admin virtqueue, and starts the device via the
device context.
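
As a rough illustration of that flow, here is a minimal Python sketch. It is
a toy model only: MemberDevice, OwnerDevice, admin_write_context() and
admin_resume() are hypothetical names invented for this example, not
commands from the series or the spec. The only real constant is the
DRIVER_OK status bit.

```python
from dataclasses import dataclass, field

DRIVER_OK = 0x4  # VIRTIO_CONFIG_S_DRIVER_OK status bit


@dataclass
class MemberDevice:
    """Toy VF: stores whatever context the owner writes into it."""
    context: dict = field(default_factory=dict)
    running: bool = False


@dataclass
class OwnerDevice:
    """Toy PF with two hypothetical admin commands for one member."""
    member: MemberDevice
    log: list = field(default_factory=list)

    def admin_write_context(self, fields):
        self.log.append("write_context")
        self.member.context.update(fields)

    def admin_resume(self):
        self.log.append("resume")
        self.member.running = True


class TrappingHypervisor:
    """Traps guest config writes instead of letting them reach the VF."""

    def __init__(self, owner):
        self.owner = owner
        self.pending = {}

    def on_guest_config_write(self, name, value):
        # The write is trapped and only recorded; no VF register is touched.
        self.pending[name] = value
        if name == "device_status" and value & DRIVER_OK:
            # Guest finished setup: push everything through the owner.
            self.owner.admin_write_context(self.pending)
            self.owner.admin_resume()
            self.pending = {}


vf = MemberDevice()
pf = OwnerDevice(vf)
hv = TrappingHypervisor(pf)
hv.on_guest_config_write("queue_size", 256)
hv.on_guest_config_write("device_status", DRIVER_OK)
```

In this model the VF's own registers are never written by the guest; the
device only starts once the owner has replayed the trapped state.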

> it is not good idea to overload management commands with actual run time guest commands.
> The device context read writes are largely for incremental updates.

It doesn't matter whether it is incremental or not, because

1) the function is there
2) a hypervisor can use that function if it wants to, and the virtio
spec can't forbid that

>
> The VF driver has its own direct channel to the device via its own BAR, so there is no need to transport via the PF.
> For SIOV, it may be needed for backward compatible vPCI composition.
> It is hard to say whether that can be memory mapped on the BAR of the PF as well.
> We have seen one device supporting it outside of virtio.
> For scale, one needs to use the device's own cvq for complex configuration anyway.

That's the idea, but my point is that your current proposal overlaps with those functions.

>
> > >
> > > > > If for that is some admin commands are missing, may be one can add
> > them.
> > > >
> > > > I would then build the device context commands on top of the
> > > > transport commands/q, then it would be complete.
> > > >
> > > > > No need to step on toes of use cases as they are different...
> > > > >
> > > > > > I've shown you that
> > > > > >
> > > > > > 1) you can't easily say you can pass through all the virtio
> > > > > > facilities
> > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > >
> > > > > It is not, it is well defined in v3, v2.
> > > > > One can continue to argue and keep defining the variant and still
> > > > > call it data
> > > > path acceleration and then claim it as passthrough ...
> > > > > But I won't debate this anymore as its just non-technical aspects
> > > > > of least
> > > > interest.
> > > >
> > > > You use this terminology in the spec which is all about technical,
> > > > and you think how to define it is a matter of non-technical. This is
> > > > self-contradictory. If you fail, it probably means it's ambiguous.
> > > > Let's don't use that terminology.
> > > >
> > > What it means is described in theory of operation.
> > >
> > > > > We have technical tasks and more improved specs to update going
> > forward.
> > > >
> > > > It's a burden to do the synchronization.
> > > We have discussed this.
> > > In current proposed the member device is not bifurcated,
> >
> > It is. Part of the functions were carried via the PCI interface, some are carried
> > via owner. You end up with two drivers to drive the devices.
> >
> Nop.
> All admin work of device migration is carried out via the owner device.
> All guest triggered work is carried out using VF itself.

Guests don't (or can't) care about how the hypervisor is structured.
So we're discussing the device's point of view: a member device needs
to serve

1) requests from the transport (the guest, in your context)
2) requests from the owner
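
To make the two request sources concrete, here is a small Python sketch of a
member device that accepts requests from both paths against the same state.
It is an illustration under assumptions: the Source enum, the handle()
method and the "set", "suspend" and "read_context" operation names are
invented for this example and do not come from the series.

```python
from enum import Enum


class Source(Enum):
    TRANSPORT = "transport"  # guest driver, through the member's own interface
    OWNER = "owner"          # admin commands issued through the owner device


class MemberDevice:
    def __init__(self):
        self.state = {}
        self.suspended = False

    def handle(self, source, op, **args):
        # Both sources operate on the same device state.
        if source is Source.TRANSPORT and op == "set":
            self.state[args["key"]] = args["value"]
            return None
        if source is Source.OWNER and op == "suspend":
            self.suspended = True
            return None
        if source is Source.OWNER and op == "read_context":
            return dict(self.state)
        raise ValueError(f"unsupported {op!r} from {source.value}")


dev = MemberDevice()
dev.handle(Source.TRANSPORT, "set", key="vq0_size", value=128)
dev.handle(Source.OWNER, "suspend")
snapshot = dev.handle(Source.OWNER, "read_context")
```

The context read by the owner reflects what the transport side configured,
which is why the device has to handle both paths together.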

Thanks


>
> > Thanks
> >
> >
> > > so it implements the necessary pieces.
> > > Feature != burden.
> > >
> > > >
> > > > > Working on extension for device specific contexts to enrich it.
> > > >
> > > > Again, making the proposal to be general is much more beneficial.
> > >
> > > Yes, it is general and like any other device-type, each has their extensions.
> > > Infrastructure covers in v3.
>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-02  6:10                                                                               ` Parav Pandit
  2023-11-02 14:01                                                                                 ` Michael S. Tsirkin
@ 2023-11-06  6:35                                                                                 ` Jason Wang
  2023-11-09  6:24                                                                                   ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-06  6:35 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, November 2, 2023 9:54 AM
> >
> > On Wed, Nov 1, 2023 at 11:07 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Wednesday, November 1, 2023 6:03 AM
> > > >
> > > > On Tue, Oct 31, 2023 at 1:17 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> > > > > > Sent: Tuesday, October 31, 2023 7:07 AM
> > > > > >
> > > > > > On Mon, Oct 30, 2023 at 12:28 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Monday, October 30, 2023 9:35 AM
> > > > > > > >
> > > > > > > > > On 2023/10/26 11:50, Parav Pandit wrote:
> > > > > > > > >> From: virtio-comment@lists.oasis-open.org
> > > > > > > > > >> <virtio-comment@lists.oasis-open.org> On Behalf Of Jason
> > > > > > > > >> Wang For example, you still haven't succeeded in defining
> > > > passthrough.
> > > > > > > > > It was defined on 19th Oct in [1].
> > > > > > > > > What part is not clear to you in definition of passthrough device?
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > > > >
> > > > > > > >
> > > > > > > > Let me copy-paste it again:
> > > > > > > >
> > > > > > > > For example, assuming you are correct, you still fail to
> > > > > > > > explain
> > > > > > > >
> > > > > > > > 1) what is trapped and what's not, or what's the boundary
> > > > > > > Passthrough definition was replied few times.
> > > > > > > One of them is here,
> > > > > > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > > > I don’t know what you mean by 'explain'. What do you want to
> > > > > > > be
> > > > explained?
> > > > > > > What is trapped is listed in
> > > > > > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > > > What is not trapped is also listed in
> > > > > > > > https://lore.kernel.org/virtio-comment/PH0PR12MB5481EA6A4D0C64C5AF6D3A57DCD4A@PH0PR12MB5481.namprd12.prod.outlook.com/
> > > > > > > So what more do you want to explain in there?
> > > > > >
> > > > > > You explained that MSI-X is trapped but not the others. People
> > > > > > may know
> > > > why.
> > > > > > or what's the boundary to choose to trap or not.
> > > > > >
> > > > > If a platform can support without trapping, it can be avoided as
> > > > > well and can
> > > > be added in the future.
> > > >
> > > > Who is going to do that synchronization?
> > > Lets first bring that hypervisor sw design before discussing phantom problem
> > solving.
> > > All necessary modules will be involved in synchronization depending on how
> > its done in future.
> >
> > It's not the charge of the virtio spec to mandate any type of hypervisor design.
> > But it looks to me you want to do that.
> You always attribute something wrong to the proposal in order to disregard it, which is incorrect.

This is not my point. I'm just saying that none of the existing virtio
facilities needs to be synchronized with the development of the
hypervisor. That's good proof that it is well designed.

If the proposal is designed in a general way, without limitations,
the spec can keep working like a charm, as it has in the past.

> Virtio spec does not mandate it.
> Why?
> Because virtio is so late in the cycle of developing features, that it has to fit into the existing hypervisors design to support the feature and proven UAPIs.
>
> So like RSS, flow filters, statistics, provisioning, and more, it is adding the support for UAPIs which are already present for a while across multiple devices.
>
> So attributing it as mandating is simply wrong.
> It is addressing the existing use case.
>
> One can always build new hypervisor and demand new features from virtio.
> That is perfectly fine.
>
> Your expectation is that the device migration framework should work for an undefined hypervisor, which is just silly.

It's not silly. For example, virtio was designed before VFIO was
invented. If there's no layer violation and the spec aligns with the
PCI spec, we don't need to do any synchronization to say "we can
support VFIO now". And we have never had a feature that claims to work
under conditions X, Y, Z in the past.

>
> > > > > > >
> > > > > > > > 2) if the hypervisor is not developed with those
> > > > > > > > assumptions, things can work
> > > > > > > What to explain in #2. :)
> > > > > > > Things can expand when such hypervisor is born.
> > > > > >
> > > > > > So the point is still, to make your proposal to be useful in more use
> > cases.
> > > > > >
> > > > > When a use case arise, device context can be expanded.
> > > >
> > > > It's not device context.
> > > >
> > > I don’t see why not. It is stored in the device.
> > > Remapping part will be hypervisor specific, so it may be stored in platform
> > specific migration data.
> >
> > The point is, device context should work for all type of hypervisors.
> > You can't claim it can only work with your "passthrough" model.
> >
> Which other type do you specifically have in mind?
> The current proposal should work for:
> 1. the passthrough model
> 2. maybe the vdpa model.

Note that it's not the vdpa model as such; it's the model that can do
conditional traps for the virtio config.
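
A conditional trap can be pictured as a per-region policy, as in this short
Python sketch. It is only a sketch under assumptions: the region names
"msix" and "common_cfg", the make_access_handler() helper and the
offset/value pairs are arbitrary illustration choices, not anything defined
by the series.

```python
def make_access_handler(trapped_regions, emulate, passthrough):
    """Return a handler that emulates trapped regions and passes the rest through."""
    def handle(region, offset, value):
        if region in trapped_regions:
            return emulate(region, offset, value)
        return passthrough(region, offset, value)
    return handle


emulated, direct = [], []


def emulate(region, offset, value):
    # Hypervisor handles the access itself.
    emulated.append((region, offset, value))
    return "emulated"


def passthrough(region, offset, value):
    # Access reaches the device without hypervisor mediation.
    direct.append((region, offset, value))
    return "direct"


# Same device, two hypervisor policies: only MSI-X trapped, or
# MSI-X plus the virtio common config trapped.
passthrough_mode = make_access_handler({"msix"}, emulate, passthrough)
trapping_mode = make_access_handler({"msix", "common_cfg"}, emulate, passthrough)

first = passthrough_mode("common_cfg", 0x14, 0x4)
second = trapping_mode("common_cfg", 0x14, 0x4)
```

The device side is identical in both cases; only the hypervisor's choice of
trapped regions differs.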

I think we are reaching an agreement here: we need to make sure
the proposal works in both modes.

Then I'm fine.

> The model seems to work for passthrough and vdpa both cases to me.
>
> If something is missing for #2, either device context can be updated, or new commands can be added.
>
> > >
> > > > > No point in making things no one implements or not present in hypervisor.
> > > > > The infrastructure is extendible so spec is covered for it.
> > > >
> > > > It would be problematic if you stick to claim "passthrough" but not.
> > >
> > > I don’t know what this means. I am not debating passthrough/non-
> > passthrough.
> > > What is inside the device, will be part of device-context.
> > > What is part of the platform content, will be part of platform context.
> > > Since this is generic to all types of PCI devices, I don’t see a need to over-solve
> > it now in virtio.
> >
> > Ok, so you agree it can work even if hypervisor want to trap?
>
> Yes. I believe so, it can work.
> If something is missing, we should discuss to enhance it.

That's great.

Thanks





* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-06  6:34                                                                           ` Jason Wang
@ 2023-11-06  7:05                                                                             ` Parav Pandit
  2023-11-07  4:05                                                                               ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-06  7:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, November 6, 2023 12:05 PM
> 
> On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Thursday, November 2, 2023 9:56 AM
> > >
> > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > >
> > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > >
> > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of
> > > > > > > > > Jason Wang
> > > > > > > > >
> > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > > > > > >
> > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > Because for passthrough, the hypervisor is not
> > > > > > > > > > > > involved in dealing with VQ at
> > > > > > > > > > > all.
> > > > > > > > > > >
> > > > > > > > > > > Ok, so if I understand correctly, you are saying
> > > > > > > > > > > your design can't work for the case of PASID assignment.
> > > > > > > > > > >
> > > > > > > > > > No. PASID assignment will happen from the guest for
> > > > > > > > > > its own use and device
> > > > > > > > > migration will just work fine because device context will capture
> this.
> > > > > > > > >
> > > > > > > > > It's not about device context. We're discussing "passthrough",
> no?
> > > > > > > > >
> > > > > > > > Not sure, we are discussing same.
> > > > > > > > A member device is passthrough to the guest, dealing with
> > > > > > > > its own PASIDs and
> > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > So VQ context captured by the hypervisor, will have some
> > > > > > > > PASID attached to
> > > > > > > this VQ.
> > > > > > > > Device context will be updated.
> > > > > > > >
> > > > > > > > > You want all virtio stuff to be "passthrough", but
> > > > > > > > > assigning a PASID to a specific virtqueue in the guest must be
> trapped.
> > > > > > > > >
> > > > > > > > No. PASID assignment to a specific virtqueue in the guest
> > > > > > > > must go directly
> > > > > > > from guest to device.
> > > > > > >
> > > > > > > This works like setting CR3, you can't simply let it go from guest to
> host.
> > > > > > >
> > > > > > > Host IOMMU driver needs to know the PASID to program the IO
> > > > > > > page tables correctly.
> > > > > > >
> > > > > > This will be done by the IOMMU.
> > > > > >
> > > > > > > > When guest iommu may need to communicate anything for this
> > > > > > > > PASID, it will
> > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > >
> > > > > > > Let's say using PASID X for queue 0, this knowledge is
> > > > > > > beyond the IOMMU scope but belongs to virtio. Or please
> > > > > > > explain how it can work when it goes directly from guest to device.
> > > > > > >
> > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > >
> > > > > It has one.
> > > > >
> > > > > > For ok for theory sake it is there.
> > > > > >
> > > > > > Virtio driver will assign the PASID directly from guest driver
> > > > > > to device using a
> > > > > create_vq(pasid=X) command.
> > > > > > Same process is somehow attached the PASID by the guest OS.
> > > > > > The whole PASID range is known to the hypervisor when the
> > > > > > device is handed
> > > > > over to the guest VM.
> > > > >
> > > > > How can it know?
> > > > >
> > > > > > So PASID mapping is setup by the hypervisor IOMMU at this point.
> > > > >
> > > > > You disallow the PASID to be virtualized here. What's more, such
> > > > > a PASID passthrough has security implications.
> > > > >
> > > > No. virtio spec is not disallowing. At least for sure, this series is not the
> one.
> > > > My main point is, virtio device interface will not be the source
> > > > of hypercall to
> > > program IOMMU in the hypervisor.
> > > > It is something to be done by IOMMU side.
> > >
> > > So unless vPASID can be used by the hardware you need to trap the
> > > mapping from a PASID to a virtqueue. Then you need virtio specific
> knowledge.
> > >
> > vPASID by hardware is unlikely to be used by hw PCI EP devices at least in any
> near term future.
> > This requires either vPASID to pPASID table in device or in IOMMU.
> 
> So we are on the same page.
> 
> Claiming a method that can only work for passthrough or emulation is not good.
> We all know virtualization is passthrough + emulation.
Again, I agree, but I won't generalize it here.

> 
> >
> > > >
> > > > > Again, we are talking about different things, I've tried to show
> > > > > you that there are cases that passthrough can't work but if you
> > > > > think the only way for migration is to use passthrough in every
> > > > > case, you will
> > > probably fail.
> > > > >
> > > > I didn't say only way for migration is passthrough.
> > > > Passthrough is clearly one way.
> > > > Other ways may be possible.
> > > >
> > > > > >
> > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > There are works ongoing to make vPASID work for
> > > > > > > > > > > > > the guest like
> > > > > > > vSVA.
> > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > > > >
> > > > > > > > > > > Great, you find another limitation of "passthrough" by
> yourself.
> > > > > > > > > > >
> > > > > > > > > > No. it is not the limitation it is just the way it
> > > > > > > > > > does not need complex SVA to
> > > > > > > > > split the device for unrelated usage.
> > > > > > > > >
> > > > > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > > > > >
> > > > > > > > He he, I am not limiting, again misunderstanding or wrong
> attribution.
> > > > > > > > I explained that hypervisor for passthrough does not need SVA.
> > > > > > > > Guest can do anything it wants from the guest OS with the
> > > > > > > > member
> > > > > device.
> > > > > > >
> > > > > > > Ok, so the point stills, see above.
> > > > > >
> > > > > > I don’t think so. The guest owns its PASID space
> > > > >
> > > > > Again, vPASID to PASID can't be done hardware unless I miss some
> > > > > recent features of IOMMUs.
> > > > >
> > > > Cpu vendors have different way of doing vPASID to pPASID.
> > >
> > > At least for the current version of major IOMMU vendors, such
> > > translation (aka PASID remapping) is not implemented in the hardware
> > > so it needs to be trapped first.
> > >
> > Right. So it is really far in future, atleast few years away.
> >
> > > > It is still an early space for virtio.
> > > >
> > > > > > and directly communicates like any other device attribute.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > Each passthrough device has PASID from its own
> > > > > > > > > > > > space fully managed by the
> > > > > > > > > > > guest.
> > > > > > > > > > > > Some cpu required vPASID and SIOV is not going
> > > > > > > > > > > > this way
> > > > anymore.
> > > > > > > > > > >
> > > > > > > > > > > Then how to migrate? Invent a full set of something
> > > > > > > > > > > else through another giant series like this to
> > > > > > > > > > > migrate to the SIOV
> > > thing?
> > > > > > > > > > > That's a mess for
> > > > > > > > > sure.
> > > > > > > > > > >
> > > > > > > > > > SIOV will for sure reuse most or all parts of this
> > > > > > > > > > work, almost entirely
> > > > > as_is.
> > > > > > > > > > vPASID is cpu/platform specific things not part of the SIOV
> devices.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > If at all it is done, it will be done from the
> > > > > > > > > > > > > > guest by the driver using virtio
> > > > > > > > > > > > > interface.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Then you need to trap. Such things couldn't be
> > > > > > > > > > > > > passed through to guests
> > > > > > > > > > > directly.
> > > > > > > > > > > > >
> > > > > > > > > > > > Only PASID capability is trapped. PASID allocation
> > > > > > > > > > > > and usage is directly from
> > > > > > > > > > > guest.
> > > > > > > > > > >
> > > > > > > > > > > > How can you achieve this? Assigning a PASID to a
> > > > > > > > > > > device is completely
> > > > > > > > > > > device(virtio) specific. How can you use a general
> > > > > > > > > > > layer without the knowledge of virtio to trap that?
> > > > > > > > > > When one wants to map vPASID to pPASID a platform
> > > > > > > > > > needs to be
> > > > > > > involved.
> > > > > > > > >
> > > > > > > > > I'm not talking about how to map vPASID to pPASID, it's
> > > > > > > > > out of the scope of virtio. I'm talking about assigning
> > > > > > > > > a vPASID to a specific virtqueue or other virtio function in the
> guest.
> > > > > > > > >
> > > > > > > > That can be done in the guest. The key is guest wont know
> > > > > > > > that it is dealing
> > > > > > > with vPASID.
> > > > > > > > It will follow the same principle from your paper of
> > > > > > > > equivalency, where virtio
> > > > > > > software layer will assign PASID to VQ and communicate to device.
> > > > > > > >
> > > > > > > > Anyway, all of this just digression from current series.
> > > > > > >
> > > > > > > It's not, as you mention that only MSI-X is trapped, I give
> > > > > > > you another
> > > one.
> > > > > > >
> > > > > > PASID access from the guest to be done fully by the guest IOMMU.
> > > > > > Not by virtio devices.
> > > > > >
> > > > > > > >
> > > > > > > > > You need a virtio specific queue or capability to assign
> > > > > > > > > a PASID to a specific virtqueue, and that can't be done
> > > > > > > > > > without trapping and without virtio specific knowledge.
> > > > > > > > >
> > > > > > > > > I disagree. PASID assignment to a virtqueue in future from
> > > > > > > > guest virtio driver to
> > > > > > > device is uniform method.
> > > > > > > > Whether its PF assigning PASID to VQ of self, Or VF driver
> > > > > > > > in the guest assigning PASID to VQ.
> > > > > > > >
> > > > > > > > All same.
> > > > > > > > Only IOMMU layer hypercalls will know how to deal with
> > > > > > > > PASID assignment at
> > > > > > > platform layer to setup the domain etc table.
> > > > > > > >
> > > > > > > > And this is way beyond our device migration discussion.
> > > > > > > > By any means, if you were implying that somehow vq to
> > > > > > > > PASID assignment
> > > > > > > _may_ need trap+emulation, hence whole device migration to
> > > > > > > depend on some
> > > > > > > trap+emulation, than surely, than I do not agree to it.
> > > > > > >
> > > > > > > See above.
> > > > > > >
> > > > > > Yeah, I disagree to such implying.
> > > > > >
> > > > > > > >
> > > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD isolating the
> > > > > > > > guest process and
> > > > > > > all of that just works on efficiency and equivalence
> > > > > > > principle already for a decade now without any trap+emulation.
> > > > > > > >
> > > > > > > > > > When virtio passthrough device is in guest, it has all
> > > > > > > > > > its PASID
> > > > > accessible.
> > > > > > > > > >
> > > > > > > > > > All these is large deviation from current discussion
> > > > > > > > > > of this series, so I will keep
> > > > > > > > > it short.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Regardless it is not relevant to passthrough mode
> > > > > > > > > > > > as PASID is yet another
> > > > > > > > > > > resource.
> > > > > > > > > > > > And for some cpu if it is trapped, it is generic
> > > > > > > > > > > > layer, that does not require virtio
> > > > > > > > > > > involvement.
> > > > > > > > > > > > So virtio interface asking to trap something
> > > > > > > > > > > > because generic facility has done
> > > > > > > > > > > in not the approach.
> > > > > > > > > > >
> > > > > > > > > > > This misses the point of PASID. How to use PASID is
> > > > > > > > > > > totally device
> > > > > > > specific.
> > > > > > > > > > Sure, and how to virtualize vPASID/pPASID is platform
> > > > > > > > > > specific as single PASID
> > > > > > > > > can be used by multiple devices and process.
> > > > > > > > >
> > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > Capabilities of #2 is generic across all pci
> > > > > > > > > > > > > > devices, so it will be handled by the
> > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > ATS/PRI cap is also generic manner handled by
> > > > > > > > > > > > > > the HV and PCI
> > > > > > > device.
> > > > > > > > > > > > >
> > > > > > > > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU.
> > > > > > > > > > > > > You can simply do ATS/PRI passthrough but with
> > > > > > > > > > > > > an emulated
> > > > > vIOMMU.
> > > > > > > > > > > > And that is not the reason for virtio device to
> > > > > > > > > > > > build
> > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > passthrough member devices.
> > > > > > > > > > >
> > > > > > > > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > >
> > > > > > > > > Shouldn't it arrive at platform IOMMU first? The path
> > > > > > > > > should be PRI
> > > > > > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI ->
> > > > > > > > > -> guest
> > > IOMMU.
> > > > > > > > >
> > > > > > > > > Above sequence seems right.
> > > > > > > >
> > > > > > > > > And things will be more complicated when (v)PASID is used.
> > > > > > > > > So you can't simply let PRI go directly to the guest
> > > > > > > > > with the current
> > > > > architecture.
> > > > > > > > >
> > > > > > > > In current architecture of the pci VF, PRI does not go
> > > > > > > > directly to the
> > > guest.
> > > > > > > > (and that is not reason to trap and emulate other things).
> > > > > > >
> > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we will
> > > > > > > probably trap other things in the future like PASID assignment.
> > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > >
> > > > > It's not about the capability, it's about the whole process of
> > > > > PRI request handling. We've agreed that the PRI request needs to
> > > > > be trapped by the hypervisor and then delivered to the vIOMMU.
> > > > >
> > > >
> > > > > > Trap+emulation done in generic manner without involving virtio
> > > > > > Trap+or other
> > > > > device types.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > how can you pass
> > > > > > > > > > > through a hardware PRI request to a guest directly
> > > > > > > > > > > without trapping it
> > > > > > > then?
> > > > > > > > > > > What's more, PCIE allows the PRI to be done in a
> > > > > > > > > > > vendor
> > > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > for virtio?
> > > > > > > > > > >
> > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > Do you have a reference to the ECN that enables vendor
> > > > > > > > > > specific way of PRI? I
> > > > > > > > > would like to read it.
> > > > > > > > >
> > > > > > > > > I mean it doesn't forbid us to build a virtio specific
> > > > > > > > > interface for I/O page fault report and recovery.
> > > > > > > > >
> > > > > > > > So PRI of PCI does not allow. It is ODP kind of technique
> > > > > > > > you meant
> > > above.
> > > > > > > > Yes one can build.
> > > > > > > > Ok. unrelated to device migration, so I will park this
> > > > > > > > good discussion for
> > > > > later.
> > > > > > >
> > > > > > > That's fine.
> > > > > > >
> > > > > > > >
> > > > > > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > > > > > >
> > > > > > > > > Probably.
> > > > > > > > >
> > > > > > > > > > PRI will directly go to the guest driver, and guest
> > > > > > > > > > would interact with IOMMU
> > > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > > >
> > > > > > > > > With PASID, it can't go directly.
> > > > > > > > >
> > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > But again these PCI-SIG extensions of PASID are not
> > > > > > > > related to device
> > > > > > > > migration, so I am deferring it.
> > > > > > > >
> > > > > > > > > > For PRI in vendor specific way needs a separate
> > > > > > > > > > discussion. It is not related to
> > > > > > > > > live migration.
> > > > > > > > >
> > > > > > > > > PRI itself is not related. But the point is, you can't
> > > > > > > > > simply pass through ATS/PRI now.
> > > > > > > > >
> > > > > > > > Ah ok. the whole 4K PCI config space where ATS/PRI
> > > > > > > > capabilities are located
> > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > So?
> > > > > > > > > So do we start emulating virtio interfaces too for passthrough?
> > > > > > > > No.
> > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > Sure why not?
> > > > > > >
> > > > > > > Then let's not limit your proposal to be used by "passthrough" only?
> > > > > > One can possibly build some variant of the existing virtio
> > > > > > member device
> > > > > using same owner and member scheme.
> > > > >
> > > > > It's not about the member/owner, it's about e.g whether the
> > > > > hypervisor can trap and emulate.
> > > > >
> > > > > I've pointed out that what you invent here is actually a partial
> > > > > new transport, for example, a hypervisor can trap and use things
> > > > > like device context in PF to bypass the registers in VF. This is
> > > > > the idea of
> > > transport commands/q.
> > > > >
> > > > I will not mix transport commands which are mainly useful for
> > > > actual device
> > > operation for SIOV only for backward compatibility that too optionally.
> > > > One may still choose to have virtio common and device config in
> > > > MMIO
> > > ofcourse at lower scale.
> > > >
> > > > Anyway, mixing migration context with actual SIOV specific thing
> > > > is not correct
> > > as device context is read/write incremental values.
> > >
> > > SIOV is transport level stuff, the transport virtqueue is designed
> > > in a way that is general enough to cover it. Let's not shift concepts.
> > >
> > Such TVQ is only for backward compatible vPCI composition.
> > For ground up work such TVQ must not be done through the owner device.
> 
> That's the idea actually.
> 
> > Each SIOV device to have its own channel to communicate directly to the
> > device.
> >
> > > One thing that you ignore is that, hypervisor can use what you
> > > invented as a transport for VF, no?
> > >
> > No. by design,
> 
> It works like this: the hypervisor traps the virtio config, forwards it to the
> admin virtqueue and starts the device via the device context.
It needs more granular support than the management framework of device context.

> 
> > it is not good idea to overload management commands with actual run time
> > guest commands.
> > The device context read writes are largely for incremental updates.
> 
> It doesn't matter if it is incremental or not but
> 
It does, because you want different functionality only for the purpose of backward compatibility.
And even that applies only if the device does not offer them as a portion of the MMIO BAR.

> 1) the function is there
> 2) hypervisor can use that function if they want and virtio (spec) can't forbid
> that
> 
It is not about forbidding or supporting.
It's about which functionality to use for the management plane and which for the guest plane.
The two have different needs.

> >
> > For VF driver it has own direct channel via its own BAR to talk to the device.
> > So no need to transport via PF.
> > For SIOV for backward compat vPCI composition, it may be needed.
> > Hard to say, if that can be memory mapped as well on the BAR of the PF.
> > We have seen one device supporting it outside of the virtio.
> > For scale anyway, one needs to use the device own cvq for complex
> > configuration.
> 
> That's the idea but I meant your current proposal overlaps those functions.
> 
Not really. One can have simple virtio config space read/write access functionality in addition to what is done here.
And that is still fine. That one is proxying for the guest.
The management plane is doing more than just register proxying.

> >
> > > >
> > > > > > If for that is some admin commands are missing, may be one can
> > > > > > add
> > > them.
> > > > >
> > > > > I would then build the device context commands on top of the
> > > > > transport commands/q, then it would be complete.
> > > > >
> > > > > > No need to step on toes of use cases as they are different...
> > > > > >
> > > > > > > I've shown you that
> > > > > > >
> > > > > > > 1) you can't easily say you can pass through all the virtio
> > > > > > > facilities
> > > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > > >
> > > > > > It is not, it is well defined in v3, v2.
> > > > > > One can continue to argue and keep defining the variant and
> > > > > > still call it data
> > > > > path acceleration and then claim it as passthrough ...
> > > > > > But I won't debate this anymore as its just non-technical
> > > > > > aspects of least
> > > > > interest.
> > > > >
> > > > > You use this terminology in the spec which is all about
> > > > > technical, and you think how to define it is a matter of
> > > > > > non-technical. This is self-contradictory. If you fail, it probably means it's
> > > > > > ambiguous.
> > > > > Let's don't use that terminology.
> > > > >
> > > > What it means is described in theory of operation.
> > > >
> > > > > > We have technical tasks and more improved specs to update
> > > > > > going
> > > forward.
> > > > >
> > > > > It's a burden to do the synchronization.
> > > > We have discussed this.
> > > > In current proposed the member device is not bifurcated,
> > >
> > > It is. Part of the functions were carried via the PCI interface,
> > > > some are carried via owner. You end up with two drivers to drive the
> > > > devices.
> > >
> > Nop.
> > All admin work of device migration is carried out via the owner device.
> > All guest triggered work is carried out using VF itself.
> 
> Guests don't (or can't) care about how the hypervisor is structured.
For passthrough mode, it just cannot be structured inside the VF.

> So we're discussing the view of the device: member devices need to serve
> 
> 1) request from the transport (it's guest in your context)
> 2) request from the owner

Serving #2 (requests from the owner) on the member device itself does not work when the hypervisor does not have access to the member device.


* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-06  7:05                                                                             ` Parav Pandit
@ 2023-11-07  4:05                                                                               ` Jason Wang
  2023-11-07  7:22                                                                                 ` Michael S. Tsirkin
  2023-11-09  6:25                                                                                 ` Parav Pandit
  0 siblings, 2 replies; 341+ messages in thread
From: Jason Wang @ 2023-11-07  4:05 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 6, 2023 12:05 PM
> >
> > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > >
> > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > >
> > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > >
> > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of
> > > > > > > > > > Jason Wang
> > > > > > > > > >
> > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > Because for passthrough, the hypervisor is not
> > > > > > > > > > > > > involved in dealing with VQ at
> > > > > > > > > > > > all.
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, so if I understand correctly, you are saying
> > > > > > > > > > > > your design can't work for the case of PASID assignment.
> > > > > > > > > > > >
> > > > > > > > > > > No. PASID assignment will happen from the guest for
> > > > > > > > > > > its own use and device
> > > > > > > > > > migration will just work fine because device context will capture
> > this.
> > > > > > > > > >
> > > > > > > > > > It's not about device context. We're discussing "passthrough",
> > no?
> > > > > > > > > >
> > > > > > > > > Not sure, we are discussing same.
> > > > > > > > > A member device is passthrough to the guest, dealing with
> > > > > > > > > its own PASIDs and
> > > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > > So VQ context captured by the hypervisor, will have some
> > > > > > > > > PASID attached to
> > > > > > > > this VQ.
> > > > > > > > > Device context will be updated.
> > > > > > > > >
> > > > > > > > > > You want all virtio stuff to be "passthrough", but
> > > > > > > > > > assigning a PASID to a specific virtqueue in the guest must be
> > trapped.
> > > > > > > > > >
> > > > > > > > > No. PASID assignment to a specific virtqueue in the guest
> > > > > > > > > must go directly
> > > > > > > > from guest to device.
> > > > > > > >
> > > > > > > > This works like setting CR3, you can't simply let it go from guest to
> > host.
> > > > > > > >
> > > > > > > > Host IOMMU driver needs to know the PASID to program the IO
> > > > > > > > page tables correctly.
> > > > > > > >
> > > > > > > This will be done by the IOMMU.
> > > > > > >
> > > > > > > > > When guest iommu may need to communicate anything for this
> > > > > > > > > PASID, it will
> > > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > > >
> > > > > > > > Let's say using PASID X for queue 0, this knowledge is
> > > > > > > > beyond the IOMMU scope but belongs to virtio. Or please
> > > > > > > > explain how it can work when it goes directly from guest to device.
> > > > > > > >
> > > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > > >
> > > > > > It has one.
> > > > > >
> > > > > > > For ok for theory sake it is there.
> > > > > > >
> > > > > > > Virtio driver will assign the PASID directly from guest driver
> > > > > > > to device using a
> > > > > > create_vq(pasid=X) command.
> > > > > > > Same process is somehow attached the PASID by the guest OS.
> > > > > > > The whole PASID range is known to the hypervisor when the
> > > > > > > device is handed
> > > > > > over to the guest VM.
> > > > > >
> > > > > > How can it know?
> > > > > >
> > > > > > > So PASID mapping is setup by the hypervisor IOMMU at this point.
> > > > > >
> > > > > > You disallow the PASID to be virtualized here. What's more, such
> > > > > > a PASID passthrough has security implications.
> > > > > >
> > > > > No. virtio spec is not disallowing. At least for sure, this series is not the
> > one.
> > > > > My main point is, virtio device interface will not be the source
> > > > > of hypercall to
> > > > program IOMMU in the hypervisor.
> > > > > It is something to be done by IOMMU side.
> > > >
> > > > So unless vPASID can be used by the hardware you need to trap the
> > > > mapping from a PASID to a virtqueue. Then you need virtio specific
> > knowledge.
> > > >
> > > vPASID by hardware is unlikely to be used by hw PCI EP devices at least in any
> > near term future.
> > > This requires either vPASID to pPASID table in device or in IOMMU.
> >
> > So we are on the same page.
> >
> > Claiming a method that can only work for passthrough or emulation is not good.
> > We all know virtualization is passthrough + emulation.
> Again, I agree but I wont generalize it here.
>
> >
> > >
> > > > >
> > > > > > Again, we are talking about different things, I've tried to show
> > > > > > you that there are cases that passthrough can't work but if you
> > > > > > think the only way for migration is to use passthrough in every
> > > > > > case, you will
> > > > probably fail.
> > > > > >
> > > > > I didn't say only way for migration is passthrough.
> > > > > Passthrough is clearly one way.
> > > > > Other ways may be possible.
> > > > >
> > > > > > >
> > > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > There are works ongoing to make vPASID work for
> > > > > > > > > > > > > > the guest like
> > > > > > > > vSVA.
> > > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > > > > >
> > > > > > > > > > > > Great, you find another limitation of "passthrough" by
> > yourself.
> > > > > > > > > > > >
> > > > > > > > > > > No. it is not the limitation it is just the way it
> > > > > > > > > > > does not need complex SVA to
> > > > > > > > > > split the device for unrelated usage.
> > > > > > > > > >
> > > > > > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > > > > > >
> > > > > > > > > He he, I am not limiting, again misunderstanding or wrong
> > attribution.
> > > > > > > > > I explained that hypervisor for passthrough does not need SVA.
> > > > > > > > > Guest can do anything it wants from the guest OS with the
> > > > > > > > > member
> > > > > > device.
> > > > > > > >
> > > > > > > > Ok, so the point stills, see above.
> > > > > > >
> > > > > > > I don’t think so. The guest owns its PASID space
> > > > > >
> > > > > > Again, vPASID to PASID can't be done hardware unless I miss some
> > > > > > recent features of IOMMUs.
> > > > > >
> > > > > Cpu vendors have different way of doing vPASID to pPASID.
> > > >
> > > > At least for the current version of major IOMMU vendors, such
> > > > translation (aka PASID remapping) is not implemented in the hardware
> > > > so it needs to be trapped first.
> > > >
> > > Right. So it is really far in future, atleast few years away.
> > >
> > > > > It is still an early space for virtio.
> > > > >
> > > > > > > and directly communicates like any other device attribute.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > Each passthrough device has PASID from its own
> > > > > > > > > > > > > space fully managed by the
> > > > > > > > > > > > guest.
> > > > > > > > > > > > > Some cpu required vPASID and SIOV is not going
> > > > > > > > > > > > > this way
> > > > anmore.
> > > > > > > > > > > >
> > > > > > > > > > > > Then how to migrate? Invent a full set of something
> > > > > > > > > > > > else through another giant series like this to
> > > > > > > > > > > > migrate to the SIOV
> > > > thing?
> > > > > > > > > > > > That's a mess for
> > > > > > > > > > sure.
> > > > > > > > > > > >
> > > > > > > > > > > SIOV will for sure reuse most or all parts of this
> > > > > > > > > > > work, almost entirely
> > > > > > as_is.
> > > > > > > > > > > vPASID is cpu/platform specific things not part of the SIOV
> > devices.
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If at all it is done, it will be done from the
> > > > > > > > > > > > > > > guest by the driver using virtio
> > > > > > > > > > > > > > interface.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Then you need to trap. Such things couldn't be
> > > > > > > > > > > > > > passed through to guests
> > > > > > > > > > > > directly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > Only PASID capability is trapped. PASID allocation
> > > > > > > > > > > > > and usage is directly from
> > > > > > > > > > > > guest.
> > > > > > > > > > > >
> > > > > > > > > > > > How can you achieve this? Assigning a PAISD to a
> > > > > > > > > > > > device is completely
> > > > > > > > > > > > device(virtio) specific. How can you use a general
> > > > > > > > > > > > layer without the knowledge of virtio to trap that?
> > > > > > > > > > > When one wants to map vPASID to pPASID a platform
> > > > > > > > > > > needs to be
> > > > > > > > involved.
> > > > > > > > > >
> > > > > > > > > > I'm not talking about how to map vPASID to pPASID, it's
> > > > > > > > > > out of the scope of virtio. I'm talking about assigning
> > > > > > > > > > a vPASID to a specific virtqueue or other virtio function in the
> > guest.
> > > > > > > > > >
> > > > > > > > > That can be done in the guest. The key is guest wont know
> > > > > > > > > that it is dealing
> > > > > > > > with vPASID.
> > > > > > > > > It will follow the same principle from your paper of
> > > > > > > > > equivalency, where virtio
> > > > > > > > software layer will assign PASID to VQ and communicate to device.
> > > > > > > > >
> > > > > > > > > Anyway, all of this just digression from current series.
> > > > > > > >
> > > > > > > > It's not, as you mention that only MSI-X is trapped, I give
> > > > > > > > you another
> > > > one.
> > > > > > > >
> > > > > > > PASID access from the guest to be done fully by the guest IOMMU.
> > > > > > > Not by virtio devices.
> > > > > > >
> > > > > > > > >
> > > > > > > > > > You need a virtio specific queue or capability to assign
> > > > > > > > > > a PASID to a specific virtqueue, and that can't be done
> > > > > > > > > > without trapping and without virito specific knowledge.
> > > > > > > > > >
> > > > > > > > > I disagree. PASID assignment to a virqueue in future from
> > > > > > > > > guest virtio driver to
> > > > > > > > device is uniform method.
> > > > > > > > > Whether its PF assigning PASID to VQ of self, Or VF driver
> > > > > > > > > in the guest assigning PASID to VQ.
> > > > > > > > >
> > > > > > > > > All same.
> > > > > > > > > Only IOMMU layer hypercalls will know how to deal with
> > > > > > > > > PASID assignment at
> > > > > > > > platform layer to setup the domain etc table.
> > > > > > > > >
> > > > > > > > > And this is way beyond our device migration discussion.
> > > > > > > > > By any means, if you were implying that somehow vq to
> > > > > > > > > PASID assignment
> > > > > > > > _may_ need trap+emulation, hence whole device migration to
> > > > > > > > depend on some
> > > > > > > > trap+emulation, than surely, than I do not agree to it.
> > > > > > > >
> > > > > > > > See above.
> > > > > > > >
> > > > > > > Yeah, I disagree to such implying.
> > > > > > >
> > > > > > > > >
> > > > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD isolating the
> > > > > > > > > guest process and
> > > > > > > > all of that just works on efficiency and equivalence
> > > > > > > > principle already for a decade now without any trap+emulation.
> > > > > > > > >
> > > > > > > > > > > When virtio passthrough device is in guest, it has all
> > > > > > > > > > > its PASID
> > > > > > accessible.
> > > > > > > > > > >
> > > > > > > > > > > All these is large deviation from current discussion
> > > > > > > > > > > of this series, so I will keep
> > > > > > > > > > it short.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Regardless it is not relevant to passthrough mode
> > > > > > > > > > > > > as PASID is yet another
> > > > > > > > > > > > resource.
> > > > > > > > > > > > > And for some cpu if it is trapped, it is generic
> > > > > > > > > > > > > layer, that does not require virtio
> > > > > > > > > > > > involvement.
> > > > > > > > > > > > > So virtio interface asking to trap something
> > > > > > > > > > > > > because generic facility has done
> > > > > > > > > > > > in not the approach.
> > > > > > > > > > > >
> > > > > > > > > > > > This misses the point of PASID. How to use PASID is
> > > > > > > > > > > > totally device
> > > > > > > > specific.
> > > > > > > > > > > Sure, and how to virtualize vPASID/pPASID is platform
> > > > > > > > > > > specific as single PASID
> > > > > > > > > > can be used by multiple devices and process.
> > > > > > > > > >
> > > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > Capabilities of #2 is generic across all pci
> > > > > > > > > > > > > > > devices, so it will be handled by the
> > > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > > ATS/PRI cap is also generic manner handled by
> > > > > > > > > > > > > > > the HV and PCI
> > > > > > > > device.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU.
> > > > > > > > > > > > > > You can simply do ATS/PRI passthrough but with
> > > > > > > > > > > > > > an emulated
> > > > > > vIOMMU.
> > > > > > > > > > > > > And that is not the reason for virtio device to
> > > > > > > > > > > > > build
> > > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > > passthrough member devices.
> > > > > > > > > > > >
> > > > > > > > > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > >
> > > > > > > > > > Shouldn't it arrive at platform IOMMU first? The path
> > > > > > > > > > should be PRI
> > > > > > > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI ->
> > > > > > > > > > -> guest
> > > > IOMMU.
> > > > > > > > > >
> > > > > > > > > Above sequence seems write.
> > > > > > > > >
> > > > > > > > > > And things will be more complicated when (v)PASID is used.
> > > > > > > > > > So you can't simply let PRI go directly to the guest
> > > > > > > > > > with the current
> > > > > > architecture.
> > > > > > > > > >
> > > > > > > > > In current architecture of the pci VF, PRI does not go
> > > > > > > > > directly to the
> > > > guest.
> > > > > > > > > (and that is not reason to trap and emulate other things).
> > > > > > > >
> > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we will
> > > > > > > > probably trap other things in the future like PASID assignment.
> > > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > > >
> > > > > > It's not about the capability, it's about the whole process of
> > > > > > PRI request handling. We've agreed that the PRI request needs to
> > > > > > be trapped by the hypervisor and then delivered to the vIOMMU.
> > > > > >
> > > > >
> > > > > > > Trap+emulation done in generic manner without involving virtio
> > > > > > > Trap+or other
> > > > > > device types.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > how can you pass
> > > > > > > > > > > > through a hardware PRI request to a guest directly
> > > > > > > > > > > > without trapping it
> > > > > > > > then?
> > > > > > > > > > > > What's more, PCIE allows the PRI to be done in a
> > > > > > > > > > > > vendor
> > > > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > > for virtio?
> > > > > > > > > > > >
> > > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > > Do you have a reference to the ECN that enables vendor
> > > > > > > > > > > specific way of PRI? I
> > > > > > > > > > would like to read it.
> > > > > > > > > >
> > > > > > > > > > I mean it doesn't forbid us to build a virtio specific
> > > > > > > > > > interface for I/O page fault report and recovery.
> > > > > > > > > >
> > > > > > > > > So PRI of PCI does not allow. It is ODP kind of technique
> > > > > > > > > you meant
> > > > above.
> > > > > > > > > Yes one can build.
> > > > > > > > > Ok. unrelated to device migration, so I will park this
> > > > > > > > > good discussion for
> > > > > > later.
> > > > > > > >
> > > > > > > > That's fine.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > > > > > > >
> > > > > > > > > > Probably.
> > > > > > > > > >
> > > > > > > > > > > PRI will directly go to the guest driver, and guest
> > > > > > > > > > > would interact with IOMMU
> > > > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > > > >
> > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > >
> > > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > > But again these PCI-SIG extensions of PASID are not
> > > > > > > > > related to device
> > > > > > > > migration, so I am differing it.
> > > > > > > > >
> > > > > > > > > > > For PRI in vendor specific way needs a separate
> > > > > > > > > > > discussion. It is not related to
> > > > > > > > > > live migration.
> > > > > > > > > >
> > > > > > > > > > PRI itself is not related. But the point is, you can't
> > > > > > > > > > simply pass through ATS/PRI now.
> > > > > > > > > >
> > > > > > > > > Ah ok. the whole 4K PCI config space where ATS/PRI
> > > > > > > > > capabilities are located
> > > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > > So?
> > > > > > > > > So do we start emulating virito interfaces too for passthrough?
> > > > > > > > > No.
> > > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > > Sure why not?
> > > > > > > >
> > > > > > > > Then let's not limit your proposal to be used by "passthrough" only?
> > > > > > > One can possibly build some variant of the existing virtio
> > > > > > > member device
> > > > > > using same owner and member scheme.
> > > > > >
> > > > > > It's not about the member/owner, it's about e.g whether the
> > > > > > hypervisor can trap and emulate.
> > > > > >
> > > > > > I've pointed out that what you invent here is actually a partial
> > > > > > new transport, for example, a hypervisor can trap and use things
> > > > > > like device context in PF to bypass the registers in VF. This is
> > > > > > the idea of
> > > > transport commands/q.
> > > > > >
> > > > > I will not mix transport commands which are mainly useful for
> > > > > actual device
> > > > operation for SIOV only for backward compatibility that too optionally.
> > > > > One may still choose to have virtio common and device config in
> > > > > MMIO
> > > > ofcourse at lower scale.
> > > > >
> > > > > Anyway, mixing migration context with actual SIOV specific thing
> > > > > is not correct
> > > > as device context is read/write incremental values.
> > > >
> > > > SIOV is transport level stuff, the transport virtqueue is designed
> > > > in a way that is general enough to cover it. Let's not shift concepts.
> > > >
> > > Such TVQ is only for backward compatible vPCI composition.
> > > For ground up work such TVQ must not be done through the owner device.
> >
> > That's the idea actually.
> >
> > > Each SIOV device to have its own channel to communicate directly to the
> > device.
> > >
> > > > One thing that you ignore is that, hypervisor can use what you
> > > > invented as a transport for VF, no?
> > > >
> > > No. by design,
> >
> > It works like this: the hypervisor traps the virtio config, forwards it to the
> > admin virtqueue and starts the device via the device context.
> It needs more granular support than the management framework of device context.

It doesn't; otherwise it is a design defect, as you can't recover the
device context at the destination.

Let me give you an example:

1) in the case of live migration, the destination receives the migration
byte stream and converts it into device context
2) in the case of transporting, the hypervisor traps the virtio config
accesses and converts them into device context

I don't see anything different in this case. Or can you give me an example?
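
To make the comparison concrete, here is a minimal sketch in C of the two
cases above. All names (ctx_rec, ctx_write, ctx_from_stream, ctx_from_trap)
and the record layout are hypothetical and are not taken from the series. It
only illustrates the claim that both paths can end in the same device context
write; whether the device context is granular enough for that is exactly what
is being disputed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device-context record; NOT the layout defined by the series. */
enum ctx_field { CTX_QUEUE_SIZE = 1, CTX_QUEUE_ENABLE = 2 };

struct ctx_rec {
    uint16_t field;   /* which piece of device state this record carries */
    uint16_t vq;      /* virtqueue index the record applies to */
    uint32_t value;
};

#define CTX_MAX 16

struct dev_ctx {
    struct ctx_rec rec[CTX_MAX];
    size_t n;
};

/* Single sink that both paths feed; stands in for a device context write. */
static int ctx_write(struct dev_ctx *c, struct ctx_rec r)
{
    if (c->n >= CTX_MAX)
        return -1;
    c->rec[c->n++] = r;
    return 0;
}

/* Case 1: the migration destination decodes a received byte stream. */
static int ctx_from_stream(struct dev_ctx *c, const uint8_t *buf, size_t len)
{
    assert(len % sizeof(struct ctx_rec) == 0);
    for (size_t off = 0; off < len; off += sizeof(struct ctx_rec)) {
        struct ctx_rec r;
        memcpy(&r, buf + off, sizeof(r));
        if (ctx_write(c, r))
            return -1;
    }
    return 0;
}

/* Case 2: the hypervisor traps one guest config write and translates it. */
static int ctx_from_trap(struct dev_ctx *c, uint16_t field, uint16_t vq,
                         uint32_t value)
{
    struct ctx_rec r = { .field = field, .vq = vq, .value = value };
    return ctx_write(c, r);
}
```

In this sketch the device only ever sees ctx_write() and cannot tell which of
the two callers produced a record.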

>
> >
> > > it is not good idea to overload management commands with actual run time
> > guest commands.
> > > The device context read writes are largely for incremental updates.
> >
> > It doesn't matter if it is incremental or not but
> >
> It does, because you want different functionality only for the purpose of backward compatibility.
> And even that applies only if the device does not offer them as a portion of the MMIO BAR.

I don't see how it is related to the "incremental part".

>
> > 1) the function is there
> > 2) hypervisor can use that function if they want and virtio (spec) can't forbid
> > that
> >
> It is not about forbidding or supporting.
> It's about which functionality to use for the management plane and which for the guest plane.
> The two have different needs.

People can have different views, but as far as I can see there is nothing
we can do to prevent a hypervisor from using it as a transport.

>
> > >
> > > For VF driver it has own direct channel via its own BAR to talk to the device.
> > So no need to transport via PF.
> > > For SIOV for backward compat vPCI composition, it may be needed.
> > > Hard to say, if that can be memory mapped as well on the BAR of the PF.
> > > We have seen one device supporting it outside of the virtio.
> > > For scale anyway, one needs to use the device own cvq for complex
> > configuration.
> >
> > That's the idea but I meant your current proposal overlaps those functions.
> >
> Not really. One can have simple virtio config space read/write access functionality in addition to what is done here.
> And that is still fine. That one is proxying for the guest.
> The management plane is doing more than just register proxying.

See above, let's figure out whether it is possible as a transport first then.

>
> > >
> > > > >
> > > > > > > If for that is some admin commands are missing, may be one can
> > > > > > > add
> > > > them.
> > > > > >
> > > > > > I would then build the device context commands on top of the
> > > > > > transport commands/q, then it would be complete.
> > > > > >
> > > > > > > No need to step on toes of use cases as they are different...
> > > > > > >
> > > > > > > > I've shown you that
> > > > > > > >
> > > > > > > > 1) you can't easily say you can pass through all the virtio
> > > > > > > > facilities
> > > > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > > > >
> > > > > > > It is not, it is well defined in v3, v2.
> > > > > > > One can continue to argue and keep defining the variant and
> > > > > > > still call it data
> > > > > > path acceleration and then claim it as passthrough ...
> > > > > > > But I won't debate this anymore as its just non-technical
> > > > > > > aspects of least
> > > > > > interest.
> > > > > >
> > > > > > You use this terminology in the spec which is all about
> > > > > > technical, and you think how to define it is a matter of
> > > > > > non-technical. This is self-contradictory. If you fail, it probably means it's
> > ambiguous.
> > > > > > Let's don't use that terminology.
> > > > > >
> > > > > What it means is described in theory of operation.
> > > > >
> > > > > > > We have technical tasks and more improved specs to update
> > > > > > > going
> > > > forward.
> > > > > >
> > > > > > It's a burden to do the synchronization.
> > > > > We have discussed this.
> > > > > In current proposed the member device is not bifurcated,
> > > >
> > > > It is. Part of the functions were carried via the PCI interface,
> > > > some are carried via owner. You end up with two drivers to drive the
> > devices.
> > > >
> > > Nop.
> > > All admin work of device migration is carried out via the owner device.
> > > All guest triggered work is carried out using VF itself.
> >
> > Guests don't (or can't) care about how the hypervisor is structured.
> For passthrough mode, it just cannot be structured inside the VF.

Well, again, we are talking about different things.

>
> > So we're discussing the view of the device: member devices need to serve
> >
> > 1) request from the transport (it's guest in your context)
> > 2) request from the owner
>
> Doing #2 of the owner on the member device functionality do not work when hypervisor do not have access to the member device.

I don't get it. Isn't 2) exactly what we invented admin commands for?
The driver sends commands to the owner, and the owner forwards those
requests to the member?
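
As a rough illustration of that flow (not taken from the patch series),
here is a minimal C sketch in which a driver hands a command to the owner
and the owner routes it to a member by its group member id. The struct
layouts, NUM_MEMBERS, the owner_submit() helper and the 1-based member
numbering are all assumptions made for the example; a real admin command
carries more fields than shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_MEMBERS 4 /* hypothetical number of member devices (VFs) */

/* Simplified, hypothetical admin command: only the routing fields. */
struct admin_cmd {
    uint16_t opcode;          /* what the member is asked to do */
    uint16_t group_type;      /* which kind of group the member belongs to */
    uint64_t group_member_id; /* assumed 1-based member number */
};

/* Hypothetical member device state, just enough to observe delivery. */
struct member_dev {
    uint16_t last_opcode; /* opcode of the most recent forwarded command */
    unsigned handled;     /* number of commands this member received */
};

/* Hypothetical owner device that fronts all members of the group. */
struct owner_dev {
    struct member_dev members[NUM_MEMBERS];
};

/*
 * The driver submits the command to the owner; the owner forwards it to
 * the addressed member. Returns 0 on success, -1 for an invalid member id.
 */
static int owner_submit(struct owner_dev *o, const struct admin_cmd *cmd)
{
    assert(o != NULL && cmd != NULL);

    if (cmd->group_member_id == 0 || cmd->group_member_id > NUM_MEMBERS)
        return -1;

    struct member_dev *m = &o->members[cmd->group_member_id - 1];
    m->last_opcode = cmd->opcode;
    m->handled++;
    return 0;
}
```

The point of the sketch is only that the member never sees the driver
directly: every request in 2) enters through the owner, which picks the
target member from the command itself.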

Thanks


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-07  4:05                                                                               ` Jason Wang
@ 2023-11-07  7:22                                                                                 ` Michael S. Tsirkin
  2023-11-07  7:57                                                                                   ` Zhu, Lingshan
  2023-11-08  4:28                                                                                   ` Jason Wang
  2023-11-09  6:25                                                                                 ` Parav Pandit
  1 sibling, 2 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07  7:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Tue, Nov 07, 2023 at 12:05:12PM +0800, Jason Wang wrote:
> > > > > One thing that you ignore is that, hypervisor can use what you
> > > > > invented as a transport for VF, no?
> > > > >
> > > > No. by design,
> > >
> > > It works like hypervisor traps the virito config and forwards it to admin
> > > virtqueue and starts the device via device context.
> > It needs more granular support than the management framework of device context.
> 
> It doesn't otherwise it is a design defect as you can't recover the
> device context in the destination.
> 
> Let me give you an example:
> 
> 1) in the case of live migration, dst receive migration byte flows and
> convert them into device context
> 2) in the case of transporting, hypervisor traps virtio config and
> convert them into the device context
> 
> I don't see anything different in this case. Or can you give me an example?

"trap virtio config" means "trap writes into virtio config" presumably?
Config can change by itself without the driver doing anything. The hypervisor
can't trap it then. This is one of the problems with Lingshan's
SUSPEND bit - what happens with these config changes is underspecified.

-- 
MST


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-07  7:22                                                                                 ` Michael S. Tsirkin
@ 2023-11-07  7:57                                                                                   ` Zhu, Lingshan
  2023-11-07  8:05                                                                                     ` Michael S. Tsirkin
  2023-11-08  4:28                                                                                   ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  7:57 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



On 11/7/2023 3:22 PM, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 12:05:12PM +0800, Jason Wang wrote:
>>>>>> One thing that you ignore is that, hypervisor can use what you
>>>>>> invented as a transport for VF, no?
>>>>>>
>>>>> No. by design,
>>>> It works like hypervisor traps the virito config and forwards it to admin
>>>> virtqueue and starts the device via device context.
>>> It needs more granular support than the management framework of device context.
>> It doesn't otherwise it is a design defect as you can't recover the
>> device context in the destination.
>>
>> Let me give you an example:
>>
>> 1) in the case of live migration, dst receive migration byte flows and
>> convert them into device context
>> 2) in the case of transporting, hypervisor traps virtio config and
>> convert them into the device context
>>
>> I don't see anything different in this case. Or can you give me an example?
> "trap virtio config" means "trap writes into virtio config" presumably?
> config can change itself without driver doing anything. Hypervisor
> can't trap it then. This is one of the problems with Lingshan's
> SUSPEND bit - what happens with these config changes in underspecified.
It should send a config interrupt and increase its generation.
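
For readers less familiar with the generation mechanism, the usual
pattern is that whoever reads the config samples the generation before
and after the read and retries if it changed. The C sketch below models
that against a fake in-memory device; struct fake_dev, its
pending_changes hook and the helper names are hypothetical and exist
only to make the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a device whose config can change on its own. */
struct fake_dev {
    uint8_t generation;   /* bumped by the device on each config change */
    uint8_t config[8];    /* device-specific config bytes */
    int pending_changes;  /* test hook: config changes still to inject */
};

/* Read one byte; the fake device may change its config mid-read. */
static uint8_t dev_read_config_byte(struct fake_dev *d, size_t off)
{
    uint8_t v = d->config[off];

    if (d->pending_changes > 0 && off == 3) {
        d->pending_changes--;
        memset(d->config, 0xBB, sizeof(d->config));
        d->generation++;
    }
    return v;
}

static uint8_t dev_read_generation(const struct fake_dev *d)
{
    return d->generation;
}

/*
 * Copy len config bytes into out, retrying until the generation is the
 * same before and after the copy. Returns the number of attempts made.
 */
static int read_config_consistent(struct fake_dev *d, uint8_t *out, size_t len)
{
    uint8_t before, after;
    int attempts = 0;

    assert(len <= sizeof(d->config));
    do {
        attempts++;
        before = dev_read_generation(d);
        for (size_t i = 0; i < len; i++)
            out[i] = dev_read_config_byte(d, i);
        after = dev_read_generation(d);
    } while (before != after);

    return attempts;
}
```

This only covers the reader's side of the question. It does not say what
a suspended device should do with a change that arrives while it is
suspended, which is the part being debated in this thread.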

By the way, is it a bug if a device suspends itself?
Is NEEDS_RESET better, even if it is not properly handled by the virtio driver?
>


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-07  7:57                                                                                   ` Zhu, Lingshan
@ 2023-11-07  8:05                                                                                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07  8:05 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Jason Wang, Parav Pandit, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Tue, Nov 07, 2023 at 03:57:29PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/7/2023 3:22 PM, Michael S. Tsirkin wrote:
> > On Tue, Nov 07, 2023 at 12:05:12PM +0800, Jason Wang wrote:
> > > > > > > One thing that you ignore is that, hypervisor can use what you
> > > > > > > invented as a transport for VF, no?
> > > > > > > 
> > > > > > No. by design,
> > > > > It works like hypervisor traps the virito config and forwards it to admin
> > > > > virtqueue and starts the device via device context.
> > > > It needs more granular support than the management framework of device context.
> > > It doesn't otherwise it is a design defect as you can't recover the
> > > device context in the destination.
> > > 
> > > Let me give you an example:
> > > 
> > > 1) in the case of live migration, dst receive migration byte flows and
> > > convert them into device context
> > > 2) in the case of transporting, hypervisor traps virtio config and
> > > convert them into the device context
> > > 
> > > I don't see anything different in this case. Or can you give me an example?
> > "trap virtio config" means "trap writes into virtio config" presumably?
> > config can change itself without driver doing anything. Hypervisor
> > can't trap it then. This is one of the problems with Lingshan's
> > SUSPEND bit - what happens with these config changes in underspecified.
> It should send a config interrupt and increase its generation.

/me shrugs

> By the way, is it buggy if a device suspend itself?
> Is needs_reset better even not properly handled by virito driver

NEEDS_RESET turned out not to be a great design.

-- 
MST


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-07  7:22                                                                                 ` Michael S. Tsirkin
  2023-11-07  7:57                                                                                   ` Zhu, Lingshan
@ 2023-11-08  4:28                                                                                   ` Jason Wang
  1 sibling, 0 replies; 341+ messages in thread
From: Jason Wang @ 2023-11-08  4:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Tue, Nov 7, 2023 at 3:22 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Nov 07, 2023 at 12:05:12PM +0800, Jason Wang wrote:
> > > > > > One thing that you ignore is that, hypervisor can use what you
> > > > > > invented as a transport for VF, no?
> > > > > >
> > > > > No. by design,
> > > >
> > > > It works like hypervisor traps the virito config and forwards it to admin
> > > > virtqueue and starts the device via device context.
> > > It needs more granular support than the management framework of device context.
> >
> > It doesn't otherwise it is a design defect as you can't recover the
> > device context in the destination.
> >
> > Let me give you an example:
> >
> > 1) in the case of live migration, dst receive migration byte flows and
> > convert them into device context
> > 2) in the case of transporting, hypervisor traps virtio config and
> > convert them into the device context
> >
> > I don't see anything different in this case. Or can you give me an example?
>
> "trap virtio config" means "trap writes into virtio config" presumably?

It means trapping the way QEMU currently does.

> config can change itself without driver doing anything. Hypervisor
> can't trap it then.

I don't understand this; the hypervisor can see the config interrupt.
Isn't this how QEMU works now?

> This is one of the problems with Lingshan's
> SUSPEND bit - what happens with these config changes in underspecified.

I don't see a direct relationship with the SUSPEND bit here. I meant that
this proposal can be used by the hypervisor to trap and emulate config.

Thanks


>
> --
> MST
>


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  2023-11-06  6:35                                                                                 ` Jason Wang
@ 2023-11-09  6:24                                                                                   ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-09  6:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, November 6, 2023 12:05 PM

[..]
> > > It's not the charge of the virtio spec to mandate any type of
> > > hypervisor design.
> > > But it looks to me you want to do that.
> > You always attribute is wrong to disregard the proposal which is incorrect.
> 
> This is not my point. I'm just saying, I never say any virtio existing facilities need
> to be synchronized with the development of the hypervisor. That's great proof
> that it is well designed.
> 
> If the proposal is designed in a general method without limitations, the spec can
> keep working like a charm in the past.
> 

The fact is that at least three hypervisors exist that have defined a live
migration framework, and virtio is adapting to them.

Defining something very generic without a known interface is a mostly
theoretical discussion.

> > Virtio spec does not mandate it.
> > Why?
> > Because virtio is so late in the cycle of developing features, that it has
> > to fit into the existing hypervisors design to support the feature and
> > proven UAPIs.
> >
> > So like RSS, flow filters, statistics, provisioning, and more, it is adding
> > the support for UAPIs which are already present for a while across multiple
> > devices.
> >
> > So attributing it as mandating is simply wrong.
> > It is addressing the existing use case.
> >
> > One can always build new hypervisor and demand new features from virtio.
> > That is perfectly fine.
> >
> > Your expectation is that device migration framework to work for an
> > undefined hypervisor, which is just silly.
> 
> It's not silly, for example virtio was designed before VFIO was invented. If there's
> no layer violation and the spec aligns with PCI spec, we don't need to do any
> synchronization to say "we can support VFIO now".  And we never have a
> feature that claims to work under condition X,Y,Z in the past.
> 
By that theory, device migration should have worked without any of the patches we are doing now. :)

> >
> > > > > > > >
> > > > > > > > > 2) if the hypervisor is not developed with those
> > > > > > > > > assumptions, things can work
> > > > > > > > What to explain in #2. :)
> > > > > > > > Things can expand when such hypervisor is born.
> > > > > > >
> > > > > > > So the point is still, to make your proposal to be useful in
> > > > > > > more use cases.
> > > > > > >
> > > > > > When a use case arise, device context can be expanded.
> > > > >
> > > > > It's not device context.
> > > > >
> > > > I don’t see why not. It is stored in the device.
> > > > Remapping part will be hypervisor specific, so it may be stored in
> > > > platform specific migration data.
> > >
> > > The point is, device context should work for all type of hypervisors.
> > > You can't claim it can only work with your "passthrough" model.
> > >
> > Which other type you specifically have in mind?
> > The current proposal should work for:
> > 1. passthrough model
> > 2. may be for vdpa model.
> 
> Note that, it's not the vdpa model, it's the model that can do conditional traps
> for virtio config.
> 
> I think we are somehow making an agreement here, we need to make sure the
> proposal works in both modes.
> 
> Then I'm fine.
> 
Ok. So let's extend the admin commands to fit both models.
I did that for #1; you should see which of those are useful for #2, or whether it needs new commands or command extensions.

> > The model seems to work for passthrough and vdpa both cases to me.
> >
> > If something is missing for #2, either device context can be updated, or
> > new commands can be added.
> >
> > > >
> > > > > > No point in making things no one implements or not present in
> > > > > > hypervisor.
> > > > > > The infrastructure is extendible so spec is covered for it.
> > > > >
> > > > > It would be problematic if you stick to claim "passthrough" but not.
> > > >
> > > > I don’t know what this means. I am not debating
> > > > passthrough/non-passthrough.
> > > > What is inside the device, will be part of device-context.
> > > > What is part of the platform content, will be part of platform context.
> > > > Since this is generic to all types of PCI devices, I don’t see a
> > > > need to over-solve it now in virtio.
> > >
> > > Ok, so you agree it can work even if hypervisor want to trap?
> >
> > Yes. I believe so, it can work.
> > If something is missing, we should discuss to enhance it.
> 
> That's great.
> 
> Thanks


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-07  4:05                                                                               ` Jason Wang
  2023-11-07  7:22                                                                                 ` Michael S. Tsirkin
@ 2023-11-09  6:25                                                                                 ` Parav Pandit
  2023-11-13  3:32                                                                                   ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-09  6:25 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 7, 2023 9:35 AM
> 
> On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 6, 2023 12:05 PM
> > >
> > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > > >
> > > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > > >
> > > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > > >
> > > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of
> > > > > > > > > > > Jason Wang
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > > Because for passthrough, the hypervisor is not
> > > > > > > > > > > > > > involved in dealing with VQ at
> > > > > > > > > > > > > all.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ok, so if I understand correctly, you are saying
> > > > > > > > > > > > > your design can't work for the case of PASID assignment.
> > > > > > > > > > > > >
> > > > > > > > > > > > No. PASID assignment will happen from the guest
> > > > > > > > > > > > for its own use and device
> > > > > > > > > > > migration will just work fine because device context
> > > > > > > > > > > will capture
> > > this.
> > > > > > > > > > >
> > > > > > > > > > > It's not about device context. We're discussing
> > > > > > > > > > > "passthrough",
> > > no?
> > > > > > > > > > >
> > > > > > > > > > Not sure, we are discussing same.
> > > > > > > > > > A member device is passthrough to the guest, dealing
> > > > > > > > > > with its own PASIDs and
> > > > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > > > So VQ context captured by the hypervisor, will have
> > > > > > > > > > some PASID attached to
> > > > > > > > > this VQ.
> > > > > > > > > > Device context will be updated.
> > > > > > > > > >
> > > > > > > > > > > You want all virtio stuff to be "passthrough", but
> > > > > > > > > > > assigning a PASID to a specific virtqueue in the
> > > > > > > > > > > guest must be
> > > trapped.
> > > > > > > > > > >
> > > > > > > > > > No. PASID assignment to a specific virtqueue in the
> > > > > > > > > > guest must go directly
> > > > > > > > > from guest to device.
> > > > > > > > >
> > > > > > > > > This works like setting CR3, you can't simply let it go
> > > > > > > > > from guest to
> > > host.
> > > > > > > > >
> > > > > > > > > Host IOMMU driver needs to know the PASID to program the
> > > > > > > > > IO page tables correctly.
> > > > > > > > >
> > > > > > > > This will be done by the IOMMU.
> > > > > > > >
> > > > > > > > > > When guest iommu may need to communicate anything for
> > > > > > > > > > this PASID, it will
> > > > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > > > >
> > > > > > > > > Let's say using PASID X for queue 0, this knowledge is
> > > > > > > > > beyond the IOMMU scope but belongs to virtio. Or please
> > > > > > > > > explain how it can work when it goes directly from guest to
> device.
> > > > > > > > >
> > > > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > > > >
> > > > > > > It has one.
> > > > > > >
> > > > > > > > For ok for theory sake it is there.
> > > > > > > >
> > > > > > > > Virtio driver will assign the PASID directly from guest
> > > > > > > > driver to device using a
> > > > > > > create_vq(pasid=X) command.
> > > > > > > > Same process is somehow attached the PASID by the guest OS.
> > > > > > > > The whole PASID range is known to the hypervisor when the
> > > > > > > > device is handed
> > > > > > > over to the guest VM.
> > > > > > >
> > > > > > > How can it know?
> > > > > > >
> > > > > > > > So PASID mapping is setup by the hypervisor IOMMU at this point.
> > > > > > >
> > > > > > > You disallow the PASID to be virtualized here. What's more,
> > > > > > > such a PASID passthrough has security implications.
> > > > > > >
> > > > > > No. virtio spec is not disallowing. At least for sure, this
> > > > > > series is not the
> > > one.
> > > > > > My main point is, virtio device interface will not be the
> > > > > > source of hypercall to
> > > > > program IOMMU in the hypervisor.
> > > > > > It is something to be done by IOMMU side.
> > > > >
> > > > > So unless vPASID can be used by the hardware you need to trap
> > > > > the mapping from a PASID to a virtqueue. Then you need virtio
> > > > > specific
> > > knowledge.
> > > > >
> > > > vPASID by hardware is unlikely to be used by hw PCI EP devices at
> > > > least in any
> > > near term future.
> > > > This requires either vPASID to pPASID table in device or in IOMMU.
> > >
> > > So we are on the same page.
> > >
> > > Claiming a method that can only work for passthrough or emulation is not
> good.
> > > We all know virtualization is passthrough + emulation.
> > Again, I agree but I wont generalize it here.
> >
> > >
> > > >
> > > > > >
> > > > > > > Again, we are talking about different things, I've tried to
> > > > > > > show you that there are cases that passthrough can't work
> > > > > > > but if you think the only way for migration is to use
> > > > > > > passthrough in every case, you will
> > > > > probably fail.
> > > > > > >
> > > > > > I didn't say only way for migration is passthrough.
> > > > > > Passthrough is clearly one way.
> > > > > > Other ways may be possible.
> > > > > >
> > > > > > > >
> > > > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There are works ongoing to make vPASID work
> > > > > > > > > > > > > > > for the guest like
> > > > > > > > > vSVA.
> > > > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Great, you find another limitation of
> > > > > > > > > > > > > "passthrough" by
> > > yourself.
> > > > > > > > > > > > >
> > > > > > > > > > > > No. it is not the limitation it is just the way it
> > > > > > > > > > > > does not need complex SVA to
> > > > > > > > > > > split the device for unrelated usage.
> > > > > > > > > > >
> > > > > > > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > > > > > > >
> > > > > > > > > > He he, I am not limiting, again misunderstanding or
> > > > > > > > > > wrong
> > > attribution.
> > > > > > > > > > I explained that hypervisor for passthrough does not need SVA.
> > > > > > > > > > Guest can do anything it wants from the guest OS with
> > > > > > > > > > the member
> > > > > > > device.
> > > > > > > > >
> > > > > > > > > Ok, so the point stills, see above.
> > > > > > > >
> > > > > > > > I don’t think so. The guest owns its PASID space
> > > > > > >
> > > > > > > Again, vPASID to PASID can't be done hardware unless I miss
> > > > > > > some recent features of IOMMUs.
> > > > > > >
> > > > > > Cpu vendors have different way of doing vPASID to pPASID.
> > > > >
> > > > > At least for the current version of major IOMMU vendors, such
> > > > > translation (aka PASID remapping) is not implemented in the
> > > > > hardware so it needs to be trapped first.
> > > > >
> > > > Right. So it is really far in future, atleast few years away.
> > > >
> > > > > > It is still an early space for virtio.
> > > > > >
> > > > > > > > and directly communicates like any other device attribute.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > Each passthrough device has PASID from its own
> > > > > > > > > > > > > > space fully managed by the
> > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > Some cpu required vPASID and SIOV is not going
> > > > > > > > > > > > > > this way
> > > > > anmore.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Then how to migrate? Invent a full set of
> > > > > > > > > > > > > something else through another giant series like
> > > > > > > > > > > > > this to migrate to the SIOV
> > > > > thing?
> > > > > > > > > > > > > That's a mess for
> > > > > > > > > > > sure.
> > > > > > > > > > > > >
> > > > > > > > > > > > SIOV will for sure reuse most or all parts of this
> > > > > > > > > > > > work, almost entirely
> > > > > > > as_is.
> > > > > > > > > > > > vPASID is cpu/platform specific things not part of
> > > > > > > > > > > > the SIOV
> > > devices.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If at all it is done, it will be done from
> > > > > > > > > > > > > > > > the guest by the driver using virtio
> > > > > > > > > > > > > > > interface.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Then you need to trap. Such things couldn't
> > > > > > > > > > > > > > > be passed through to guests
> > > > > > > > > > > > > directly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > Only PASID capability is trapped. PASID
> > > > > > > > > > > > > > allocation and usage is directly from
> > > > > > > > > > > > > guest.
> > > > > > > > > > > > >
> > > > > > > > > > > > > How can you achieve this? Assigning a PAISD to a
> > > > > > > > > > > > > device is completely
> > > > > > > > > > > > > device(virtio) specific. How can you use a
> > > > > > > > > > > > > general layer without the knowledge of virtio to trap that?
> > > > > > > > > > > > When one wants to map vPASID to pPASID a platform
> > > > > > > > > > > > needs to be
> > > > > > > > > involved.
> > > > > > > > > > >
> > > > > > > > > > > I'm not talking about how to map vPASID to pPASID,
> > > > > > > > > > > it's out of the scope of virtio. I'm talking about
> > > > > > > > > > > assigning a vPASID to a specific virtqueue or other
> > > > > > > > > > > virtio function in the
> > > guest.
> > > > > > > > > > >
> > > > > > > > > > That can be done in the guest. The key is guest wont
> > > > > > > > > > know that it is dealing
> > > > > > > > > with vPASID.
> > > > > > > > > > It will follow the same principle from your paper of
> > > > > > > > > > equivalency, where virtio
> > > > > > > > > software layer will assign PASID to VQ and communicate to
> device.
> > > > > > > > > >
> > > > > > > > > > Anyway, all of this just digression from current series.
> > > > > > > > >
> > > > > > > > > It's not, as you mention that only MSI-X is trapped, I
> > > > > > > > > give you another
> > > > > one.
> > > > > > > > >
> > > > > > > > PASID access from the guest to be done fully by the guest IOMMU.
> > > > > > > > Not by virtio devices.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > You need a virtio specific queue or capability to
> > > > > > > > > > > assign a PASID to a specific virtqueue, and that
> > > > > > > > > > > can't be done without trapping and without virito specific
> knowledge.
> > > > > > > > > > >
> > > > > > > > > > I disagree. PASID assignment to a virqueue in future
> > > > > > > > > > from guest virtio driver to
> > > > > > > > > device is uniform method.
> > > > > > > > > > Whether its PF assigning PASID to VQ of self, Or VF
> > > > > > > > > > driver in the guest assigning PASID to VQ.
> > > > > > > > > >
> > > > > > > > > > All same.
> > > > > > > > > > Only IOMMU layer hypercalls will know how to deal with
> > > > > > > > > > PASID assignment at
> > > > > > > > > platform layer to setup the domain etc table.
> > > > > > > > > >
> > > > > > > > > > And this is way beyond our device migration discussion.
> > > > > > > > > > By any means, if you were implying that somehow vq to
> > > > > > > > > > PASID assignment
> > > > > > > > > _may_ need trap+emulation, hence whole device migration
> > > > > > > > > to depend on some
> > > > > > > > > trap+emulation, than surely, than I do not agree to it.
> > > > > > > > >
> > > > > > > > > See above.
> > > > > > > > >
> > > > > > > > Yeah, I disagree to such implying.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD isolating
> > > > > > > > > > the guest process and
> > > > > > > > > all of that just works on efficiency and equivalence
> > > > > > > > > principle already for a decade now without any trap+emulation.
> > > > > > > > > >
> > > > > > > > > > > > When virtio passthrough device is in guest, it has
> > > > > > > > > > > > all its PASID
> > > > > > > accessible.
> > > > > > > > > > > >
> > > > > > > > > > > > All these is large deviation from current
> > > > > > > > > > > > discussion of this series, so I will keep
> > > > > > > > > > > it short.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Regardless it is not relevant to passthrough
> > > > > > > > > > > > > > mode as PASID is yet another
> > > > > > > > > > > > > resource.
> > > > > > > > > > > > > > And for some cpu if it is trapped, it is
> > > > > > > > > > > > > > generic layer, that does not require virtio
> > > > > > > > > > > > > involvement.
> > > > > > > > > > > > > > So virtio interface asking to trap something
> > > > > > > > > > > > > > because generic facility has done
> > > > > > > > > > > > > in not the approach.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This misses the point of PASID. How to use PASID
> > > > > > > > > > > > > is totally device
> > > > > > > > > specific.
> > > > > > > > > > > > Sure, and how to virtualize vPASID/pPASID is
> > > > > > > > > > > > platform specific as single PASID
> > > > > > > > > > > can be used by multiple devices and process.
> > > > > > > > > > >
> > > > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Capabilities of #2 is generic across all
> > > > > > > > > > > > > > > > pci devices, so it will be handled by the
> > > > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > > > ATS/PRI cap is also generic manner handled
> > > > > > > > > > > > > > > > by the HV and PCI
> > > > > > > > > device.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > No, ATS/PRI requires the cooperation from the
> vIOMMU.
> > > > > > > > > > > > > > > You can simply do ATS/PRI passthrough but
> > > > > > > > > > > > > > > with an emulated
> > > > > > > vIOMMU.
> > > > > > > > > > > > > > And that is not the reason for virtio device
> > > > > > > > > > > > > > to build
> > > > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > > > passthrough member devices.
> > > > > > > > > > > > >
> > > > > > > > > > > > > vIOMMU is emulated by hypervisor with a PRI
> > > > > > > > > > > > > queue,
> > > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > > >
> > > > > > > > > > > Shouldn't it arrive at platform IOMMU first? The
> > > > > > > > > > > path should be PRI
> > > > > > > > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI
> > > > > > > > > > > -> -> guest
> > > > > IOMMU.
> > > > > > > > > > >
> > > > > > > > > > Above sequence seems write.
> > > > > > > > > >
> > > > > > > > > > > And things will be more complicated when (v)PASID is used.
> > > > > > > > > > > So you can't simply let PRI go directly to the guest
> > > > > > > > > > > with the current
> > > > > > > architecture.
> > > > > > > > > > >
> > > > > > > > > > In current architecture of the pci VF, PRI does not go
> > > > > > > > > > directly to the
> > > > > guest.
> > > > > > > > > > (and that is not reason to trap and emulate other things).
> > > > > > > > >
> > > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we will
> > > > > > > > > probably trap other things in the future like PASID assignment.
> > > > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > > > >
> > > > > > > It's not about the capability, it's about the whole process
> > > > > > > of PRI request handling. We've agreed that the PRI request
> > > > > > > needs to be trapped by the hypervisor and then delivered to the
> vIOMMU.
> > > > > > >
> > > > > >
> > > > > > > > Trap+emulation done in generic manner without involving
> > > > > > > > Trap+virtio or other
> > > > > > > device types.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > how can you pass through a hardware PRI request
> > > > > > > > > > > > > to a guest directly without trapping it
> > > > > > > > > then?
> > > > > > > > > > > > > What's more, PCIE allows the PRI to be done in a
> > > > > > > > > > > > > vendor
> > > > > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > > > for virtio?
> > > > > > > > > > > > >
> > > > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > > > Do you have a reference to the ECN that enables
> > > > > > > > > > > > vendor specific way of PRI? I
> > > > > > > > > > > would like to read it.
> > > > > > > > > > >
> > > > > > > > > > > I mean it doesn't forbid us to build a virtio
> > > > > > > > > > > specific interface for I/O page fault report and recovery.
> > > > > > > > > > >
> > > > > > > > > > So PRI of PCI does not allow. It is ODP kind of
> > > > > > > > > > technique you meant
> > > > > above.
> > > > > > > > > > Yes one can build.
> > > > > > > > > > Ok. unrelated to device migration, so I will park this
> > > > > > > > > > good discussion for
> > > > > > > later.
> > > > > > > > >
> > > > > > > > > That's fine.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > > > > > > > >
> > > > > > > > > > > Probably.
> > > > > > > > > > >
> > > > > > > > > > > > PRI will directly go to the guest driver, and
> > > > > > > > > > > > guest would interact with IOMMU
> > > > > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > > > > >
> > > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > > >
> > > > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > > > But again these PCI-SIG extensions of PASID are not
> > > > > > > > > > related to device
> > > > > > > > > migration, so I am differing it.
> > > > > > > > > >
> > > > > > > > > > > > For PRI in vendor specific way needs a separate
> > > > > > > > > > > > discussion. It is not related to
> > > > > > > > > > > live migration.
> > > > > > > > > > >
> > > > > > > > > > > PRI itself is not related. But the point is, you
> > > > > > > > > > > can't simply pass through ATS/PRI now.
> > > > > > > > > > >
> > > > > > > > > > Ah ok. the whole 4K PCI config space where ATS/PRI
> > > > > > > > > > capabilities are located
> > > > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > > > So?
> > > > > > > > > > So do we start emulating virito interfaces too for passthrough?
> > > > > > > > > > No.
> > > > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > > > Sure why not?
> > > > > > > > >
> > > > > > > > > Then let's not limit your proposal to be used by "passthrough"
> only?
> > > > > > > > One can possibly build some variant of the existing virtio
> > > > > > > > member device
> > > > > > > using same owner and member scheme.
> > > > > > >
> > > > > > > It's not about the member/owner, it's about e.g whether the
> > > > > > > hypervisor can trap and emulate.
> > > > > > >
> > > > > > > I've pointed out that what you invent here is actually a
> > > > > > > partial new transport, for example, a hypervisor can trap
> > > > > > > and use things like device context in PF to bypass the
> > > > > > > registers in VF. This is the idea of
> > > > > transport commands/q.
> > > > > > >
> > > > > > I will not mix transport commands which are mainly useful for
> > > > > > actual device
> > > > > operation for SIOV only for backward compatibility that too optionally.
> > > > > > One may still choose to have virtio common and device config
> > > > > > in MMIO
> > > > > ofcourse at lower scale.
> > > > > >
> > > > > > Anyway, mixing migration context with actual SIOV specific
> > > > > > thing is not correct
> > > > > as device context is read/write incremental values.
> > > > >
> > > > > SIOV is transport level stuff, the transport virtqueue is
> > > > > designed in a way that is general enough to cover it. Let's not shift
> concepts.
> > > > >
> > > > Such TVQ is only for backward compatible vPCI composition.
> > > > For ground up work such TVQ must not be done through the owner
> device.
> > >
> > > That's the idea actually.
> > >
> > > > Each SIOV device to have its own channel to communicate directly
> > > > to the
> > > device.
> > > >
> > > > > One thing that you ignore is that, hypervisor can use what you
> > > > > invented as a transport for VF, no?
> > > > >
> > > > No. by design,
> > >
> > > It works like hypervisor traps the virito config and forwards it to
> > > admin virtqueue and starts the device via device context.
> > It needs more granular support than the management framework of device
> context.
> 
> It doesn't otherwise it is a design defect as you can't recover the device context
> in the destination.
> 
> Let me give you an example:
> 
> 1) in the case of live migration, dst receive migration byte flows and convert
> them into device context
> 2) in the case of transporting, hypervisor traps virtio config and convert them
> into the device context
> 
> I don't see anything different in this case. Or can you give me an example?
In #1 the dst receives byte flows one or more times, and the byte flows can be large.
So a given flow does not always contain everything; it contains only the new delta of the device context.
For example, the VQ configuration is exchanged once between src and dst, but the VQ avail and used indices may be updated multiple times.
So here the hypervisor does not want to read any specific set of fields, and it is not parsing them either.
To the hypervisor it is just a byte stream.

As opposed to that, in the case of transport, the guest explicitly asks to read or write specific bytes.
Therefore, it is not incremental.

Additionally, if the hypervisor has put a trap on virtio config, then because the member device already has the interface for virtio config, the hypervisor can directly write/read from the virtual config to the member's config space, without going through the device context, right?
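
To make the contrast concrete, here is a minimal Python sketch. It is only an illustration of the two access patterns described above: MemberDevice, config_read, config_write, context_write and forward_context are hypothetical names, not anything defined by the virtio spec or by this series.

```python
# Illustrative only: these classes and method names are hypothetical and are
# not defined by the virtio spec or by this series.

class MemberDevice:
    """Toy member device with a config space and an opaque device context."""

    def __init__(self, config_size=64):
        self.config = bytearray(config_size)  # addressed by offset
        self.context_chunks = []              # opaque chunks, in arrival order

    # Transport-style access: the caller says exactly which bytes it wants.
    def config_read(self, offset, length):
        return bytes(self.config[offset:offset + length])

    def config_write(self, offset, data):
        self.config[offset:offset + len(data)] = data

    # Migration-style access: the caller passes bytes it never parses.
    def context_write(self, chunk):
        self.context_chunks.append(bytes(chunk))


def forward_context(src_chunks, dst):
    """Hypervisor side of migration: relay each delta as-is, unparsed."""
    count = 0
    for chunk in src_chunks:
        dst.context_write(chunk)
        count += 1
    return count


dst = MemberDevice()
# The VQ configuration goes once; index updates may follow several times.
relayed = forward_context([b"vq-config", b"idx-update-1", b"idx-update-2"], dst)
print(relayed)  # number of chunks relayed

# A trapped guest config access names an offset and a length instead.
dst.config_write(4, b"\x01\x02")
print(dst.config_read(4, 2))
```

In this sketch the relay loop never looks inside a chunk, while every config access carries an explicit offset, which is the difference between the two cases being argued here.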

> 
> >
> > >
> > > > it is not good idea to overload management commands with actual
> > > > run time
> > > guest commands.
> > > > The device context read writes are largely for incremental updates.
> > >
> > > It doesn't matter if it is incremental or not but
> > >
> > It does because you want different functionality only for purpose of backward
> compatibility.
> > That also if the device does not offer them as portion of MMIO BAR.
> 
> I don't see how it is related to the "incremental part".
> 
> >
> > > 1) the function is there
> > > 2) hypervisor can use that function if they want and virtio (spec)
> > > can't forbid that
> > >
> > It is not about forbidding or supporting.
> > Its about what functionality to use for management plane and guest plane.
> > Both have different needs.
> 
> People can have different views, there's nothing we can prevent a hypervisor
> from using it as a transport as far as I can see.
The device context write command can be used (or probably abused) to do a write, but I fail to see why one would use it,
because the member device already has the interface to do config read/write and it is accessible to the hypervisor.

A read as_is using the device context cannot be done, because the caller is not explicitly asking what to read.
And the interface does not have that, because the member device has it.

So let's find out whether an incremental bit is needed in the device context read command, or optionally some bits to ask explicitly what to read.

> 
> >
> > > >
> > > > For VF driver it has own direct channel via its own BAR to talk to the
> device.
> > > So no need to transport via PF.
> > > > For SIOV for backward compat vPCI composition, it may be needed.
> > > > Hard to say, if that can be memory mapped as well on the BAR of the PF.
> > > > We have seen one device supporting it outside of the virtio.
> > > > For scale anyway, one needs to use the device own cvq for complex
> > > configuration.
> > >
> > > That's the idea but I meant your current proposal overlaps those functions.
> > >
> > Not really. One can have simple virtio config space access read/write
> functionality, in addition to what is done here.
> > And that is still fine. One is doing proxying for guest.
> > Management plane is doing more than just register proxy.
> 
> See above, let's figure out whether it is possible as a transport first then.
> 
Right, let's figure it out.

I would still prefer not to mix management commands with transport commands.
Commands are cheap in nature. For transport, if needed, there can be explicit commands.

> >
> > > >
> > > > > >
> > > > > > > > If for that is some admin commands are missing, may be one
> > > > > > > > can add
> > > > > them.
> > > > > > >
> > > > > > > I would then build the device context commands on top of the
> > > > > > > transport commands/q, then it would be complete.
> > > > > > >
> > > > > > > > No need to step on toes of use cases as they are different...
> > > > > > > >
> > > > > > > > > I've shown you that
> > > > > > > > >
> > > > > > > > > 1) you can't easily say you can pass through all the
> > > > > > > > > virtio facilities
> > > > > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > > > > >
> > > > > > > > It is not, it is well defined in v3, v2.
> > > > > > > > One can continue to argue and keep defining the variant
> > > > > > > > and still call it data
> > > > > > > path acceleration and then claim it as passthrough ...
> > > > > > > > But I won't debate this anymore as its just non-technical
> > > > > > > > aspects of least
> > > > > > > interest.
> > > > > > >
> > > > > > > You use this terminology in the spec which is all about
> > > > > > > technical, and you think how to define it is a matter of
> > > > > > > non-technical. This is self-contradictory. If you fail, it
> > > > > > > probably means it's
> > > ambiguous.
> > > > > > > Let's don't use that terminology.
> > > > > > >
> > > > > > What it means is described in theory of operation.
> > > > > >
> > > > > > > > We have technical tasks and more improved specs to update
> > > > > > > > going
> > > > > forward.
> > > > > > >
> > > > > > > It's a burden to do the synchronization.
> > > > > > We have discussed this.
> > > > > > In current proposed the member device is not bifurcated,
> > > > >
> > > > > It is. Part of the functions were carried via the PCI interface,
> > > > > some are carried via owner. You end up with two drivers to drive
> > > > > the
> > > devices.
> > > > >
> > > > Nop.
> > > > All admin work of device migration is carried out via the owner device.
> > > > All guest triggered work is carried out using VF itself.
> > >
> > > Guests don't (or can't) care about how the hypervisor is structured.
> > For passthrough mode, it just cannot be structured inside the VF.
> 
> Well, again, we are talking about different things.
> 
> >
> > > So we're discussing the view of device, member devices needs to
> > > server for
> > >
> > > 1) request from the transport (it's guest in your context)
> > > 2) request from the owner
> >
> > Doing #2 of the owner on the member device functionality do not work when
> hypervisor do not have access to the member device.
> 
> I don't get here, isn't 2) just what we invent for admin commands?
> Driver sends commands to the owner, owner forward those requests to the
> member?
I am lost with the term "driver" when it has no guest/hypervisor prefix.

In one model,
the member device does everything through its native interface = virtio config and device space, cvq, data vqs etc.
Here the member device does not forward anything to its owner.

The live migration hypervisor driver, which has the knowledge of the live migration flow, accesses the owner device and gets the member's side band information to control it.
So the member driver does not forward anything here to the owner driver.
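
As a rough sketch of this model, assuming nothing beyond what is described above, the two paths can be written as follows. Member, Owner, native_access and admin_cmd are hypothetical names chosen for illustration, and the "suspend"/"resume" strings only stand in for whatever admin commands the owner actually exposes.

```python
# Illustrative only: hypothetical names, not the interfaces from this series.

class Member:
    """Member device (e.g. a VF) as seen through its own native interface."""

    def __init__(self, member_id):
        self.member_id = member_id
        self.suspended = False
        self.native_ops = []  # operations issued by the guest driver

    def native_access(self, op):
        # Guest driver path: config space, cvq, data vqs. No owner involved.
        self.native_ops.append(op)


class Owner:
    """Owner device (e.g. the PF) used only by the hypervisor migration driver."""

    def __init__(self, members):
        self.members = {m.member_id: m for m in members}
        self.admin_log = []  # side band commands, kept separate from guest ops

    def admin_cmd(self, member_id, cmd):
        member = self.members[member_id]
        if cmd == "suspend":
            member.suspended = True
        elif cmd == "resume":
            member.suspended = False
        else:
            raise ValueError("unknown admin command: " + cmd)
        self.admin_log.append((member_id, cmd))


vf = Member(member_id=1)
pf = Owner([vf])

vf.native_access("cvq: guest control command")  # guest driver, via the VF itself
pf.admin_cmd(1, "suspend")                      # hypervisor driver, via the owner
```

The point of the sketch is that the two logs never mix: the guest path never touches the owner, and the migration path never goes through the member's native interface.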



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-09  6:25                                                                                 ` Parav Pandit
@ 2023-11-13  3:32                                                                                   ` Jason Wang
  2023-11-15 17:39                                                                                     ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-13  3:32 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 7, 2023 9:35 AM
> >
> > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, November 6, 2023 12:05 PM
> > > >
> > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > > > >
> > > > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > > > >
> > > > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > > > >
> > > > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of
> > > > > > > > > > > > Jason Wang
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit
> > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > > > Because for passthrough, the hypervisor is not
> > > > > > > > > > > > > > > involved in dealing with VQ at
> > > > > > > > > > > > > > all.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ok, so if I understand correctly, you are saying
> > > > > > > > > > > > > > your design can't work for the case of PASID assignment.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > No. PASID assignment will happen from the guest
> > > > > > > > > > > > > for its own use and device
> > > > > > > > > > > > migration will just work fine because device context
> > > > > > > > > > > > will capture
> > > > this.
> > > > > > > > > > > >
> > > > > > > > > > > > It's not about device context. We're discussing
> > > > > > > > > > > > "passthrough",
> > > > no?
> > > > > > > > > > > >
> > > > > > > > > > > Not sure, we are discussing same.
> > > > > > > > > > > A member device is passthrough to the guest, dealing
> > > > > > > > > > > with its own PASIDs and
> > > > > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > > > > So VQ context captured by the hypervisor, will have
> > > > > > > > > > > some PASID attached to
> > > > > > > > > > this VQ.
> > > > > > > > > > > Device context will be updated.
> > > > > > > > > > >
> > > > > > > > > > > > You want all virtio stuff to be "passthrough", but
> > > > > > > > > > > > assigning a PASID to a specific virtqueue in the
> > > > > > > > > > > > guest must be
> > > > trapped.
> > > > > > > > > > > >
> > > > > > > > > > > No. PASID assignment to a specific virtqueue in the
> > > > > > > > > > > guest must go directly
> > > > > > > > > > from guest to device.
> > > > > > > > > >
> > > > > > > > > > This works like setting CR3, you can't simply let it go
> > > > > > > > > > from guest to
> > > > host.
> > > > > > > > > >
> > > > > > > > > > Host IOMMU driver needs to know the PASID to program the
> > > > > > > > > > IO page tables correctly.
> > > > > > > > > >
> > > > > > > > > This will be done by the IOMMU.
> > > > > > > > >
> > > > > > > > > > > When guest iommu may need to communicate anything for
> > > > > > > > > > > this PASID, it will
> > > > > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > > > > >
> > > > > > > > > > Let's say using PASID X for queue 0, this knowledge is
> > > > > > > > > > beyond the IOMMU scope but belongs to virtio. Or please
> > > > > > > > > > explain how it can work when it goes directly from guest to
> > device.
> > > > > > > > > >
> > > > > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > > > > >
> > > > > > > > It has one.
> > > > > > > >
> > > > > > > > > For ok for theory sake it is there.
> > > > > > > > >
> > > > > > > > > Virtio driver will assign the PASID directly from guest
> > > > > > > > > driver to device using a
> > > > > > > > create_vq(pasid=X) command.
> > > > > > > > > Same process is somehow attached the PASID by the guest OS.
> > > > > > > > > The whole PASID range is known to the hypervisor when the
> > > > > > > > > device is handed
> > > > > > > > over to the guest VM.
> > > > > > > >
> > > > > > > > How can it know?
> > > > > > > >
> > > > > > > > > So PASID mapping is setup by the hypervisor IOMMU at this point.
> > > > > > > >
> > > > > > > > You disallow the PASID to be virtualized here. What's more,
> > > > > > > > such a PASID passthrough has security implications.
> > > > > > > >
> > > > > > > No. virtio spec is not disallowing. At least for sure, this
> > > > > > > series is not the
> > > > one.
> > > > > > > My main point is, virtio device interface will not be the
> > > > > > > source of hypercall to
> > > > > > program IOMMU in the hypervisor.
> > > > > > > It is something to be done by IOMMU side.
> > > > > >
> > > > > > So unless vPASID can be used by the hardware you need to trap
> > > > > > the mapping from a PASID to a virtqueue. Then you need virtio
> > > > > > specific
> > > > knowledge.
> > > > > >
> > > > > vPASID by hardware is unlikely to be used by hw PCI EP devices at
> > > > > least in any
> > > > near term future.
> > > > > This requires either vPASID to pPASID table in device or in IOMMU.
> > > >
> > > > So we are on the same page.
> > > >
> > > > Claiming a method that can only work for passthrough or emulation is not
> > good.
> > > > We all know virtualization is passthrough + emulation.
> > > Again, I agree but I wont generalize it here.
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > > Again, we are talking about different things, I've tried to
> > > > > > > > show you that there are cases that passthrough can't work
> > > > > > > > but if you think the only way for migration is to use
> > > > > > > > passthrough in every case, you will
> > > > > > probably fail.
> > > > > > > >
> > > > > > > I didn't say only way for migration is passthrough.
> > > > > > > Passthrough is clearly one way.
> > > > > > > Other ways may be possible.
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > There are works ongoing to make vPASID work
> > > > > > > > > > > > > > > > for the guest like
> > > > > > > > > > vSVA.
> > > > > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Great, you find another limitation of
> > > > > > > > > > > > > > "passthrough" by
> > > > yourself.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > No. it is not the limitation it is just the way it
> > > > > > > > > > > > > does not need complex SVA to
> > > > > > > > > > > > split the device for unrelated usage.
> > > > > > > > > > > >
> > > > > > > > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > > > > > > > >
> > > > > > > > > > > He he, I am not limiting, again misunderstanding or
> > > > > > > > > > > wrong
> > > > attribution.
> > > > > > > > > > > I explained that hypervisor for passthrough does not need SVA.
> > > > > > > > > > > Guest can do anything it wants from the guest OS with
> > > > > > > > > > > the member
> > > > > > > > device.
> > > > > > > > > >
> > > > > > > > > > Ok, so the point stills, see above.
> > > > > > > > >
> > > > > > > > > I don’t think so. The guest owns its PASID space
> > > > > > > >
> > > > > > > > Again, vPASID to PASID can't be done hardware unless I miss
> > > > > > > > some recent features of IOMMUs.
> > > > > > > >
> > > > > > > Cpu vendors have different way of doing vPASID to pPASID.
> > > > > >
> > > > > > At least for the current version of major IOMMU vendors, such
> > > > > > translation (aka PASID remapping) is not implemented in the
> > > > > > hardware so it needs to be trapped first.
> > > > > >
> > > > > Right. So it is really far in future, atleast few years away.
> > > > >
> > > > > > > It is still an early space for virtio.
> > > > > > >
> > > > > > > > > and directly communicates like any other device attribute.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > Each passthrough device has PASID from its own
> > > > > > > > > > > > > > > space fully managed by the
> > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > Some cpu required vPASID and SIOV is not going
> > > > > > > > > > > > > > > this way
> > > > > > anmore.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Then how to migrate? Invent a full set of
> > > > > > > > > > > > > > something else through another giant series like
> > > > > > > > > > > > > > this to migrate to the SIOV
> > > > > > thing?
> > > > > > > > > > > > > > That's a mess for
> > > > > > > > > > > > sure.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > SIOV will for sure reuse most or all parts of this
> > > > > > > > > > > > > work, almost entirely
> > > > > > > > as_is.
> > > > > > > > > > > > > vPASID is cpu/platform specific things not part of
> > > > > > > > > > > > > the SIOV
> > > > devices.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If at all it is done, it will be done from
> > > > > > > > > > > > > > > > > the guest by the driver using virtio
> > > > > > > > > > > > > > > > interface.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Then you need to trap. Such things couldn't
> > > > > > > > > > > > > > > > be passed through to guests
> > > > > > > > > > > > > > directly.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Only PASID capability is trapped. PASID
> > > > > > > > > > > > > > > allocation and usage is directly from
> > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > How can you achieve this? Assigning a PAISD to a
> > > > > > > > > > > > > > device is completely
> > > > > > > > > > > > > > device(virtio) specific. How can you use a
> > > > > > > > > > > > > > general layer without the knowledge of virtio to trap that?
> > > > > > > > > > > > > When one wants to map vPASID to pPASID a platform
> > > > > > > > > > > > > needs to be
> > > > > > > > > > involved.
> > > > > > > > > > > >
> > > > > > > > > > > > I'm not talking about how to map vPASID to pPASID,
> > > > > > > > > > > > it's out of the scope of virtio. I'm talking about
> > > > > > > > > > > > assigning a vPASID to a specific virtqueue or other
> > > > > > > > > > > > virtio function in the
> > > > guest.
> > > > > > > > > > > >
> > > > > > > > > > > That can be done in the guest. The key is guest wont
> > > > > > > > > > > know that it is dealing
> > > > > > > > > > with vPASID.
> > > > > > > > > > > It will follow the same principle from your paper of
> > > > > > > > > > > equivalency, where virtio
> > > > > > > > > > software layer will assign PASID to VQ and communicate to
> > device.
> > > > > > > > > > >
> > > > > > > > > > > Anyway, all of this just digression from current series.
> > > > > > > > > >
> > > > > > > > > > It's not, as you mention that only MSI-X is trapped, I
> > > > > > > > > > give you another
> > > > > > one.
> > > > > > > > > >
> > > > > > > > > PASID access from the guest to be done fully by the guest IOMMU.
> > > > > > > > > Not by virtio devices.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > You need a virtio specific queue or capability to
> > > > > > > > > > > > assign a PASID to a specific virtqueue, and that
> > > > > > > > > > > > can't be done without trapping and without virito specific
> > knowledge.
> > > > > > > > > > > >
> > > > > > > > > > > I disagree. PASID assignment to a virqueue in future
> > > > > > > > > > > from guest virtio driver to
> > > > > > > > > > device is uniform method.
> > > > > > > > > > > Whether its PF assigning PASID to VQ of self, Or VF
> > > > > > > > > > > driver in the guest assigning PASID to VQ.
> > > > > > > > > > >
> > > > > > > > > > > All same.
> > > > > > > > > > > Only IOMMU layer hypercalls will know how to deal with
> > > > > > > > > > > PASID assignment at
> > > > > > > > > > platform layer to setup the domain etc table.
> > > > > > > > > > >
> > > > > > > > > > > And this is way beyond our device migration discussion.
> > > > > > > > > > > By any means, if you were implying that somehow vq to
> > > > > > > > > > > PASID assignment
> > > > > > > > > > _may_ need trap+emulation, hence whole device migration
> > > > > > > > > > to depend on some
> > > > > > > > > > trap+emulation, than surely, than I do not agree to it.
> > > > > > > > > >
> > > > > > > > > > See above.
> > > > > > > > > >
> > > > > > > > > Yeah, I disagree to such implying.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD isolating
> > > > > > > > > > > the guest process and
> > > > > > > > > > all of that just works on efficiency and equivalence
> > > > > > > > > > principle already for a decade now without any trap+emulation.
> > > > > > > > > > >
> > > > > > > > > > > > > When virtio passthrough device is in guest, it has
> > > > > > > > > > > > > all its PASID
> > > > > > > > accessible.
> > > > > > > > > > > > >
> > > > > > > > > > > > > All these is large deviation from current
> > > > > > > > > > > > > discussion of this series, so I will keep
> > > > > > > > > > > > it short.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regardless it is not relevant to passthrough
> > > > > > > > > > > > > > > mode as PASID is yet another
> > > > > > > > > > > > > > resource.
> > > > > > > > > > > > > > > And for some cpu if it is trapped, it is
> > > > > > > > > > > > > > > generic layer, that does not require virtio
> > > > > > > > > > > > > > involvement.
> > > > > > > > > > > > > > > So virtio interface asking to trap something
> > > > > > > > > > > > > > > because generic facility has done
> > > > > > > > > > > > > > in not the approach.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This misses the point of PASID. How to use PASID
> > > > > > > > > > > > > > is totally device
> > > > > > > > > > specific.
> > > > > > > > > > > > > Sure, and how to virtualize vPASID/pPASID is
> > > > > > > > > > > > > platform specific as single PASID
> > > > > > > > > > > > can be used by multiple devices and process.
> > > > > > > > > > > >
> > > > > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Capabilities of #2 is generic across all
> > > > > > > > > > > > > > > > > pci devices, so it will be handled by the
> > > > > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > > > > ATS/PRI cap is also generic manner handled
> > > > > > > > > > > > > > > > > by the HV and PCI
> > > > > > > > > > device.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > No, ATS/PRI requires the cooperation from the
> > vIOMMU.
> > > > > > > > > > > > > > > > You can simply do ATS/PRI passthrough but
> > > > > > > > > > > > > > > > with an emulated
> > > > > > > > vIOMMU.
> > > > > > > > > > > > > > > And that is not the reason for virtio device
> > > > > > > > > > > > > > > to build
> > > > > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > > > > passthrough member devices.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > vIOMMU is emulated by hypervisor with a PRI
> > > > > > > > > > > > > > queue,
> > > > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > > > >
> > > > > > > > > > > > Shouldn't it arrive at platform IOMMU first? The
> > > > > > > > > > > > path should be PRI
> > > > > > > > > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI
> > > > > > > > > > > > -> -> guest
> > > > > > IOMMU.
> > > > > > > > > > > >
> > > > > > > > > > > > Above sequence seems right.
> > > > > > > > > > >
> > > > > > > > > > > > And things will be more complicated when (v)PASID is used.
> > > > > > > > > > > > So you can't simply let PRI go directly to the guest
> > > > > > > > > > > > with the current
> > > > > > > > architecture.
> > > > > > > > > > > >
> > > > > > > > > > > In current architecture of the pci VF, PRI does not go
> > > > > > > > > > > directly to the
> > > > > > guest.
> > > > > > > > > > > (and that is not reason to trap and emulate other things).
> > > > > > > > > >
> > > > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we will
> > > > > > > > > > probably trap other things in the future like PASID assignment.
> > > > > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > > > > >
> > > > > > > > It's not about the capability, it's about the whole process
> > > > > > > > of PRI request handling. We've agreed that the PRI request
> > > > > > > > needs to be trapped by the hypervisor and then delivered to the
> > vIOMMU.
> > > > > > > >
> > > > > > >
> > > > > > > > > Trap+emulation done in generic manner without involving
> > > > > > > > > Trap+virtio or other
> > > > > > > > device types.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > how can you pass through a hardware PRI request
> > > > > > > > > > > > > > to a guest directly without trapping it
> > > > > > > > > > then?
> > > > > > > > > > > > > > What's more, PCIE allows the PRI to be done in a
> > > > > > > > > > > > > > vendor
> > > > > > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > > > > for virtio?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > > > > Do you have a reference to the ECN that enables
> > > > > > > > > > > > > vendor specific way of PRI? I
> > > > > > > > > > > > would like to read it.
> > > > > > > > > > > >
> > > > > > > > > > > > I mean it doesn't forbid us to build a virtio
> > > > > > > > > > > > specific interface for I/O page fault report and recovery.
> > > > > > > > > > > >
> > > > > > > > > > > So PRI of PCI does not allow. It is ODP kind of
> > > > > > > > > > > technique you meant
> > > > > > above.
> > > > > > > > > > > Yes one can build.
> > > > > > > > > > > Ok. unrelated to device migration, so I will park this
> > > > > > > > > > > good discussion for
> > > > > > > > later.
> > > > > > > > > >
> > > > > > > > > > That's fine.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > > > > > > > > >
> > > > > > > > > > > > Probably.
> > > > > > > > > > > >
> > > > > > > > > > > > > PRI will directly go to the guest driver, and
> > > > > > > > > > > > > guest would interact with IOMMU
> > > > > > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > > > > > >
> > > > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > > > >
> > > > > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > > > > But again these PCI-SIG extensions of PASID are not
> > > > > > > > > > > related to device
> > > > > > > > > > > migration, so I am deferring it.
> > > > > > > > > > >
> > > > > > > > > > > > > For PRI in vendor specific way needs a separate
> > > > > > > > > > > > > discussion. It is not related to
> > > > > > > > > > > > live migration.
> > > > > > > > > > > >
> > > > > > > > > > > > PRI itself is not related. But the point is, you
> > > > > > > > > > > > can't simply pass through ATS/PRI now.
> > > > > > > > > > > >
> > > > > > > > > > > Ah ok. the whole 4K PCI config space where ATS/PRI
> > > > > > > > > > > capabilities are located
> > > > > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > > > > So?
> > > > > > > > > > > So do we start emulating virito interfaces too for passthrough?
> > > > > > > > > > > No.
> > > > > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > > > > Sure why not?
> > > > > > > > > >
> > > > > > > > > > Then let's not limit your proposal to be used by "passthrough"
> > only?
> > > > > > > > > One can possibly build some variant of the existing virtio
> > > > > > > > > member device
> > > > > > > > using same owner and member scheme.
> > > > > > > >
> > > > > > > > It's not about the member/owner, it's about e.g whether the
> > > > > > > > hypervisor can trap and emulate.
> > > > > > > >
> > > > > > > > I've pointed out that what you invent here is actually a
> > > > > > > > partial new transport, for example, a hypervisor can trap
> > > > > > > > and use things like device context in PF to bypass the
> > > > > > > > registers in VF. This is the idea of
> > > > > > transport commands/q.
> > > > > > > >
> > > > > > > I will not mix transport commands which are mainly useful for
> > > > > > > actual device
> > > > > > operation for SIOV only for backward compatibility that too optionally.
> > > > > > > One may still choose to have virtio common and device config
> > > > > > > in MMIO
> > > > > > ofcourse at lower scale.
> > > > > > >
> > > > > > > Anyway, mixing migration context with actual SIOV specific
> > > > > > > thing is not correct
> > > > > > as device context is read/write incremental values.
> > > > > >
> > > > > > SIOV is transport level stuff, the transport virtqueue is
> > > > > > designed in a way that is general enough to cover it. Let's not shift
> > concepts.
> > > > > >
> > > > > Such TVQ is only for backward compatible vPCI composition.
> > > > > For ground up work such TVQ must not be done through the owner
> > device.
> > > >
> > > > That's the idea actually.
> > > >
> > > > > Each SIOV device to have its own channel to communicate directly
> > > > > to the
> > > > device.
> > > > >
> > > > > > One thing that you ignore is that, hypervisor can use what you
> > > > > > invented as a transport for VF, no?
> > > > > >
> > > > > No. by design,
> > > >
> > > > It works like hypervisor traps the virito config and forwards it to
> > > > admin virtqueue and starts the device via device context.
> > > It needs more granular support than the management framework of device
> > context.
> >
> > It doesn't otherwise it is a design defect as you can't recover the device context
> > in the destination.
> >
> > Let me give you an example:
> >
> > 1) in the case of live migration, dst receive migration byte flows and convert
> > them into device context
> > 2) in the case of transporting, hypervisor traps virtio config and convert them
> > into the device context
> >
> > I don't see anything different in this case. Or can you give me an example?
> In #1 dst received byte flows one or multiple times.

How can this be different?

Transport can also receive initial state incrementally.

> And byte flows can be large.

So when it is used as a transport, the byte flow is just not that large,
that's all. If it can work with a large byte flow, why can't it work
with a small one?

> So it does not always contain everything. It only contains the new delta of the device context.

Isn't that just how the current PCI transport works?

The guest configures the following one by one:

1) vq size
2) vq addresses
3) MSI-X

etc?

> For example, VQ configuration is exchanged once between src and dst.
> But VQ avail and used index may be updated multiple times.

If it can work with multiple updates, why can't it work if we just
update it once?
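
To make the argument above concrete, here is a minimal sketch, not taken
from the series. It assumes the device context can be modelled as a set
of (field, value) records; the field names such as "vq0.avail_idx" and
the helper apply_context() are made up for illustration. It shows that
a migration-style flow (configuration once, indices several times) and
a transport-style flow (each field written once) go through the same
merge operation and end in the same state.

```python
def apply_context(state, records):
    """Merge a batch of (field, value) records into the device state."""
    for field, value in records:
        state[field] = value
    return state


# Migration-style: VQ configuration is sent once, then the avail/used
# indices are updated several times as the source keeps running.
migrated = {}
apply_context(migrated, [("vq0.size", 256), ("vq0.desc_addr", 0x1000)])
apply_context(migrated, [("vq0.avail_idx", 10), ("vq0.used_idx", 8)])
apply_context(migrated, [("vq0.avail_idx", 42), ("vq0.used_idx", 40)])

# Transport-style: every field is written exactly once, one at a time.
transported = {}
for record in [
    ("vq0.size", 256),
    ("vq0.desc_addr", 0x1000),
    ("vq0.avail_idx", 42),
    ("vq0.used_idx", 40),
]:
    apply_context(transported, [record])

print(migrated == transported)
```

In this model a single update is just the degenerate case of repeated
updates; whether the real commands behave this way depends on how the
spec defines them.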

> So here hypervisor do not want to read any specific set of fields and hypervisor is not parsing them either.
> It is just a byte stream for it.

Firstly, the spec must define the device context format so that the
hypervisor can understand which byte is what; otherwise you can't
maintain migration compatibility.
Secondly, you can't mandate how the hypervisor is written.
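
As an illustration of what "which byte is what" means in practice, here
is a minimal sketch of a self-describing byte stream. The layout is an
assumption for the example only (a little-endian u16 field type followed
by a u32 payload length); it is not the format defined by this series,
and FIELD_VQ_CFG, FIELD_VQ_INDEX, pack_field() and parse_context() are
hypothetical names.

```python
import struct

# Hypothetical header: u16 field type, u32 payload length, little-endian.
HEADER = struct.Struct("<HI")

FIELD_VQ_CFG = 1    # made-up field type for the example
FIELD_VQ_INDEX = 2  # made-up field type for the example


def pack_field(ftype, payload):
    """Serialize one field as header + payload."""
    return HEADER.pack(ftype, len(payload)) + payload


def parse_context(blob):
    """Split a byte stream into (field type, payload) pairs."""
    fields = []
    offset = 0
    while offset < len(blob):
        if len(blob) - offset < HEADER.size:
            raise ValueError("truncated field header")
        ftype, length = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        if len(blob) - offset < length:
            raise ValueError("truncated field payload")
        fields.append((ftype, blob[offset:offset + length]))
        offset += length
    return fields


blob = pack_field(FIELD_VQ_CFG, b"\x00\x01") + pack_field(FIELD_VQ_INDEX, b"")
print(parse_context(blob))
```

With a defined layout like this, a hypervisor can forward the stream
opaquely or parse it field by field; the format allows both.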

>
> As opposed to that, in case of transport, the guest explicitly asks to read or write specific bytes.
> Therefore, it is not incremental.

I'm totally lost. Which part of the transport is not incremental?

>
> Additionally, if the hypervisor has put a trap on virtio config, and because the member device already has the interface for virtio config,
>
> Hypervisor can directly write/read from the virtual config to the member's config space, without going through the device context, right?

It can do that, or it can choose not to. I don't see how that is related
to the discussion here.

>
> >
> > >
> > > >
> > > > > it is not good idea to overload management commands with actual
> > > > > run time
> > > > guest commands.
> > > > > The device context read writes are largely for incremental updates.
> > > >
> > > > It doesn't matter if it is incremental or not but
> > > >
> > > It does because you want different functionality only for purpose of backward
> > compatibility.
> > > That also if the device does not offer them as portion of MMIO BAR.
> >
> > I don't see how it is related to the "incremental part".
> >
> > >
> > > > 1) the function is there
> > > > 2) hypervisor can use that function if they want and virtio (spec)
> > > > can't forbid that
> > > >
> > > It is not about forbidding or supporting.
> > > Its about what functionality to use for management plane and guest plane.
> > > Both have different needs.
> >
> > People can have different views, there's nothing we can prevent a hypervisor
> > from using it as a transport as far as I can see.
> For device context write command, it can be used (or probably abused) to do write but I fail to see why to use it.

The function is there, you can't prevent people from doing that.

> Because member device already has the interface to do config read/write and it is accessible to the hypervisor.

Well, it looks self-contradictory again. Are you saying another set of
commands that is similar to device context is needed for non-PCI
transport?

>
> The read as_is using device context cannot be done because the caller is not explicitly asking what to read.
> And the interface does not have it, because member device has it.
>
> So let's find out whether an incremental bit is needed in the device context read command, or optionally some bits to ask explicitly what to read.
>
> >
> > >
> > > > >
> > > > > For VF driver it has own direct channel via its own BAR to talk to the
> > device.
> > > > So no need to transport via PF.
> > > > > For SIOV for backward compat vPCI composition, it may be needed.
> > > > > Hard to say, if that can be memory mapped as well on the BAR of the PF.
> > > > > We have seen one device supporting it outside of the virtio.
> > > > > For scale anyway, one needs to use the device own cvq for complex
> > > > configuration.
> > > >
> > > > That's the idea but I meant your current proposal overlaps those functions.
> > > >
> > > Not really. One can have simple virtio config space access read/write
> > functionality, in addition to what is done here.
> > > And that is still fine. One is doing proxying for guest.
> > > Management plane is doing more than just register proxy.
> >
> > See above, let's figure out whether it is possible as a transport first then.
> >
> Right. lets figure out.
>
> I would still promote to not mix management command with transport command.

It's not mixing; they are simply functional equivalents.

> Commands are cheap in nature. For transport if needed, they can be explicit commands.

It will be a partial duplication of what is being proposed here.

Thanks



>
> > >
> > > > >
> > > > > > >
> > > > > > > > > If for that is some admin commands are missing, may be one
> > > > > > > > > can add
> > > > > > them.
> > > > > > > >
> > > > > > > > I would then build the device context commands on top of the
> > > > > > > > transport commands/q, then it would be complete.
> > > > > > > >
> > > > > > > > > No need to step on toes of use cases as they are different...
> > > > > > > > >
> > > > > > > > > > I've shown you that
> > > > > > > > > >
> > > > > > > > > > 1) you can't easily say you can pass through all the
> > > > > > > > > > virtio facilities
> > > > > > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > > > > > >
> > > > > > > > > It is not, it is well defined in v3, v2.
> > > > > > > > > One can continue to argue and keep defining the variant
> > > > > > > > > and still call it data
> > > > > > > > path acceleration and then claim it as passthrough ...
> > > > > > > > > But I won't debate this anymore as its just non-technical
> > > > > > > > > aspects of least
> > > > > > > > interest.
> > > > > > > >
> > > > > > > > You use this terminology in the spec which is all about
> > > > > > > > technical, and you think how to define it is a matter of
> > > > > > > > non-technical. This is self-contradictory. If you fail, it
> > > > > > > > probably means it's
> > > > ambiguous.
> > > > > > > > Let's don't use that terminology.
> > > > > > > >
> > > > > > > What it means is described in theory of operation.
> > > > > > >
> > > > > > > > > We have technical tasks and more improved specs to update
> > > > > > > > > going
> > > > > > forward.
> > > > > > > >
> > > > > > > > It's a burden to do the synchronization.
> > > > > > > We have discussed this.
> > > > > > > In current proposed the member device is not bifurcated,
> > > > > >
> > > > > > It is. Part of the functions were carried via the PCI interface,
> > > > > > some are carried via owner. You end up with two drivers to drive
> > > > > > the
> > > > devices.
> > > > > >
> > > > > Nop.
> > > > > All admin work of device migration is carried out via the owner device.
> > > > > All guest triggered work is carried out using VF itself.
> > > >
> > > > Guests don't (or can't) care about how the hypervisor is structured.
> > > For passthrough mode, it just cannot be structured inside the VF.
> >
> > Well, again, we are talking about different things.
> >
> > >
> > > > So we're discussing the view of device, member devices needs to
> > > > server for
> > > >
> > > > 1) request from the transport (it's guest in your context)
> > > > 2) request from the owner
> > >
> > > Doing #2 of the owner on the member device functionality do not work when
> > hypervisor do not have access to the member device.
> >
> > I don't get here, isn't 2) just what we invent for admin commands?
> > Driver sends commands to the owner, owner forward those requests to the
> > member?
> I am lost with the term "driver" without a guest/hypervisor prefix.
>
> In one model,
> Member device does everything through its native interface = virtio config and device space, cvq, data vqs etc.
> Here member device do not forward anything to its owner.
>
> The live migration hypervisor driver who has the knowledge of live migration flow, accesses the owner device and get the side band member's information to control it.
> So member driver do not forward anything here to owner driver.
>





* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-13  3:32                                                                                   ` Jason Wang
@ 2023-11-15 17:39                                                                                     ` Parav Pandit
  2023-11-16  4:20                                                                                       ` Jason Wang
  2023-11-17 10:08                                                                                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-15 17:39 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, November 13, 2023 9:03 AM
> 
> On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 7, 2023 9:35 AM
> > >
> > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > >
> > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > > > > >
> > > > > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > > > > >
> > > > > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > <virtio-comment@lists.oasis- open.org> On Behalf
> > > > > > > > > > > > > Of Jason Wang
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit
> > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59
> > > > > > > > > > > > > > > > > AM
> > > > > > > > > > > > > > > > > > For passthrough PASID assignment vq is not
> needed.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > > > > Because for passthrough, the hypervisor is
> > > > > > > > > > > > > > > > not involved in dealing with VQ at
> > > > > > > > > > > > > > > all.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Ok, so if I understand correctly, you are
> > > > > > > > > > > > > > > saying your design can't work for the case of PASID
> assignment.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > No. PASID assignment will happen from the
> > > > > > > > > > > > > > guest for its own use and device
> > > > > > > > > > > > > migration will just work fine because device
> > > > > > > > > > > > > context will capture
> > > > > this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's not about device context. We're discussing
> > > > > > > > > > > > > "passthrough",
> > > > > no?
> > > > > > > > > > > > >
> > > > > > > > > > > > Not sure, we are discussing same.
> > > > > > > > > > > > A member device is passthrough to the guest,
> > > > > > > > > > > > dealing with its own PASIDs and
> > > > > > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > > > > > So VQ context captured by the hypervisor, will
> > > > > > > > > > > > have some PASID attached to
> > > > > > > > > > > this VQ.
> > > > > > > > > > > > Device context will be updated.
> > > > > > > > > > > >
> > > > > > > > > > > > > You want all virtio stuff to be "passthrough",
> > > > > > > > > > > > > but assigning a PASID to a specific virtqueue in
> > > > > > > > > > > > > the guest must be
> > > > > trapped.
> > > > > > > > > > > > >
> > > > > > > > > > > > No. PASID assignment to a specific virtqueue in
> > > > > > > > > > > > the guest must go directly
> > > > > > > > > > > from guest to device.
> > > > > > > > > > >
> > > > > > > > > > > This works like setting CR3, you can't simply let it
> > > > > > > > > > > go from guest to
> > > > > host.
> > > > > > > > > > >
> > > > > > > > > > > Host IOMMU driver needs to know the PASID to program
> > > > > > > > > > > the IO page tables correctly.
> > > > > > > > > > >
> > > > > > > > > > This will be done by the IOMMU.
> > > > > > > > > >
> > > > > > > > > > > > When guest iommu may need to communicate anything
> > > > > > > > > > > > for this PASID, it will
> > > > > > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > > > > > >
> > > > > > > > > > > Let's say using PASID X for queue 0, this knowledge
> > > > > > > > > > > is beyond the IOMMU scope but belongs to virtio. Or
> > > > > > > > > > > please explain how it can work when it goes directly
> > > > > > > > > > > from guest to
> > > device.
> > > > > > > > > > >
> > > > > > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > > > > > >
> > > > > > > > > It has one.
> > > > > > > > >
> > > > > > > > > > For ok for theory sake it is there.
> > > > > > > > > >
> > > > > > > > > > Virtio driver will assign the PASID directly from
> > > > > > > > > > guest driver to device using a
> > > > > > > > > create_vq(pasid=X) command.
> > > > > > > > > > Same process is somehow attached the PASID by the guest OS.
> > > > > > > > > > The whole PASID range is known to the hypervisor when
> > > > > > > > > > the device is handed
> > > > > > > > > over to the guest VM.
> > > > > > > > >
> > > > > > > > > How can it know?
> > > > > > > > >
> > > > > > > > > > So PASID mapping is setup by the hypervisor IOMMU at this
> point.
> > > > > > > > >
> > > > > > > > > You disallow the PASID to be virtualized here. What's
> > > > > > > > > more, such a PASID passthrough has security implications.
> > > > > > > > >
> > > > > > > > No. virtio spec is not disallowing. At least for sure,
> > > > > > > > this series is not the
> > > > > one.
> > > > > > > > My main point is, virtio device interface will not be the
> > > > > > > > source of hypercall to
> > > > > > > program IOMMU in the hypervisor.
> > > > > > > > It is something to be done by IOMMU side.
> > > > > > >
> > > > > > > So unless vPASID can be used by the hardware you need to
> > > > > > > trap the mapping from a PASID to a virtqueue. Then you need
> > > > > > > virtio specific
> > > > > knowledge.
> > > > > > >
> > > > > > vPASID by hardware is unlikely to be used by hw PCI EP devices
> > > > > > at least in any
> > > > > near term future.
> > > > > > This requires either vPASID to pPASID table in device or in IOMMU.
> > > > >
> > > > > So we are on the same page.
> > > > >
> > > > > Claiming a method that can only work for passthrough or
> > > > > emulation is not
> > > good.
> > > > > We all know virtualization is passthrough + emulation.
> > > > Again, I agree but I wont generalize it here.
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > Again, we are talking about different things, I've tried
> > > > > > > > > to show you that there are cases that passthrough can't
> > > > > > > > > work but if you think the only way for migration is to
> > > > > > > > > use passthrough in every case, you will
> > > > > > > probably fail.
> > > > > > > > >
> > > > > > > > I didn't say only way for migration is passthrough.
> > > > > > > > Passthrough is clearly one way.
> > > > > > > > Other ways may be possible.
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > There are works ongoing to make vPASID
> > > > > > > > > > > > > > > > > work for the guest like
> > > > > > > > > > > vSVA.
> > > > > > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Great, you find another limitation of
> > > > > > > > > > > > > > > "passthrough" by
> > > > > yourself.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > No. it is not the limitation it is just the
> > > > > > > > > > > > > > way it does not need complex SVA to
> > > > > > > > > > > > > split the device for unrelated usage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > > > > > > > > >
> > > > > > > > > > > > He he, I am not limiting, again misunderstanding
> > > > > > > > > > > > or wrong
> > > > > attribution.
> > > > > > > > > > > > I explained that hypervisor for passthrough does not need
> SVA.
> > > > > > > > > > > > Guest can do anything it wants from the guest OS
> > > > > > > > > > > > with the member
> > > > > > > > > device.
> > > > > > > > > > >
> > > > > > > > > > > Ok, so the point stills, see above.
> > > > > > > > > >
> > > > > > > > > > I don’t think so. The guest owns its PASID space
> > > > > > > > >
> > > > > > > > > Again, vPASID to PASID can't be done hardware unless I
> > > > > > > > > miss some recent features of IOMMUs.
> > > > > > > > >
> > > > > > > > Cpu vendors have different way of doing vPASID to pPASID.
> > > > > > >
> > > > > > > At least for the current version of major IOMMU vendors,
> > > > > > > such translation (aka PASID remapping) is not implemented in
> > > > > > > the hardware so it needs to be trapped first.
> > > > > > >
> > > > > > Right. So it is really far in future, atleast few years away.
> > > > > >
> > > > > > > > It is still an early space for virtio.
> > > > > > > >
> > > > > > > > > > and directly communicates like any other device attribute.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Each passthrough device has PASID from its
> > > > > > > > > > > > > > > > own space fully managed by the
> > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > Some cpu required vPASID and SIOV is not
> > > > > > > > > > > > > > > > going this way
> > > > > > > anmore.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Then how to migrate? Invent a full set of
> > > > > > > > > > > > > > > something else through another giant series
> > > > > > > > > > > > > > > like this to migrate to the SIOV
> > > > > > > thing?
> > > > > > > > > > > > > > > That's a mess for
> > > > > > > > > > > > > sure.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > SIOV will for sure reuse most or all parts of
> > > > > > > > > > > > > > this work, almost entirely
> > > > > > > > > as_is.
> > > > > > > > > > > > > > vPASID is cpu/platform specific things not
> > > > > > > > > > > > > > part of the SIOV
> > > > > devices.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > If at all it is done, it will be done
> > > > > > > > > > > > > > > > > > from the guest by the driver using
> > > > > > > > > > > > > > > > > > virtio
> > > > > > > > > > > > > > > > > interface.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Then you need to trap. Such things
> > > > > > > > > > > > > > > > > couldn't be passed through to guests
> > > > > > > > > > > > > > > directly.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Only PASID capability is trapped. PASID
> > > > > > > > > > > > > > > > allocation and usage is directly from
> > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > How can you achieve this? Assigning a PAISD
> > > > > > > > > > > > > > > to a device is completely
> > > > > > > > > > > > > > > device(virtio) specific. How can you use a
> > > > > > > > > > > > > > > general layer without the knowledge of virtio to trap
> that?
> > > > > > > > > > > > > > When one wants to map vPASID to pPASID a
> > > > > > > > > > > > > > platform needs to be
> > > > > > > > > > > involved.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm not talking about how to map vPASID to
> > > > > > > > > > > > > pPASID, it's out of the scope of virtio. I'm
> > > > > > > > > > > > > talking about assigning a vPASID to a specific
> > > > > > > > > > > > > virtqueue or other virtio function in the
> > > > > guest.
> > > > > > > > > > > > >
> > > > > > > > > > > > That can be done in the guest. The key is guest
> > > > > > > > > > > > wont know that it is dealing
> > > > > > > > > > > with vPASID.
> > > > > > > > > > > > It will follow the same principle from your paper
> > > > > > > > > > > > of equivalency, where virtio
> > > > > > > > > > > software layer will assign PASID to VQ and
> > > > > > > > > > > communicate to
> > > device.
> > > > > > > > > > > >
> > > > > > > > > > > > Anyway, all of this just digression from current series.
> > > > > > > > > > >
> > > > > > > > > > > It's not, as you mention that only MSI-X is trapped,
> > > > > > > > > > > I give you another
> > > > > > > one.
> > > > > > > > > > >
> > > > > > > > > > PASID access from the guest to be done fully by the guest
> IOMMU.
> > > > > > > > > > Not by virtio devices.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > You need a virtio specific queue or capability
> > > > > > > > > > > > > to assign a PASID to a specific virtqueue, and
> > > > > > > > > > > > > that can't be done without trapping and without
> > > > > > > > > > > > > > virtio specific
> > > knowledge.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I disagree. PASID assignment to a virtqueue in
> > > > > > > > > > > > future from guest virtio driver to
> > > > > > > > > > > device is uniform method.
> > > > > > > > > > > > Whether its PF assigning PASID to VQ of self, Or
> > > > > > > > > > > > VF driver in the guest assigning PASID to VQ.
> > > > > > > > > > > >
> > > > > > > > > > > > All same.
> > > > > > > > > > > > Only IOMMU layer hypercalls will know how to deal
> > > > > > > > > > > > with PASID assignment at
> > > > > > > > > > > platform layer to setup the domain etc table.
> > > > > > > > > > > >
> > > > > > > > > > > > And this is way beyond our device migration discussion.
> > > > > > > > > > > > By any means, if you were implying that somehow vq
> > > > > > > > > > > > to PASID assignment
> > > > > > > > > > > _may_ need trap+emulation, hence whole device
> > > > > > > > > > > migration to depend on some
> > > > > > > > > > > > trap+emulation, then surely, I do not agree to it.
> > > > > > > > > > >
> > > > > > > > > > > See above.
> > > > > > > > > > >
> > > > > > > > > > Yeah, I disagree to such implying.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD
> > > > > > > > > > > > isolating the guest process and
> > > > > > > > > > > all of that just works on efficiency and equivalence
> > > > > > > > > > > principle already for a decade now without any
> trap+emulation.
> > > > > > > > > > > >
> > > > > > > > > > > > > > When virtio passthrough device is in guest, it
> > > > > > > > > > > > > > has all its PASID
> > > > > > > > > accessible.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > All these is large deviation from current
> > > > > > > > > > > > > > discussion of this series, so I will keep
> > > > > > > > > > > > > it short.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Regardless it is not relevant to
> > > > > > > > > > > > > > > > passthrough mode as PASID is yet another
> > > > > > > > > > > > > > > resource.
> > > > > > > > > > > > > > > > And for some cpu if it is trapped, it is
> > > > > > > > > > > > > > > > generic layer, that does not require
> > > > > > > > > > > > > > > > virtio
> > > > > > > > > > > > > > > involvement.
> > > > > > > > > > > > > > > > So virtio interface asking to trap
> > > > > > > > > > > > > > > > something because generic facility has
> > > > > > > > > > > > > > > > done
> > > > > > > > > > > > > > > in not the approach.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This misses the point of PASID. How to use
> > > > > > > > > > > > > > > PASID is totally device
> > > > > > > > > > > specific.
> > > > > > > > > > > > > > Sure, and how to virtualize vPASID/pPASID is
> > > > > > > > > > > > > > platform specific as single PASID
> > > > > > > > > > > > > can be used by multiple devices and process.
> > > > > > > > > > > > >
> > > > > > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Capabilities of #2 is generic across
> > > > > > > > > > > > > > > > > > all pci devices, so it will be handled
> > > > > > > > > > > > > > > > > > by the
> > > > > > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > > > > > ATS/PRI cap is also generic manner
> > > > > > > > > > > > > > > > > > handled by the HV and PCI
> > > > > > > > > > > device.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > No, ATS/PRI requires the cooperation
> > > > > > > > > > > > > > > > > from the
> > > vIOMMU.
> > > > > > > > > > > > > > > > > You can simply do ATS/PRI passthrough
> > > > > > > > > > > > > > > > > but with an emulated
> > > > > > > > > vIOMMU.
> > > > > > > > > > > > > > > > And that is not the reason for virtio
> > > > > > > > > > > > > > > > device to build
> > > > > > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > > > > > passthrough member devices.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > vIOMMU is emulated by hypervisor with a PRI
> > > > > > > > > > > > > > > queue,
> > > > > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Shouldn't it arrive at platform IOMMU first? The
> > > > > > > > > > > > > path should be PRI
> > > > > > > > > > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU
> > > > > > > > > > > > > -> PRI
> > > > > > > > > > > > > -> -> guest
> > > > > > > IOMMU.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Above sequence seems right.
> > > > > > > > > > > >
> > > > > > > > > > > > > And things will be more complicated when (v)PASID is
> used.
> > > > > > > > > > > > > So you can't simply let PRI go directly to the
> > > > > > > > > > > > > guest with the current
> > > > > > > > > architecture.
> > > > > > > > > > > > >
> > > > > > > > > > > > In current architecture of the pci VF, PRI does
> > > > > > > > > > > > not go directly to the
> > > > > > > guest.
> > > > > > > > > > > > (and that is not reason to trap and emulate other things).
> > > > > > > > > > >
> > > > > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we will
> > > > > > > > > > > probably trap other things in the future like PASID
> assignment.
> > > > > > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > > > > > >
> > > > > > > > > It's not about the capability, it's about the whole
> > > > > > > > > process of PRI request handling. We've agreed that the
> > > > > > > > > PRI request needs to be trapped by the hypervisor and
> > > > > > > > > then delivered to the
> > > vIOMMU.
> > > > > > > > >
> > > > > > > >
> > > > > > > > > > Trap+emulation done in generic manner without
> > > > > > > > > > Trap+involving virtio or other
> > > > > > > > > device types.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > how can you pass through a hardware PRI
> > > > > > > > > > > > > > > request to a guest directly without trapping
> > > > > > > > > > > > > > > it
> > > > > > > > > > > then?
> > > > > > > > > > > > > > > What's more, PCIE allows the PRI to be done
> > > > > > > > > > > > > > > in a vendor
> > > > > > > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > > > > > for virtio?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > > > > > Do you have a reference to the ECN that
> > > > > > > > > > > > > > enables vendor specific way of PRI? I
> > > > > > > > > > > > > would like to read it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I mean it doesn't forbid us to build a virtio
> > > > > > > > > > > > > specific interface for I/O page fault report and recovery.
> > > > > > > > > > > > >
> > > > > > > > > > > > So PRI of PCI does not allow. It is ODP kind of
> > > > > > > > > > > > technique you meant
> > > > > > > above.
> > > > > > > > > > > > Yes one can build.
> > > > > > > > > > > > Ok. unrelated to device migration, so I will park
> > > > > > > > > > > > this good discussion for
> > > > > > > > > later.
> > > > > > > > > > >
> > > > > > > > > > > That's fine.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > This will be very good to eliminate IOMMU PRI
> limitations.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Probably.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > PRI will directly go to the guest driver, and
> > > > > > > > > > > > > > guest would interact with IOMMU
> > > > > > > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > > > > >
> > > > > > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > > > > > But again these PCI-SIG extensions of PASID are
> > > > > > > > > > > > not related to device
> > > > > > > > > > > > migration, so I am deferring it.
> > > > > > > > > > > >
> > > > > > > > > > > > > > For PRI in vendor specific way needs a
> > > > > > > > > > > > > > separate discussion. It is not related to
> > > > > > > > > > > > > live migration.
> > > > > > > > > > > > >
> > > > > > > > > > > > > PRI itself is not related. But the point is, you
> > > > > > > > > > > > > can't simply pass through ATS/PRI now.
> > > > > > > > > > > > >
> > > > > > > > > > > > Ah ok. the whole 4K PCI config space where ATS/PRI
> > > > > > > > > > > > capabilities are located
> > > > > > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > > > > > So?
> > > > > > > > > > > > > So do we start emulating virtio interfaces too for
> passthrough?
> > > > > > > > > > > > No.
> > > > > > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > > > > > Sure why not?
> > > > > > > > > > >
> > > > > > > > > > > Then let's not limit your proposal to be used by "passthrough"
> > > only?
> > > > > > > > > > One can possibly build some variant of the existing
> > > > > > > > > > virtio member device
> > > > > > > > > using same owner and member scheme.
> > > > > > > > >
> > > > > > > > > It's not about the member/owner, it's about e.g whether
> > > > > > > > > the hypervisor can trap and emulate.
> > > > > > > > >
> > > > > > > > > I've pointed out that what you invent here is actually a
> > > > > > > > > partial new transport, for example, a hypervisor can
> > > > > > > > > trap and use things like device context in PF to bypass
> > > > > > > > > the registers in VF. This is the idea of
> > > > > > > transport commands/q.
> > > > > > > > >
> > > > > > > > I will not mix transport commands which are mainly useful
> > > > > > > > for actual device
> > > > > > > operation for SIOV only for backward compatibility that too
> optionally.
> > > > > > > > One may still choose to have virtio common and device
> > > > > > > > config in MMIO
> > > > > > > ofcourse at lower scale.
> > > > > > > >
> > > > > > > > Anyway, mixing migration context with actual SIOV specific
> > > > > > > > thing is not correct
> > > > > > > as device context is read/write incremental values.
> > > > > > >
> > > > > > > SIOV is transport level stuff, the transport virtqueue is
> > > > > > > designed in a way that is general enough to cover it. Let's
> > > > > > > not shift
> > > concepts.
> > > > > > >
> > > > > > Such TVQ is only for backward compatible vPCI composition.
> > > > > > For ground up work such TVQ must not be done through the owner
> > > device.
> > > > >
> > > > > That's the idea actually.
> > > > >
> > > > > > Each SIOV device to have its own channel to communicate
> > > > > > directly to the
> > > > > device.
> > > > > >
> > > > > > > One thing that you ignore is that, hypervisor can use what
> > > > > > > you invented as a transport for VF, no?
> > > > > > >
> > > > > > No. by design,
> > > > >
> > > > > It works like hypervisor traps the virtio config and forwards it
> > > > > to admin virtqueue and starts the device via device context.
> > > > It needs more granular support than the management framework of
> > > > device
> > > context.
> > >
> > > It doesn't otherwise it is a design defect as you can't recover the
> > > device context in the destination.
> > >
> > > Let me give you an example:
> > >
> > > 1) in the case of live migration, dst receive migration byte flows
> > > and convert them into device context
> > > 2) in the case of transporting, hypervisor traps virtio config and
> > > convert them into the device context
> > >
> > > I don't see anything different in this case. Or can you give me an example?
> > In #1 dst received byte flows one or multiple times.
> 
> How can this be different?
> 
> Transport can also receive initial state incrementally.
> 
Transport is just a simple register RW interface without any caching layer in between.
More below.
> > And byte flows can be large.
> 
> So when doing transport, it is not that large, that's it. If it can work with large
> byte flow, why can't it work for small?
Write context can be used (abused) for a different purpose.
Read cannot, because it is meant to be incremental.
One can invent a cheap command to read it.


> 
> > So it does not always contain everything. It only contains the new delta of the
> device context.
> 
> Isn't it just how current PCI transport does?
> 
No. The PCI transport has an explicit API between device and driver to read or write a value at a specific offset.

> Guest configure the following one by one:
> 
> 1) vq size
> 2) vq addresses
> 3) MSI-X
> 
> etc?
> 
I think you interpreted "incremental" differently than I described.
In the device context read, the incremental is:

If the hypervisor driver has read the device context twice, the second read won't return any new data if nothing changed.
For example, if the RSS configuration didn't change between two reads, the second read won't return the TLV for the RSS context.

For transport, by contrast, when the guest asks, the device must return the value regardless of whether it changed.

So the notion of incremental is not by address, but by value.
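
To make the distinction concrete, here is a minimal sketch in Python. It is only a toy model under my own assumptions: the class name, method names and TLV keys ("rss", "vq0_avail_idx") are hypothetical and are not the wire format or command names of this series. It contrasts a device context read that returns only changed TLVs with a transport-style read that always returns the current value.

```python
class DeviceContextModel:
    """Toy model of a member device's context (hypothetical, not the spec format)."""

    def __init__(self):
        self._tlvs = {}      # current value of every TLV, keyed by a type name
        self._reported = {}  # value of each TLV as last returned to the reader

    def update(self, tlv_type, value):
        """The device state changes (e.g. the guest driver reconfigured it)."""
        self._tlvs[tlv_type] = value

    def read_incremental(self):
        """Device-context style read: return only TLVs changed since the last read."""
        delta = {
            t: v
            for t, v in self._tlvs.items()
            if t not in self._reported or self._reported[t] != v
        }
        self._reported.update(delta)
        return delta

    def read_register(self, tlv_type):
        """Transport style read: always return the current value when asked."""
        return self._tlvs[tlv_type]


ctx = DeviceContextModel()
ctx.update("rss", b"\x01")
ctx.update("vq0_avail_idx", 5)

first = ctx.read_incremental()    # everything is new on the first read
ctx.update("vq0_avail_idx", 9)    # only the avail index moves
second = ctx.read_incremental()   # RSS TLV is absent: its value did not change
third = ctx.read_incremental()    # nothing changed at all
rss_now = ctx.read_register("rss")  # explicit read still returns the value
```

In this sketch the second read omits the RSS TLV because its value is unchanged, while the register-style read returns it whenever it is asked.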

> > For example, VQ configuration is exchanged once between src and dst.
> > But VQ avail and used index may be updated multiple times.
> 
> If it can work with multiple times of updating, why can't it work if we just
> update it once?
Functionally it can work.
Performance-wise, one does not want to update multiple times unless there is a change.

Read, as explained above, is not meant to return the same content again.

> 
> > So here hypervisor do not want to read any specific set of fields and
> hypervisor is not parsing them either.
> > It is just a byte stream for it.
> 
> Firstly, spec must define the device context format, so hypervisor can
> understand which byte is what otherwise you can't maintain migration
> compatibility.
Device context is defined already in the latest version.

> Secondly, you can't mandate how the hypervisor is written.
> 
> >
> > As opposed to that, in case of transport, the guest explicitly asks to read or
> write specific bytes.
> > Therefore, it is not incremental.
> 
> I'm totally lost. Which part of the transport is not incremental?
> 
> >
> > Additionally, if hypervisor has put the trap on virtio config, and
> > because the memory device already has the interface for virtio config,
> >
> > Hypervisor can directly write/read from the virtual config to the member's
> config space, without going through the device context, right?
> 
> If it can do it or it can choose to not. I don't see how it is related to the
> discussion here.
>
It is. I don't see the point of the hypervisor not using the native interface provided by the member device.

> >
> > >
> > > >
> > > > >
> > > > > > it is not good idea to overload management commands with
> > > > > > actual run time
> > > > > guest commands.
> > > > > > The device context read writes are largely for incremental updates.
> > > > >
> > > > > It doesn't matter if it is incremental or not but
> > > > >
> > > > It does because you want different functionality only for purpose
> > > > of backward
> > > compatibility.
> > > > That also if the device does not offer them as portion of MMIO BAR.
> > >
> > > I don't see how it is related to the "incremental part".
> > >
> > > >
> > > > > 1) the function is there
> > > > > 2) hypervisor can use that function if they want and virtio
> > > > > (spec) can't forbid that
> > > > >
> > > > It is not about forbidding or supporting.
> > > > Its about what functionality to use for management plane and guest
> plane.
> > > > Both have different needs.
> > >
> > > People can have different views, there's nothing we can prevent a
> > > hypervisor from using it as a transport as far as I can see.
> > For device context write command, it can be used (or probably abused) to do
> write but I fail to see why to use it.
> 
> The function is there, you can't prevent people from doing that.
>
One can always mess things up oneself. :)
It is not prevented. It is just not the right way to use the interface.
 
> > Because member device already has the interface to do config read/write and
> it is accessible to the hypervisor.
> 
> Well, it looks self-contradictory again. Are you saying another set of commands
> that is similar to device context is needed for non-PCI transport?
>
All this non-PCI transport discussion is just meaningless.
Let MMIO bring in the concept of a member device; at that point there will be something that makes sense to discuss.
PCI SIOV is also a PCI device in the end.
 
> >
> > The read as_is using device context cannot be done because the caller is not
> explicitly asking what to read.
> > And the interface does not have it, because member device has it.
> >
> > So lets find the need if incremental bit is needed in the device_Context read
> command or not or a bits to ask explicitly what to read optionally.
> >
> > >
> > > >
> > > > > >
> > > > > > For VF driver it has own direct channel via its own BAR to
> > > > > > talk to the
> > > device.
> > > > > So no need to transport via PF.
> > > > > > For SIOV for backward compat vPCI composition, it may be needed.
> > > > > > Hard to say, if that can be memory mapped as well on the BAR of the
> PF.
> > > > > > We have seen one device supporting it outside of the virtio.
> > > > > > For scale anyway, one needs to use the device own cvq for
> > > > > > complex
> > > > > configuration.
> > > > >
> > > > > That's the idea but I meant your current proposal overlaps those
> functions.
> > > > >
> > > > Not really. One can have simple virtio config space access
> > > > read/write
> > > functionality, in addition to what is done here.
> > > > And that is still fine. One is doing proxying for guest.
> > > > Management plane is doing more than just register proxy.
> > >
> > > See above, let's figure out whether it is possible as a transport first then.
> > >
> > Right. lets figure out.
> >
> > I would still promote to not mix management command with transport
> command.
> 
> It's not a mixing, it's just because they are functional equivalents.
> 
It is not.
I clarified the fundamental difference between the two.
One is explicit read and write.
The other returns read data only on change.
For write, it is an explicit set, and it does not take effect until the mode is changed back to active.
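
The write side of this difference can be sketched the same way. Again this is only an illustration under my own assumptions: the class, the method names and the "stopped" mode name are hypothetical, and the only behaviour taken from the discussion is that a transport write is explicit and immediate, while a device context write does not take effect until the mode is changed back to active.

```python
class MemberDeviceModel:
    """Toy model of the two write paths (hypothetical names, not the spec API)."""

    def __init__(self):
        self.mode = "active"
        self.config = {}    # state the running device actually uses
        self._staged = {}   # context writes waiting for the mode change

    def transport_write(self, field, value):
        """Guest-driven register write: explicit and applied immediately."""
        self.config[field] = value

    def context_write(self, field, value):
        """Management-plane write: staged, not yet visible in the live state."""
        self._staged[field] = value

    def set_mode(self, mode):
        """Apply staged context when the device goes back to active."""
        if mode == "active" and self.mode != "active":
            self.config.update(self._staged)
            self._staged.clear()
        self.mode = mode


dev = MemberDeviceModel()
dev.transport_write("vq0_size", 256)        # visible right away

dev.set_mode("stopped")                     # assumed non-active mode name
dev.context_write("vq0_size", 1024)
before_resume = dev.config["vq0_size"]      # still the old value

dev.set_mode("active")
after_resume = dev.config["vq0_size"]       # staged value is now in effect
```

The point of the sketch is only the ordering: the staged value becomes visible at the transition back to active, not at the time of the write.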

> > Commands are cheap in nature. For transport if needed, they can be explicit
> commands.
> 
> It will be a partial duplication of what is being proposed here.

There is always some overlap between management plane (hypervisor set/get) and control plane (guest driver get/set).
> 
> Thanks
> 
> 
> 
> >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > If for that is some admin commands are missing, may be
> > > > > > > > > > one can add
> > > > > > > them.
> > > > > > > > >
> > > > > > > > > I would then build the device context commands on top of
> > > > > > > > > the transport commands/q, then it would be complete.
> > > > > > > > >
> > > > > > > > > > No need to step on toes of use cases as they are different...
> > > > > > > > > >
> > > > > > > > > > > I've shown you that
> > > > > > > > > > >
> > > > > > > > > > > 1) you can't easily say you can pass through all the
> > > > > > > > > > > virtio facilities
> > > > > > > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > > > > > > >
> > > > > > > > > > It is not, it is well defined in v3, v2.
> > > > > > > > > > One can continue to argue and keep defining the
> > > > > > > > > > variant and still call it data
> > > > > > > > > path acceleration and then claim it as passthrough ...
> > > > > > > > > > But I won't debate this anymore as its just
> > > > > > > > > > non-technical aspects of least
> > > > > > > > > interest.
> > > > > > > > >
> > > > > > > > > You use this terminology in the spec which is all about
> > > > > > > > > technical, and you think how to define it is a matter of
> > > > > > > > > non-technical. This is self-contradictory. If you fail,
> > > > > > > > > it probably means it's
> > > > > ambiguous.
> > > > > > > > > Let's don't use that terminology.
> > > > > > > > >
> > > > > > > > What it means is described in theory of operation.
> > > > > > > >
> > > > > > > > > > We have technical tasks and more improved specs to
> > > > > > > > > > update going
> > > > > > > forward.
> > > > > > > > >
> > > > > > > > > It's a burden to do the synchronization.
> > > > > > > > We have discussed this.
> > > > > > > > In current proposed the member device is not bifurcated,
> > > > > > >
> > > > > > > It is. Part of the functions were carried via the PCI
> > > > > > > interface, some are carried via owner. You end up with two
> > > > > > > drivers to drive the
> > > > > devices.
> > > > > > >
> > > > > > Nop.
> > > > > > All admin work of device migration is carried out via the owner
> device.
> > > > > > All guest triggered work is carried out using VF itself.
> > > > >
> > > > > Guests don't (or can't) care about how the hypervisor is structured.
> > > > For passthrough mode, it just cannot be structured inside the VF.
> > >
> > > Well, again, we are talking about different things.
> > >
> > > >
> > > > > So we're discussing the view of device, member devices needs to
> > > > > server for
> > > > >
> > > > > 1) request from the transport (it's guest in your context)
> > > > > 2) request from the owner
> > > >
> > > > Doing #2 of the owner on the member device functionality do not
> > > > work when
> > > hypervisor do not have access to the member device.
> > >
> > > I don't get here, isn't 2) just what we invent for admin commands?
> > > Driver sends commands to the owner, owner forward those requests to
> > > the member?
> > I am lost with the term "driver" without notion of guest/hypervisor prefix.
> >
> > In one model,
> > Member device does everything through its native interface = virtio config
> and device space, cvq, data vqs etc.
> > Here member device do not forward anything to its owner.
> >
> > The live migration hypervisor driver who has the knowledge of live migration
> flow, accesses the owner device and get the side band member's information to
> control it.
> > So member driver do not forward anything here to owner driver.
> >



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-15 17:39                                                                                     ` Parav Pandit
@ 2023-11-16  4:20                                                                                       ` Jason Wang
  2023-11-16  5:28                                                                                         ` Parav Pandit
  2023-11-17 10:08                                                                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-16  4:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 13, 2023 9:03 AM
> >
> > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 7, 2023 9:35 AM
> > > >
> > > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > > >
> > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > > > > > >
> > > > > > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit
> > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > > <virtio-comment@lists.oasis- open.org> On Behalf
> > > > > > > > > > > > > > Of Jason Wang
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit
> > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit
> > > > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59
> > > > > > > > > > > > > > > > > > AM
> > > > > > > > > > > > > > > > > > > For passthrough PASID assignment vq is not
> > needed.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > > > > > Because for passthrough, the hypervisor is
> > > > > > > > > > > > > > > > > not involved in dealing with VQ at
> > > > > > > > > > > > > > > > all.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Ok, so if I understand correctly, you are
> > > > > > > > > > > > > > > > saying your design can't work for the case of PASID
> > assignment.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > No. PASID assignment will happen from the
> > > > > > > > > > > > > > > guest for its own use and device
> > > > > > > > > > > > > > migration will just work fine because device
> > > > > > > > > > > > > > context will capture
> > > > > > this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It's not about device context. We're discussing
> > > > > > > > > > > > > > "passthrough",
> > > > > > no?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > Not sure, we are discussing same.
> > > > > > > > > > > > > A member device is passthrough to the guest,
> > > > > > > > > > > > > dealing with its own PASIDs and
> > > > > > > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > > > > > > So VQ context captured by the hypervisor, will
> > > > > > > > > > > > > have some PASID attached to
> > > > > > > > > > > > this VQ.
> > > > > > > > > > > > > Device context will be updated.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > You want all virtio stuff to be "passthrough",
> > > > > > > > > > > > > > but assigning a PASID to a specific virtqueue in
> > > > > > > > > > > > > > the guest must be
> > > > > > trapped.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > No. PASID assignment to a specific virtqueue in
> > > > > > > > > > > > > the guest must go directly
> > > > > > > > > > > > from guest to device.
> > > > > > > > > > > >
> > > > > > > > > > > > This works like setting CR3, you can't simply let it
> > > > > > > > > > > > go from guest to
> > > > > > host.
> > > > > > > > > > > >
> > > > > > > > > > > > Host IOMMU driver needs to know the PASID to program
> > > > > > > > > > > > the IO page tables correctly.
> > > > > > > > > > > >
> > > > > > > > > > > This will be done by the IOMMU.
> > > > > > > > > > >
> > > > > > > > > > > > > When guest iommu may need to communicate anything
> > > > > > > > > > > > > for this PASID, it will
> > > > > > > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > > > > > > >
> > > > > > > > > > > > Let's say using PASID X for queue 0, this knowledge
> > > > > > > > > > > > is beyond the IOMMU scope but belongs to virtio. Or
> > > > > > > > > > > > please explain how it can work when it goes directly
> > > > > > > > > > > > from guest to
> > > > device.
> > > > > > > > > > > >
> > > > > > > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > > > > > > >
> > > > > > > > > > It has one.
> > > > > > > > > >
> > > > > > > > > > > For ok for theory sake it is there.
> > > > > > > > > > >
> > > > > > > > > > > Virtio driver will assign the PASID directly from
> > > > > > > > > > > guest driver to device using a
> > > > > > > > > > create_vq(pasid=X) command.
> > > > > > > > > > > Same process is somehow attached the PASID by the guest OS.
> > > > > > > > > > > The whole PASID range is known to the hypervisor when
> > > > > > > > > > > the device is handed
> > > > > > > > > > over to the guest VM.
> > > > > > > > > >
> > > > > > > > > > How can it know?
> > > > > > > > > >
> > > > > > > > > > > So PASID mapping is setup by the hypervisor IOMMU at this
> > point.
> > > > > > > > > >
> > > > > > > > > > You disallow the PASID to be virtualized here. What's
> > > > > > > > > > more, such a PASID passthrough has security implications.
> > > > > > > > > >
> > > > > > > > > No. virtio spec is not disallowing. At least for sure,
> > > > > > > > > this series is not the
> > > > > > one.
> > > > > > > > > My main point is, virtio device interface will not be the
> > > > > > > > > source of hypercall to
> > > > > > > > program IOMMU in the hypervisor.
> > > > > > > > > It is something to be done by IOMMU side.
> > > > > > > >
> > > > > > > > So unless vPASID can be used by the hardware you need to
> > > > > > > > trap the mapping from a PASID to a virtqueue. Then you need
> > > > > > > > virtio specific
> > > > > > knowledge.
> > > > > > > >
> > > > > > > vPASID by hardware is unlikely to be used by hw PCI EP devices
> > > > > > > at least in any
> > > > > > near term future.
> > > > > > > This requires either vPASID to pPASID table in device or in IOMMU.
> > > > > >
> > > > > > So we are on the same page.
> > > > > >
> > > > > > Claiming a method that can only work for passthrough or
> > > > > > emulation is not
> > > > good.
> > > > > > We all know virtualization is passthrough + emulation.
> > > > > > Again, I agree but I won't generalize it here.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > Again, we are talking about different things, I've tried
> > > > > > > > > > to show you that there are cases that passthrough can't
> > > > > > > > > > work but if you think the only way for migration is to
> > > > > > > > > > use passthrough in every case, you will
> > > > > > > > probably fail.
> > > > > > > > > >
> > > > > > > > > I didn't say only way for migration is passthrough.
> > > > > > > > > Passthrough is clearly one way.
> > > > > > > > > Other ways may be possible.
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > There are works ongoing to make vPASID
> > > > > > > > > > > > > > > > > > work for the guest like
> > > > > > > > > > > > vSVA.
> > > > > > > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > > > > > > > > Passthrough does not run like SVA.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Great, you find another limitation of
> > > > > > > > > > > > > > > > "passthrough" by
> > > > > > yourself.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > No. it is not the limitation it is just the
> > > > > > > > > > > > > > > way it does not need complex SVA to
> > > > > > > > > > > > > > split the device for unrelated usage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > He he, I am not limiting, again misunderstanding
> > > > > > > > > > > > > or wrong
> > > > > > attribution.
> > > > > > > > > > > > > I explained that hypervisor for passthrough does not need
> > SVA.
> > > > > > > > > > > > > Guest can do anything it wants from the guest OS
> > > > > > > > > > > > > with the member
> > > > > > > > > > device.
> > > > > > > > > > > >
> > > > > > > > > > > > > Ok, so the point still stands, see above.
> > > > > > > > > > >
> > > > > > > > > > > I don’t think so. The guest owns its PASID space
> > > > > > > > > >
> > > > > > > > > > > Again, vPASID to PASID can't be done in hardware unless I
> > > > > > > > > > miss some recent features of IOMMUs.
> > > > > > > > > >
> > > > > > > > > > CPU vendors have different ways of doing vPASID to pPASID.
> > > > > > > >
> > > > > > > > At least for the current version of major IOMMU vendors,
> > > > > > > > such translation (aka PASID remapping) is not implemented in
> > > > > > > > the hardware so it needs to be trapped first.
> > > > > > > >
> > > > > > > > Right. So it is really far in the future, at least a few years away.
> > > > > > >
> > > > > > > > > It is still an early space for virtio.
> > > > > > > > >
> > > > > > > > > > > and directly communicates like any other device attribute.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Each passthrough device has PASID from its
> > > > > > > > > > > > > > > > > own space fully managed by the
> > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > > > Some CPUs required vPASID, and SIOV is not
> > > > > > > > > > > > > > > > > > going this way anymore.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Then how to migrate? Invent a full set of
> > > > > > > > > > > > > > > > something else through another giant series
> > > > > > > > > > > > > > > > like this to migrate to the SIOV
> > > > > > > > thing?
> > > > > > > > > > > > > > > > That's a mess for
> > > > > > > > > > > > > > sure.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > SIOV will for sure reuse most or all parts of
> > > > > > > > > > > > > > > this work, almost entirely
> > > > > > > > > > as_is.
> > > > > > > > > > > > > > > vPASID is cpu/platform specific things not
> > > > > > > > > > > > > > > part of the SIOV
> > > > > > devices.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > If at all it is done, it will be done
> > > > > > > > > > > > > > > > > > > from the guest by the driver using
> > > > > > > > > > > > > > > > > > > virtio
> > > > > > > > > > > > > > > > > > interface.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Then you need to trap. Such things
> > > > > > > > > > > > > > > > > > couldn't be passed through to guests
> > > > > > > > > > > > > > > > directly.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Only PASID capability is trapped. PASID
> > > > > > > > > > > > > > > > > allocation and usage is directly from
> > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > How can you achieve this? Assigning a PASID
> > > > > > > > > > > > > > > > to a device is completely
> > > > > > > > > > > > > > > > device(virtio) specific. How can you use a
> > > > > > > > > > > > > > > > general layer without the knowledge of virtio to trap
> > that?
> > > > > > > > > > > > > > > When one wants to map vPASID to pPASID a
> > > > > > > > > > > > > > > platform needs to be
> > > > > > > > > > > > involved.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm not talking about how to map vPASID to
> > > > > > > > > > > > > > pPASID, it's out of the scope of virtio. I'm
> > > > > > > > > > > > > > talking about assigning a vPASID to a specific
> > > > > > > > > > > > > > virtqueue or other virtio function in the
> > > > > > guest.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > That can be done in the guest. The key is guest
> > > > > > > > > > > > > wont know that it is dealing
> > > > > > > > > > > > with vPASID.
> > > > > > > > > > > > > It will follow the same principle from your paper
> > > > > > > > > > > > > of equivalency, where virtio
> > > > > > > > > > > > software layer will assign PASID to VQ and
> > > > > > > > > > > > communicate to
> > > > device.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Anyway, all of this just digression from current series.
> > > > > > > > > > > >
> > > > > > > > > > > > It's not, as you mention that only MSI-X is trapped,
> > > > > > > > > > > > I give you another
> > > > > > > > one.
> > > > > > > > > > > >
> > > > > > > > > > > PASID access from the guest to be done fully by the guest
> > IOMMU.
> > > > > > > > > > > Not by virtio devices.
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > You need a virtio specific queue or capability
> > > > > > > > > > > > > > to assign a PASID to a specific virtqueue, and
> > > > > > > > > > > > > > that can't be done without trapping and without
> > > > > > > > > > > > > > > virtio specific
> > > > knowledge.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I disagree. PASID assignment to a virtqueue in
> > > > > > > > > > > > > future from guest virtio driver to
> > > > > > > > > > > > device is uniform method.
> > > > > > > > > > > > > Whether its PF assigning PASID to VQ of self, Or
> > > > > > > > > > > > > VF driver in the guest assigning PASID to VQ.
> > > > > > > > > > > > >
> > > > > > > > > > > > > All same.
> > > > > > > > > > > > > Only IOMMU layer hypercalls will know how to deal
> > > > > > > > > > > > > with PASID assignment at
> > > > > > > > > > > > platform layer to setup the domain etc table.
> > > > > > > > > > > > >
> > > > > > > > > > > > > And this is way beyond our device migration discussion.
> > > > > > > > > > > > > By any means, if you were implying that somehow vq
> > > > > > > > > > > > > to PASID assignment
> > > > > > > > > > > > _may_ need trap+emulation, hence whole device
> > > > > > > > > > > > migration to depend on some
> > > > > > > > > > > > > trap+emulation, then surely I do not agree with it.
> > > > > > > > > > > >
> > > > > > > > > > > > See above.
> > > > > > > > > > > >
> > > > > > > > > > > > Yeah, I disagree with that implication.
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD
> > > > > > > > > > > > > isolating the guest process and
> > > > > > > > > > > > all of that just works on efficiency and equivalence
> > > > > > > > > > > > principle already for a decade now without any
> > trap+emulation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > When virtio passthrough device is in guest, it
> > > > > > > > > > > > > > > has all its PASID
> > > > > > > > > > accessible.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All these is large deviation from current
> > > > > > > > > > > > > > > discussion of this series, so I will keep
> > > > > > > > > > > > > > it short.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regardless it is not relevant to
> > > > > > > > > > > > > > > > > passthrough mode as PASID is yet another
> > > > > > > > > > > > > > > > resource.
> > > > > > > > > > > > > > > > > And for some cpu if it is trapped, it is
> > > > > > > > > > > > > > > > > generic layer, that does not require
> > > > > > > > > > > > > > > > > virtio
> > > > > > > > > > > > > > > > involvement.
> > > > > > > > > > > > > > > > > So virtio interface asking to trap
> > > > > > > > > > > > > > > > > something because generic facility has
> > > > > > > > > > > > > > > > > done
> > > > > > > > > > > > > > > > in not the approach.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This misses the point of PASID. How to use
> > > > > > > > > > > > > > > > PASID is totally device
> > > > > > > > > > > > specific.
> > > > > > > > > > > > > > > Sure, and how to virtualize vPASID/pPASID is
> > > > > > > > > > > > > > > platform specific as single PASID
> > > > > > > > > > > > > > can be used by multiple devices and process.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Capabilities of #2 is generic across
> > > > > > > > > > > > > > > > > > > all pci devices, so it will be handled
> > > > > > > > > > > > > > > > > > > by the
> > > > > > > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > > > > > > ATS/PRI cap is also generic manner
> > > > > > > > > > > > > > > > > > > handled by the HV and PCI
> > > > > > > > > > > > device.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > No, ATS/PRI requires the cooperation
> > > > > > > > > > > > > > > > > > from the
> > > > vIOMMU.
> > > > > > > > > > > > > > > > > > You can simply do ATS/PRI passthrough
> > > > > > > > > > > > > > > > > > but with an emulated
> > > > > > > > > > vIOMMU.
> > > > > > > > > > > > > > > > > And that is not the reason for virtio
> > > > > > > > > > > > > > > > > device to build
> > > > > > > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > > > > > > passthrough member devices.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > vIOMMU is emulated by hypervisor with a PRI
> > > > > > > > > > > > > > > > queue,
> > > > > > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Shouldn't it arrive at platform IOMMU first? The
> > > > > > > > > > > > > > path should be PRI
> > > > > > > > > > > > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU -> PRI
> > > > > > > > > > > > > > > -> guest IOMMU.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The above sequence seems right.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > And things will be more complicated when (v)PASID is
> > used.
> > > > > > > > > > > > > > So you can't simply let PRI go directly to the
> > > > > > > > > > > > > > guest with the current
> > > > > > > > > > architecture.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > In current architecture of the pci VF, PRI does
> > > > > > > > > > > > > not go directly to the
> > > > > > > > guest.
> > > > > > > > > > > > > (and that is not reason to trap and emulate other things).
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we will
> > > > > > > > > > > > probably trap other things in the future like PASID
> > assignment.
> > > > > > > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > > > > > > >
> > > > > > > > > > It's not about the capability, it's about the whole
> > > > > > > > > > process of PRI request handling. We've agreed that the
> > > > > > > > > > PRI request needs to be trapped by the hypervisor and
> > > > > > > > > > then delivered to the
> > > > vIOMMU.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > > Trap+emulation is done in a generic manner without
> > > > > > > > > > > > involving virtio or other device types.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > how can you pass through a hardware PRI
> > > > > > > > > > > > > > > > request to a guest directly without trapping
> > > > > > > > > > > > > > > > it
> > > > > > > > > > > > then?
> > > > > > > > > > > > > > > > What's more, PCIE allows the PRI to be done
> > > > > > > > > > > > > > > > in a vendor
> > > > > > > > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > > > > > > for virtio?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > > > > > > Do you have a reference to the ECN that
> > > > > > > > > > > > > > > enables vendor specific way of PRI? I
> > > > > > > > > > > > > > would like to read it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I mean it doesn't forbid us to build a virtio
> > > > > > > > > > > > > > specific interface for I/O page fault report and recovery.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > So PRI of PCI does not allow. It is ODP kind of
> > > > > > > > > > > > > technique you meant
> > > > > > > > above.
> > > > > > > > > > > > > Yes one can build.
> > > > > > > > > > > > > Ok. unrelated to device migration, so I will park
> > > > > > > > > > > > > this good discussion for
> > > > > > > > > > later.
> > > > > > > > > > > >
> > > > > > > > > > > > That's fine.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > This will be very good to eliminate IOMMU PRI
> > limitations.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Probably.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > PRI will directly go to the guest driver, and
> > > > > > > > > > > > > > > guest would interact with IOMMU
> > > > > > > > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > > > > > > But again these PCI-SIG extensions of PASID are
> > > > > > > > > > > > > not related to device
> > > > > > > > > > > > > migration, so I am deferring it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > For PRI in vendor specific way needs a
> > > > > > > > > > > > > > > separate discussion. It is not related to
> > > > > > > > > > > > > > live migration.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > PRI itself is not related. But the point is, you
> > > > > > > > > > > > > > can't simply pass through ATS/PRI now.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > Ah ok. the whole 4K PCI config space where ATS/PRI
> > > > > > > > > > > > > capabilities are located
> > > > > > > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > > > > > > So?
> > > > > > > > > > > > > > So do we start emulating virtio interfaces too for
> > passthrough?
> > > > > > > > > > > > > No.
> > > > > > > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > > > > > > Sure why not?
> > > > > > > > > > > >
> > > > > > > > > > > > Then let's not limit your proposal to be used by "passthrough"
> > > > only?
> > > > > > > > > > > One can possibly build some variant of the existing
> > > > > > > > > > > virtio member device
> > > > > > > > > > using same owner and member scheme.
> > > > > > > > > >
> > > > > > > > > > It's not about the member/owner, it's about e.g whether
> > > > > > > > > > the hypervisor can trap and emulate.
> > > > > > > > > >
> > > > > > > > > > I've pointed out that what you invent here is actually a
> > > > > > > > > > partial new transport, for example, a hypervisor can
> > > > > > > > > > trap and use things like device context in PF to bypass
> > > > > > > > > > the registers in VF. This is the idea of
> > > > > > > > transport commands/q.
> > > > > > > > > >
> > > > > > > > > I will not mix transport commands which are mainly useful
> > > > > > > > > for actual device
> > > > > > > > operation for SIOV only for backward compatibility that too
> > optionally.
> > > > > > > > > One may still choose to have virtio common and device
> > > > > > > > > config in MMIO
> > > > > > > > > of course at lower scale.
> > > > > > > > >
> > > > > > > > > Anyway, mixing migration context with actual SIOV specific
> > > > > > > > > thing is not correct
> > > > > > > > as device context is read/write incremental values.
> > > > > > > >
> > > > > > > > SIOV is transport level stuff, the transport virtqueue is
> > > > > > > > designed in a way that is general enough to cover it. Let's
> > > > > > > > not shift
> > > > concepts.
> > > > > > > >
> > > > > > > Such TVQ is only for backward compatible vPCI composition.
> > > > > > > For ground up work such TVQ must not be done through the owner
> > > > device.
> > > > > >
> > > > > > That's the idea actually.
> > > > > >
> > > > > > > Each SIOV device to have its own channel to communicate
> > > > > > > directly to the
> > > > > > device.
> > > > > > >
> > > > > > > > One thing that you ignore is that, hypervisor can use what
> > > > > > > > you invented as a transport for VF, no?
> > > > > > > >
> > > > > > > No. by design,
> > > > > >
> > > > > > It works like hypervisor traps the virtio config and forwards it
> > > > > > to admin virtqueue and starts the device via device context.
> > > > > It needs more granular support than the management framework of
> > > > > device
> > > > context.
> > > >
> > > > It doesn't; otherwise it is a design defect, as you can't recover the
> > > > device context in the destination.
> > > >
> > > > Let me give you an example:
> > > >
> > > > 1) in the case of live migration, dst receives migration byte flows
> > > > and converts them into device context
> > > > 2) in the case of transporting, hypervisor traps virtio config and
> > > > converts it into the device context
> > > >
> > > > I don't see anything different in this case. Or can you give me an example?
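
The two cases listed above can be sketched side by side. This is only an illustration under stated assumptions: the TLV wire layout (16-bit type, 16-bit length, little endian), the type numbers VQ_SIZE and VQ_ADDR, and both function names are hypothetical and are not taken from the virtio spec or from this series. It shows that a decoded migration byte flow and a set of trapped config writes can end up as the same device context mapping.

```python
import struct

# Hypothetical TLV type numbers, chosen only for this illustration.
VQ_SIZE, VQ_ADDR = 1, 2


def encode_tlv(tlv_type, value):
    """Pack one record as (type: u16, length: u16, value bytes)."""
    return struct.pack("<HH", tlv_type, len(value)) + value


def context_from_byte_flow(stream):
    """Case 1: the destination decodes a migration byte flow."""
    ctx, offset = {}, 0
    while offset < len(stream):
        tlv_type, length = struct.unpack_from("<HH", stream, offset)
        offset += 4
        ctx[tlv_type] = stream[offset:offset + length]
        offset += length
    return ctx


def context_from_trapped_writes(writes):
    """Case 2: the hypervisor collects trapped (type, value) config writes."""
    ctx = {}
    for tlv_type, value in writes:
        ctx[tlv_type] = value
    return ctx
```

Whether the second path is a sensible use of the interface is exactly what the rest of the thread debates; the sketch only shows that both paths can produce the same data.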
> > > In #1 dst received byte flows one or multiple times.
> >
> > How can this be different?
> >
> > Transport can also receive initial state incrementally.
> >
> Transport is just simple register RW interface without any caching layer in-between.
> More below.
> > > And byte flows can be large.
> >
> > So when doing transport, it is not that large, that's it. If it can work with large
> > byte flow, why can't it work for small?
> Write context can be used (abused) for a different purpose.
> Read cannot because it is meant to be incremental.

Well, the hypervisor can just cache what it reads since the last read, what's
wrong with it?

> One can invent a cheap command to read it.

For sure, but it's not the context here.

>
>
> >
> > > So it does not always contain everything. It only contains the new delta of the
> > device context.
> >
> > Isn't it just how current PCI transport does?
> >
> No. PCI transport has explicit API between device and driver to read or write at specific offset and value.

The point is that they are functional equivalents.

>
> > Guest configure the following one by one:
> >
> > 1) vq size
> > 2) vq addresses
> > 3) MSI-X
> >
> > etc?
> >
> I think you interpreted "incremental" differently than I described.
> In the device context read, the incremental is:
>
> If the hypervisor driver has read the device context twice, the second read won't return any new data if nothing changed.

See above.

> For example, if RSS configuration didn't change between two reads, the second read won't return the TLV for RSS Context.
>
> While for transport the need is that, when the guest asks, the device must return the value regardless of whether it changed.
>
> So the notion of incremental is not by address, but by value.
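
A minimal sketch of the two read semantics contrasted above, assuming a device context modelled as named fields standing in for TLVs. The class names, method names and field names are hypothetical and exist only for this illustration. The incremental read returns only what changed since the previous read, the register-style read always returns the current value, and a reader-side cache can rebuild a full view by accumulating the deltas.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceContext:
    """Hypothetical device context: named fields standing in for TLVs."""
    fields: dict = field(default_factory=dict)
    _dirty: set = field(default_factory=set)

    def update(self, name, value):
        # Track a change only when the value actually differs.
        if self.fields.get(name) != value:
            self.fields[name] = value
            self._dirty.add(name)

    def read_incremental(self):
        # Device-context style read: return only the fields changed since
        # the previous read, then reset the change tracking.
        delta = {name: self.fields[name] for name in sorted(self._dirty)}
        self._dirty.clear()
        return delta

    def read_field(self, name):
        # Transport style read: the caller names the field and always gets
        # the current value, changed or not.
        return self.fields[name]


class HypervisorCache:
    """Accumulates deltas so the reader can keep a full view."""

    def __init__(self):
        self.view = {}

    def pull(self, ctx):
        delta = ctx.read_incremental()
        self.view.update(delta)
        return delta
```

With this model, an unchanged RSS field is absent from the second incremental read, while `read_field` still returns it and the cached view still contains it.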
>
> > > For example, VQ configuration is exchanged once between src and dst.
> > > But VQ avail and used index may be updated multiple times.
> >
> > If it can work with multiple times of updating, why can't it work if we just
> > update it once?
> Functionally it can work.

I think you answered it yourself.

> Performance wise, one does not want to update multiple times, unless there is a change.
>
> Read as explained above is not meant to return same content again.
>
> >
> > > So here the hypervisor does not want to read any specific set of fields, and the
> > > hypervisor is not parsing them either.
> > > It is just a byte stream for it.
> >
> > Firstly, spec must define the device context format, so hypervisor can
> > understand which byte is what otherwise you can't maintain migration
> > compatibility.
> Device context is defined already in the latest version.
>
> > Secondly, you can't mandate how the hypervisor is written.
> >
> > >
> > > As opposed to that, in case of transport, the guest explicitly asks to read or
> > write specific bytes.
> > > Therefore, it is not incremental.
> >
> > I'm totally lost. Which part of the transport is not incremental?
> >
> > >
> > > Additionally, if hypervisor has put the trap on virtio config, and
> > > because the member device already has the interface for virtio config,
> > >
> > > Hypervisor can directly write/read from the virtual config to the member's
> > config space, without going through the device context, right?
> >
> > It can do it, or it can choose not to. I don't see how it is related to the
> > discussion here.
> >
> It is. I don't see the point of the hypervisor not using the native interface provided by the member device.

It really depends on the case, and I see how it duplicates with the
functionality that is provided by both:

1) The existing PCI transport

or

2) The transport virtqueue

>
>  > >
> > > >
> > > > >
> > > > > >
> > > > > > > it is not good idea to overload management commands with
> > > > > > > actual run time
> > > > > > guest commands.
> > > > > > > The device context read writes are largely for incremental updates.
> > > > > >
> > > > > > It doesn't matter if it is incremental or not but
> > > > > >
> > > > > It does because you want different functionality only for purpose
> > > > > of backward
> > > > compatibility.
> > > > > That also if the device does not offer them as portion of MMIO BAR.
> > > >
> > > > I don't see how it is related to the "incremental part".
> > > >
> > > > >
> > > > > > 1) the function is there
> > > > > > 2) hypervisor can use that function if they want and virtio
> > > > > > (spec) can't forbid that
> > > > > >
> > > > > It is not about forbidding or supporting.
> > > > > It's about what functionality to use for the management plane and the guest
> > > > > plane.
> > > > > Both have different needs.
> > > >
> > > > People can have different views, there's nothing we can prevent a
> > > > hypervisor from using it as a transport as far as I can see.
> > > The device context write command can be used (or probably abused) to do
> > > a write, but I fail to see why one would use it.
> >
> > The function is there, you can't prevent people from doing that.
> >
> One can always mess up itself. :)
> It is not prevented. It is just not the right way to use the interface.
>
> > > Because member device already has the interface to do config read/write and
> > it is accessible to the hypervisor.
> >
> > Well, it looks self-contradictory again. Are you saying another set of commands
> > that is similar to device context is needed for non-PCI transport?
> >
> All this non-PCI transport discussion is just meaningless.
> Let MMIO bring in the concept of a member device; at that point there is something that makes sense to discuss.

It's not necessarily MMIO. For example the SIOV, which I don't think
can use the existing PCI transport.

> PCI SIOV is also a PCI device in the end.

We don't want to end up with two sets of commands to save/load SRIOV
and SIOV at least.

Thanks



>
> > >
> > > The read as_is using device context cannot be done because the caller is not
> > explicitly asking what to read.
> > > And the interface does not have it, because member device has it.
> > >
> > > So let's find out whether an incremental bit is needed in the device context read
> > > command, or optionally some bits to ask explicitly what to read.
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > For VF driver it has own direct channel via its own BAR to
> > > > > > > talk to the
> > > > device.
> > > > > > So no need to transport via PF.
> > > > > > > For SIOV for backward compat vPCI composition, it may be needed.
> > > > > > > Hard to say, if that can be memory mapped as well on the BAR of the
> > PF.
> > > > > > > We have seen one device supporting it outside of the virtio.
> > > > > > > For scale anyway, one needs to use the device own cvq for
> > > > > > > complex
> > > > > > configuration.
> > > > > >
> > > > > > That's the idea but I meant your current proposal overlaps those
> > functions.
> > > > > >
> > > > > Not really. One can have simple virtio config space access
> > > > > read/write
> > > > functionality, in addition to what is done here.
> > > > > And that is still fine. One is doing proxying for guest.
> > > > > Management plane is doing more than just register proxy.
> > > >
> > > > See above, let's figure out whether it is possible as a transport first then.
> > > >
> > > Right. Let's figure it out.
> > >
> > > I would still advocate not mixing management commands with transport
> > > commands.
> >
> > It's not a mixing, it's just because they are functional equivalents.
> >
> It is not.
> I clarified the fundamental difference between the two.
> One is explicit read and write.
> Other is, return read data on change.
> For write, it is explicit set and it does not take effect until the mode is changed back to active.
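
The write behaviour described above can be sketched as follows. This is an illustration only: the class, the mode names "active" and "freeze", and the field names are assumptions, not definitions from the series. The point shown is that a context write is an explicit set whose values are held back until the mode is switched back to active.

```python
class MemberDevice:
    """Hypothetical member device; mode names and fields are assumptions."""

    def __init__(self):
        self.mode = "active"
        self.config = {"vq0_size": 128}
        self._staged = {}

    def write_context(self, fields):
        # An explicit set: the values are recorded but not yet applied.
        self._staged.update(fields)

    def set_mode(self, mode):
        self.mode = mode
        if mode == "active":
            # Staged values take effect only on the switch back to active.
            self.config.update(self._staged)
            self._staged.clear()
```

A register-style transport write, by contrast, would be expected to change `config` immediately, which is the difference being argued here.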
>
> > > Commands are cheap in nature. For transport if needed, they can be explicit
> > commands.
> >
> > It will be a partial duplication of what is being proposed here.
>
> There is always some overlap between management plane (hypervisor set/get) and control plane (guest driver get/set).
> >
> > Thanks
> >
> >
> >
> > >
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > If for that is some admin commands are missing, may be
> > > > > > > > > > > one can add
> > > > > > > > them.
> > > > > > > > > >
> > > > > > > > > > I would then build the device context commands on top of
> > > > > > > > > > the transport commands/q, then it would be complete.
> > > > > > > > > >
> > > > > > > > > > > No need to step on toes of use cases as they are different...
> > > > > > > > > > >
> > > > > > > > > > > > I've shown you that
> > > > > > > > > > > >
> > > > > > > > > > > > 1) you can't easily say you can pass through all the
> > > > > > > > > > > > virtio facilities
> > > > > > > > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > > > > > > > >
> > > > > > > > > > > It is not, it is well defined in v3, v2.
> > > > > > > > > > > One can continue to argue and keep defining the
> > > > > > > > > > > variant and still call it data
> > > > > > > > > > path acceleration and then claim it as passthrough ...
> > > > > > > > > > > But I won't debate this anymore as its just
> > > > > > > > > > > non-technical aspects of least
> > > > > > > > > > interest.
> > > > > > > > > >
> > > > > > > > > > You use this terminology in the spec which is all about
> > > > > > > > > > technical, and you think how to define it is a matter of
> > > > > > > > > > non-technical. This is self-contradictory. If you fail,
> > > > > > > > > > it probably means it's
> > > > > > ambiguous.
> > > > > > > > > > Let's not use that terminology.
> > > > > > > > > >
> > > > > > > > > What it means is described in theory of operation.
> > > > > > > > >
> > > > > > > > > > > We have technical tasks and more improved specs to
> > > > > > > > > > > update going
> > > > > > > > forward.
> > > > > > > > > >
> > > > > > > > > > It's a burden to do the synchronization.
> > > > > > > > > We have discussed this.
> > > > > > > > > In current proposed the member device is not bifurcated,
> > > > > > > >
> > > > > > > > It is. Part of the functions were carried via the PCI
> > > > > > > > interface, some are carried via owner. You end up with two
> > > > > > > > drivers to drive the
> > > > > > devices.
> > > > > > > >
> > > > > > > Nop.
> > > > > > > All admin work of device migration is carried out via the owner
> > device.
> > > > > > > All guest triggered work is carried out using VF itself.
> > > > > >
> > > > > > Guests don't (or can't) care about how the hypervisor is structured.
> > > > > For passthrough mode, it just cannot be structured inside the VF.
> > > >
> > > > Well, again, we are talking about different things.
> > > >
> > > > >
> > > > > > So we're discussing the view of the device: member devices need to
> > > > > > serve
> > > > > >
> > > > > > 1) request from the transport (it's guest in your context)
> > > > > > 2) request from the owner
> > > > >
> > > > > Doing #2 of the owner on the member device functionality does not
> > > > > work when the hypervisor does not have access to the member device.
> > > >
> > > > I don't get here, isn't 2) just what we invent for admin commands?
> > > > Driver sends commands to the owner, owner forward those requests to
> > > > the member?
> > > I am lost with the term "driver" without a guest/hypervisor prefix.
> > >
> > > In one model,
> > > Member device does everything through its native interface = virtio config
> > and device space, cvq, data vqs etc.
> > > Here the member device does not forward anything to its owner.
> > >
> > > The live migration hypervisor driver, which has the knowledge of the live migration
> > > flow, accesses the owner device and gets the member's side-band information to
> > > control it.
> > > So the member driver does not forward anything here to the owner driver.
> > >
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  4:20                                                                                       ` Jason Wang
@ 2023-11-16  5:28                                                                                         ` Parav Pandit
  2023-11-16  6:23                                                                                           ` Michael S. Tsirkin
  2023-11-21  7:24                                                                                           ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-16  5:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 16, 2023 9:50 AM
> 
> On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 13, 2023 9:03 AM
> > >
> > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Tuesday, November 7, 2023 9:35 AM
> > > > >
> > > > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > > > >
> > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > > > > > > >
> > > > > > > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit
> > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > > > > <virtio-comment@lists.oasis-open.org> On
> > > > > > > > > > > > > > > Behalf Of Jason Wang
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav
> > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav
> > > > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > From: Jason Wang
> > > > > > > > > > > > > > > > > > > <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > > > Sent: Wednesday, October 25, 2023
> > > > > > > > > > > > > > > > > > > 6:59 AM
> > > > > > > > > > > > > > > > > > > > For passthrough PASID assignment
> > > > > > > > > > > > > > > > > > > > vq is not
> > > needed.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > > > > > > Because for passthrough, the
> > > > > > > > > > > > > > > > > > hypervisor is not involved in dealing
> > > > > > > > > > > > > > > > > > with VQ at
> > > > > > > > > > > > > > > > > all.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Ok, so if I understand correctly, you
> > > > > > > > > > > > > > > > > are saying your design can't work for
> > > > > > > > > > > > > > > > > the case of PASID
> > > assignment.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > No. PASID assignment will happen from the
> > > > > > > > > > > > > > > > guest for its own use and device
> > > > > > > > > > > > > > > migration will just work fine because device
> > > > > > > > > > > > > > > context will capture
> > > > > > > this.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It's not about device context. We're
> > > > > > > > > > > > > > > discussing "passthrough",
> > > > > > > no?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > Not sure, we are discussing same.
> > > > > > > > > > > > > > A member device is passthrough to the guest,
> > > > > > > > > > > > > > dealing with its own PASIDs and
> > > > > > > > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > > > > > > > So VQ context captured by the hypervisor, will
> > > > > > > > > > > > > > have some PASID attached to
> > > > > > > > > > > > > this VQ.
> > > > > > > > > > > > > > Device context will be updated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You want all virtio stuff to be
> > > > > > > > > > > > > > > "passthrough", but assigning a PASID to a
> > > > > > > > > > > > > > > specific virtqueue in the guest must be
> > > > > > > trapped.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > No. PASID assignment to a specific virtqueue
> > > > > > > > > > > > > > in the guest must go directly
> > > > > > > > > > > > > from guest to device.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This works like setting CR3, you can't simply
> > > > > > > > > > > > > let it go from guest to
> > > > > > > host.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Host IOMMU driver needs to know the PASID to
> > > > > > > > > > > > > program the IO page tables correctly.
> > > > > > > > > > > > >
> > > > > > > > > > > > This will be done by the IOMMU.
> > > > > > > > > > > >
> > > > > > > > > > > > > > When guest iommu may need to communicate
> > > > > > > > > > > > > > anything for this PASID, it will
> > > > > > > > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Let's say using PASID X for queue 0, this
> > > > > > > > > > > > > knowledge is beyond the IOMMU scope but belongs
> > > > > > > > > > > > > to virtio. Or please explain how it can work
> > > > > > > > > > > > > when it goes directly from guest to
> > > > > device.
> > > > > > > > > > > > >
> > > > > > > > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > > > > > > > >
> > > > > > > > > > > It has one.
> > > > > > > > > > >
> > > > > > > > > > > > For ok for theory sake it is there.
> > > > > > > > > > > >
> > > > > > > > > > > > Virtio driver will assign the PASID directly from
> > > > > > > > > > > > guest driver to device using a
> > > > > > > > > > > create_vq(pasid=X) command.
> > > > > > > > > > > > Same process is somehow attached the PASID by the guest
> OS.
> > > > > > > > > > > > The whole PASID range is known to the hypervisor
> > > > > > > > > > > > when the device is handed
> > > > > > > > > > > over to the guest VM.
> > > > > > > > > > >
> > > > > > > > > > > How can it know?
> > > > > > > > > > >
> > > > > > > > > > > > So PASID mapping is setup by the hypervisor IOMMU
> > > > > > > > > > > > at this
> > > point.
> > > > > > > > > > >
> > > > > > > > > > > You disallow the PASID to be virtualized here.
> > > > > > > > > > > What's more, such a PASID passthrough has security
> implications.
> > > > > > > > > > >
> > > > > > > > > > No. virtio spec is not disallowing. At least for sure,
> > > > > > > > > > this series is not the
> > > > > > > one.
> > > > > > > > > > My main point is, virtio device interface will not be
> > > > > > > > > > the source of hypercall to
> > > > > > > > > program IOMMU in the hypervisor.
> > > > > > > > > > It is something to be done by IOMMU side.
> > > > > > > > >
> > > > > > > > > So unless vPASID can be used by the hardware you need to
> > > > > > > > > trap the mapping from a PASID to a virtqueue. Then you
> > > > > > > > > need virtio specific
> > > > > > > knowledge.
> > > > > > > > >
> > > > > > > > vPASID by hardware is unlikely to be used by hw PCI EP
> > > > > > > > devices at least in any
> > > > > > > near term future.
> > > > > > > > This requires either vPASID to pPASID table in device or in IOMMU.
> > > > > > >
> > > > > > > So we are on the same page.
> > > > > > >
> > > > > > > Claiming a method that can only work for passthrough or
> > > > > > > emulation is not
> > > > > good.
> > > > > > > We all know virtualization is passthrough + emulation.
> > > > > > Again, I agree but I wont generalize it here.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Again, we are talking about different things, I've
> > > > > > > > > > > tried to show you that there are cases that
> > > > > > > > > > > passthrough can't work but if you think the only way
> > > > > > > > > > > for migration is to use passthrough in every case,
> > > > > > > > > > > you will
> > > > > > > > > probably fail.
> > > > > > > > > > >
> > > > > > > > > > I didn't say only way for migration is passthrough.
> > > > > > > > > > Passthrough is clearly one way.
> > > > > > > > > > Other ways may be possible.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > There are works ongoing to make
> > > > > > > > > > > > > > > > > > > vPASID work for the guest like
> > > > > > > > > > > > > vSVA.
> > > > > > > > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Great, you find another limitation of
> > > > > > > > > > > > > > > > > "passthrough" by
> > > > > > > yourself.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > No. it is not the limitation it is just
> > > > > > > > > > > > > > > > the way it does not need complex SVA to
> > > > > > > > > > > > > > > split the device for unrelated usage.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > How can you limit the user in the guest to not use
> vSVA?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > He he, I am not limiting, again
> > > > > > > > > > > > > > misunderstanding or wrong
> > > > > > > attribution.
> > > > > > > > > > > > > > I explained that hypervisor for passthrough
> > > > > > > > > > > > > > does not need
> > > SVA.
> > > > > > > > > > > > > > Guest can do anything it wants from the guest
> > > > > > > > > > > > > > OS with the member
> > > > > > > > > > > device.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ok, so the point stills, see above.
> > > > > > > > > > > >
> > > > > > > > > > > > I don’t think so. The guest owns its PASID space
> > > > > > > > > > >
> > > > > > > > > > > Again, vPASID to PASID can't be done hardware unless
> > > > > > > > > > > I miss some recent features of IOMMUs.
> > > > > > > > > > >
> > > > > > > > > > Cpu vendors have different way of doing vPASID to pPASID.
> > > > > > > > >
> > > > > > > > > At least for the current version of major IOMMU vendors,
> > > > > > > > > such translation (aka PASID remapping) is not
> > > > > > > > > implemented in the hardware so it needs to be trapped first.
> > > > > > > > >
> > > > > > > > > Right. So it is really far in future, at least a few years away.
> > > > > > > >
> > > > > > > > > > It is still an early space for virtio.
> > > > > > > > > >
> > > > > > > > > > > > and directly communicates like any other device attribute.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Each passthrough device has PASID from
> > > > > > > > > > > > > > > > > > its own space fully managed by the
> > > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > > > Some cpu required vPASID and SIOV is
> > > > > > > > > > > > > > > > > > not going this way
> > > > > > > > > > anymore.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Then how to migrate? Invent a full set
> > > > > > > > > > > > > > > > > of something else through another giant
> > > > > > > > > > > > > > > > > series like this to migrate to the SIOV
> > > > > > > > > thing?
> > > > > > > > > > > > > > > > > That's a mess for
> > > > > > > > > > > > > > > sure.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > SIOV will for sure reuse most or all parts
> > > > > > > > > > > > > > > > of this work, almost entirely
> > > > > > > > > > > as_is.
> > > > > > > > > > > > > > > > vPASID is cpu/platform specific things not
> > > > > > > > > > > > > > > > part of the SIOV
> > > > > > > devices.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > If at all it is done, it will be
> > > > > > > > > > > > > > > > > > > > done from the guest by the driver
> > > > > > > > > > > > > > > > > > > > using virtio
> > > > > > > > > > > > > > > > > > > interface.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Then you need to trap. Such things
> > > > > > > > > > > > > > > > > > > couldn't be passed through to guests
> > > > > > > > > > > > > > > > > directly.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Only PASID capability is trapped.
> > > > > > > > > > > > > > > > > > PASID allocation and usage is directly
> > > > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > How can you achieve this? Assigning a
> > > > > > > > > > > > > > > > > > PASID to a device is completely
> > > > > > > > > > > > > > > > > device(virtio) specific. How can you use
> > > > > > > > > > > > > > > > > a general layer without the knowledge of
> > > > > > > > > > > > > > > > > virtio to trap
> > > that?
> > > > > > > > > > > > > > > > When one wants to map vPASID to pPASID a
> > > > > > > > > > > > > > > > platform needs to be
> > > > > > > > > > > > > involved.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'm not talking about how to map vPASID to
> > > > > > > > > > > > > > > pPASID, it's out of the scope of virtio. I'm
> > > > > > > > > > > > > > > talking about assigning a vPASID to a
> > > > > > > > > > > > > > > specific virtqueue or other virtio function
> > > > > > > > > > > > > > > in the
> > > > > > > guest.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > That can be done in the guest. The key is
> > > > > > > > > > > > > > guest wont know that it is dealing
> > > > > > > > > > > > > with vPASID.
> > > > > > > > > > > > > > It will follow the same principle from your
> > > > > > > > > > > > > > paper of equivalency, where virtio
> > > > > > > > > > > > > software layer will assign PASID to VQ and
> > > > > > > > > > > > > communicate to
> > > > > device.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Anyway, all of this just digression from current series.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's not, as you mention that only MSI-X is
> > > > > > > > > > > > > trapped, I give you another
> > > > > > > > > one.
> > > > > > > > > > > > >
> > > > > > > > > > > > PASID access from the guest to be done fully by
> > > > > > > > > > > > the guest
> > > IOMMU.
> > > > > > > > > > > > Not by virtio devices.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You need a virtio specific queue or
> > > > > > > > > > > > > > > capability to assign a PASID to a specific
> > > > > > > > > > > > > > > virtqueue, and that can't be done without
> > > > > > > > > > > > > > > > trapping and without virtio specific
> > > > > knowledge.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > I disagree. PASID assignment to a virqueue in
> > > > > > > > > > > > > > future from guest virtio driver to
> > > > > > > > > > > > > device is uniform method.
> > > > > > > > > > > > > > Whether its PF assigning PASID to VQ of self,
> > > > > > > > > > > > > > Or VF driver in the guest assigning PASID to VQ.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > All same.
> > > > > > > > > > > > > > Only IOMMU layer hypercalls will know how to
> > > > > > > > > > > > > > deal with PASID assignment at
> > > > > > > > > > > > > platform layer to setup the domain etc table.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And this is way beyond our device migration discussion.
> > > > > > > > > > > > > > By any means, if you were implying that
> > > > > > > > > > > > > > somehow vq to PASID assignment
> > > > > > > > > > > > > _may_ need trap+emulation, hence whole device
> > > > > > > > > > > > > migration to depend on some
> > > > > > > > > > > > > > trap+emulation, then surely I do not agree to it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > See above.
> > > > > > > > > > > > >
> > > > > > > > > > > > Yeah, I disagree to such implying.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD
> > > > > > > > > > > > > > isolating the guest process and
> > > > > > > > > > > > > all of that just works on efficiency and
> > > > > > > > > > > > > equivalence principle already for a decade now
> > > > > > > > > > > > > without any
> > > trap+emulation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > When virtio passthrough device is in
> > > > > > > > > > > > > > > > guest, it has all its PASID
> > > > > > > > > > > accessible.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > All these is large deviation from current
> > > > > > > > > > > > > > > > discussion of this series, so I will keep
> > > > > > > > > > > > > > > it short.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Regardless it is not relevant to
> > > > > > > > > > > > > > > > > > passthrough mode as PASID is yet
> > > > > > > > > > > > > > > > > > another
> > > > > > > > > > > > > > > > > resource.
> > > > > > > > > > > > > > > > > > And for some cpu if it is trapped, it
> > > > > > > > > > > > > > > > > > is generic layer, that does not
> > > > > > > > > > > > > > > > > > require virtio
> > > > > > > > > > > > > > > > > involvement.
> > > > > > > > > > > > > > > > > > So virtio interface asking to trap
> > > > > > > > > > > > > > > > > > something because generic facility has
> > > > > > > > > > > > > > > > > > done
> > > > > > > > > > > > > > > > > in not the approach.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This misses the point of PASID. How to
> > > > > > > > > > > > > > > > > use PASID is totally device
> > > > > > > > > > > > > specific.
> > > > > > > > > > > > > > > > Sure, and how to virtualize vPASID/pPASID
> > > > > > > > > > > > > > > > is platform specific as single PASID
> > > > > > > > > > > > > > > can be used by multiple devices and process.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Capabilities of #2 is generic
> > > > > > > > > > > > > > > > > > > > across all pci devices, so it will
> > > > > > > > > > > > > > > > > > > > be handled by the
> > > > > > > > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > > > > > > > ATS/PRI cap is also generic manner
> > > > > > > > > > > > > > > > > > > > handled by the HV and PCI
> > > > > > > > > > > > > device.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > No, ATS/PRI requires the cooperation
> > > > > > > > > > > > > > > > > > > from the
> > > > > vIOMMU.
> > > > > > > > > > > > > > > > > > > You can simply do ATS/PRI
> > > > > > > > > > > > > > > > > > > passthrough but with an emulated
> > > > > > > > > > > vIOMMU.
> > > > > > > > > > > > > > > > > > And that is not the reason for virtio
> > > > > > > > > > > > > > > > > > device to build
> > > > > > > > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > > > > > > > passthrough member devices.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > vIOMMU is emulated by hypervisor with a
> > > > > > > > > > > > > > > > > PRI queue,
> > > > > > > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Shouldn't it arrive at platform IOMMU first?
> > > > > > > > > > > > > > > > The path should be PRI -> RC -> IOMMU -> host ->
> > > > > > > > > > > > > > > > Hypervisor -> vIOMMU PRI -> guest IOMMU.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Above sequence seems right.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And things will be more complicated when
> > > > > > > > > > > > > > > (v)PASID is
> > > used.
> > > > > > > > > > > > > > > So you can't simply let PRI go directly to
> > > > > > > > > > > > > > > the guest with the current
> > > > > > > > > > > architecture.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > In current architecture of the pci VF, PRI
> > > > > > > > > > > > > > does not go directly to the
> > > > > > > > > guest.
> > > > > > > > > > > > > > (and that is not reason to trap and emulate other things).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we
> > > > > > > > > > > > > will probably trap other things in the future
> > > > > > > > > > > > > like PASID
> > > assignment.
> > > > > > > > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > > > > > > > >
> > > > > > > > > > > It's not about the capability, it's about the whole
> > > > > > > > > > > process of PRI request handling. We've agreed that
> > > > > > > > > > > the PRI request needs to be trapped by the
> > > > > > > > > > > hypervisor and then delivered to the
> > > > > vIOMMU.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > Trap+emulation done in generic manner without
> > > > > > > > > > > > > involving virtio or other
> > > > > > > > > > > device types.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > how can you pass through a hardware PRI
> > > > > > > > > > > > > > > > > request to a guest directly without
> > > > > > > > > > > > > > > > > trapping it
> > > > > > > > > > > > > then?
> > > > > > > > > > > > > > > > > What's more, PCIE allows the PRI to be
> > > > > > > > > > > > > > > > > done in a vendor
> > > > > > > > > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > > > > > > > for virtio?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > > > > > > > Do you have a reference to the ECN that
> > > > > > > > > > > > > > > > enables vendor specific way of PRI? I
> > > > > > > > > > > > > > > would like to read it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I mean it doesn't forbid us to build a
> > > > > > > > > > > > > > > virtio specific interface for I/O page fault report and
> recovery.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > So PRI of PCI does not allow. It is ODP kind
> > > > > > > > > > > > > > of technique you meant
> > > > > > > > > above.
> > > > > > > > > > > > > > Yes one can build.
> > > > > > > > > > > > > > Ok. unrelated to device migration, so I will
> > > > > > > > > > > > > > park this good discussion for
> > > > > > > > > > > later.
> > > > > > > > > > > > >
> > > > > > > > > > > > > That's fine.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This will be very good to eliminate IOMMU
> > > > > > > > > > > > > > > > PRI
> > > limitations.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Probably.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > PRI will directly go to the guest driver,
> > > > > > > > > > > > > > > > and guest would interact with IOMMU
> > > > > > > > > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > > > > > > > But again these PCI-SIG extensions of PASID
> > > > > > > > > > > > > > are not related to device
> > > > > > > > > > > > > > migration, so I am deferring it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > For PRI in vendor specific way needs a
> > > > > > > > > > > > > > > > separate discussion. It is not related to
> > > > > > > > > > > > > > > live migration.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > PRI itself is not related. But the point is,
> > > > > > > > > > > > > > > you can't simply pass through ATS/PRI now.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ah ok. the whole 4K PCI config space where
> > > > > > > > > > > > > > ATS/PRI capabilities are located
> > > > > > > > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > > > > > > > So?
> > > > > > > > > > > > > > > So do we start emulating virtio interfaces too
> > > > > > > > > > > > > > for
> > > passthrough?
> > > > > > > > > > > > > > No.
> > > > > > > > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > > > > > > > Sure why not?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Then let's not limit your proposal to be used by
> "passthrough"
> > > > > only?
> > > > > > > > > > > > One can possibly build some variant of the
> > > > > > > > > > > > existing virtio member device
> > > > > > > > > > > using same owner and member scheme.
> > > > > > > > > > >
> > > > > > > > > > > It's not about the member/owner, it's about e.g
> > > > > > > > > > > whether the hypervisor can trap and emulate.
> > > > > > > > > > >
> > > > > > > > > > > I've pointed out that what you invent here is
> > > > > > > > > > > actually a partial new transport, for example, a
> > > > > > > > > > > hypervisor can trap and use things like device
> > > > > > > > > > > context in PF to bypass the registers in VF. This is
> > > > > > > > > > > the idea of
> > > > > > > > > transport commands/q.
> > > > > > > > > > >
> > > > > > > > > > I will not mix transport commands which are mainly
> > > > > > > > > > useful for actual device
> > > > > > > > > operation for SIOV only for backward compatibility that
> > > > > > > > > too
> > > optionally.
> > > > > > > > > > One may still choose to have virtio common and device
> > > > > > > > > > config in MMIO
> > > > > > > > > ofcourse at lower scale.
> > > > > > > > > >
> > > > > > > > > > Anyway, mixing migration context with actual SIOV
> > > > > > > > > > specific thing is not correct
> > > > > > > > > as device context is read/write incremental values.
> > > > > > > > >
> > > > > > > > > SIOV is transport level stuff, the transport virtqueue
> > > > > > > > > is designed in a way that is general enough to cover it.
> > > > > > > > > Let's not shift
> > > > > concepts.
> > > > > > > > >
> > > > > > > > Such TVQ is only for backward compatible vPCI composition.
> > > > > > > > For ground up work such TVQ must not be done through the
> > > > > > > > owner
> > > > > device.
> > > > > > >
> > > > > > > That's the idea actually.
> > > > > > >
> > > > > > > > Each SIOV device to have its own channel to communicate
> > > > > > > > directly to the
> > > > > > > device.
> > > > > > > >
> > > > > > > > > One thing that you ignore is that, hypervisor can use
> > > > > > > > > what you invented as a transport for VF, no?
> > > > > > > > >
> > > > > > > > No. by design,
> > > > > > >
> > > > > > > It works like hypervisor traps the virtio config and
> > > > > > > forwards it to admin virtqueue and starts the device via device
> context.
> > > > > > It needs more granular support than the management framework
> > > > > > of device
> > > > > context.
> > > > >
> > > > > It doesn't otherwise it is a design defect as you can't recover
> > > > > the device context in the destination.
> > > > >
> > > > > Let me give you an example:
> > > > >
> > > > > 1) in the case of live migration, dst receive migration byte
> > > > > flows and convert them into device context
> > > > > 2) in the case of transporting, hypervisor traps virtio config
> > > > > and convert them into the device context
> > > > >
> > > > > I don't see anything different in this case. Or can you give me an
> example?
> > > > In #1 dst received byte flows one or multiple times.
> > >
> > > How can this be different?
> > >
> > > Transport can also receive initial state incrementally.
> > >
> > Transport is just simple register RW interface without any caching layer in-
> between.
> > More below.
> > > > And byte flows can be large.
> > >
> > > So when doing transport, it is not that large, that's it. If it can
> > > work with large byte flow, why can't it work for small?
> > Write context can be used (abused) for a different purpose.
> > Read cannot because it is meant to be incremental.
> 
> Well hypervisor can just cache what it reads since the last, what's wrong with it?
> 
But the hypervisor does not know what changed, so it has to do guesswork to find out what to query.

> > One can invent a cheap command to read it.
> 
> For sure, but it's not the context here.
>
It is.  
> >
> >
> > >
> > > > So it does not always contain everything. It only contains the new
> > > > delta of the
> > > device context.
> > >
> > > Isn't it just how current PCI transport does?
> > >
> > No. PCI transport has explicit API between device and driver to read or write
> at specific offset and value.
> 
> The point is that they are functional equivalents.
> 
I disagree.
There are two different functionalities.

Functionality_1: explicit ask for read or write
Functionality_2: read what has changed

Should one merge 1 and 2 and complicate the command? 
I prefer not to.

Having two different commands also helps debugging, since it differentiates management commands from guest-initiated commands. :)
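
To make the distinction between the two functionalities concrete, here is a minimal sketch. It is only an illustration, not anything from the spec or this series: the class, the field names and the dirty-set bookkeeping are all made-up assumptions. Functionality_1 always returns what was asked for; Functionality_2 returns only what changed since the previous read.

```python
# Hypothetical model for illustration only; none of these names come from
# the virtio spec or from this patch series.
class MemberDevice:
    def __init__(self):
        # Made-up device state, keyed by an illustrative field name.
        self.fields = {"vq_size": 256, "vq_avail_idx": 0, "rss_key": 7}
        # Fields modified since the last device-context read.
        # Everything counts as new before the first read.
        self.dirty = set(self.fields)

    def write_field(self, name, value):
        """Functionality_1 (write): explicit write of one named field."""
        if self.fields.get(name) != value:
            self.fields[name] = value
            self.dirty.add(name)

    def read_field(self, name):
        """Functionality_1 (read): returns the value whether it changed or not."""
        return self.fields[name]

    def read_context_delta(self):
        """Functionality_2: return only the fields changed since the last call."""
        delta = {name: self.fields[name] for name in sorted(self.dirty)}
        self.dirty.clear()
        return delta
```

In this sketch a second read_context_delta() call with no intervening change returns an empty result, while read_field() keeps returning the current value. That is the sense of "incremental by value, not by address" discussed in this thread.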

> >
> > > Guest configure the following one by one:
> > >
> > > 1) vq size
> > > 2) vq addresses
> > > 3) MSI-X
> > >
> > > etc?
> > >
> > I think you interpreted "incremental" differently than I described.
> > In the device context read, the incremental is:
> >
> > If the hypervisor driver has read the device context twice, the second read
> won't return any new data if nothing changed.
> 
> See above.
>
Yeah, two separate commands are needed.
 
> > For example, if RSS configuration didn’t change between two reads, the
> second read wont return the TLV for RSS Context.
> >
> > While for transport the need is, when guest asked, one device must read it
> regardless of the change.
> >
> > So notion of incremental is not by address, but by the value.
> >
> > > > For example, VQ configuration is exchanged once between src and dst.
> > > > But VQ avail and used index may be updated multiple times.
> > >
> > > If it can work with multiple times of updating, why can't it work if
> > > we just update it once?
> > Functionally it can work.
> 
> I think you answer yourself.
>
Yes, but I don't like abusing the command.
 
> > Performance wise, one does not want to update multiple times, unless there
> is a change.
> >
> > Read as explained above is not meant to return same content again.
> >
> > >
> > > > So here hypervisor do not want to read any specific set of fields
> > > > and
> > > hypervisor is not parsing them either.
> > > > It is just a byte stream for it.
> > >
> > > Firstly, spec must define the device context format, so hypervisor
> > > can understand which byte is what otherwise you can't maintain
> > > migration compatibility.
> > Device context is defined already in the latest version.
> >
> > > Secondly, you can't mandate how the hypervisor is written.
> > >
> > > >
> > > > As opposed to that, in case of transport, the guest explicitly
> > > > asks to read or
> > > write specific bytes.
> > > > Therefore, it is not incremental.
> > >
> > > I'm totally lost. Which part of the transport is not incremental?
> > >
> > > >
> > > > Additionally, if hypervisor has put the trap on virtio config, and
> > > > because the memory device already has the interface for virtio
> > > > config,
> > > >
> > > > Hypervisor can directly write/read from the virtual config to the
> > > > member's
> > > config space, without going through the device context, right?
> > >
> > > If it can do it or it can choose to not. I don't see how it is
> > > related to the discussion here.
> > >
> > It is. I don’t see a point of hypervisor not using the native interface provided
> by the member device.
> 
> It really depends on the case, and I see how it duplicates with the functionality
> that is provided by both:
> 
> 1) The existing PCI transport
> 
> or
> 
> 2) The transport virtqueue
> 
I would like to conclude that we disagree in our approaches.
The PCI transport lets the guest driver communicate directly with the member device.
This is uniform across PF, VFs, SIOV.

Admin commands are transport independent and their task is device migration.
One is not replacing the other.

Transport virtqueue will never transport driver notifications, hence it does not qualify as a "transport".

For the vdpa case, there is no need for extra admin commands as the mediation layer can directly use the interface available from the member device itself.

You continue to want to overload admin commands for dual purpose, does not make sense to me.

> >
> >  > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > it is not good idea to overload management commands with
> > > > > > > > actual run time
> > > > > > > guest commands.
> > > > > > > > The device context read writes are largely for incremental updates.
> > > > > > >
> > > > > > > It doesn't matter if it is incremental or not but
> > > > > > >
> > > > > > It does because you want different functionality only for
> > > > > > purpose of backward
> > > > > compatibility.
> > > > > > That also if the device does not offer them as portion of MMIO BAR.
> > > > >
> > > > > I don't see how it is related to the "incremental part".
> > > > >
> > > > > >
> > > > > > > 1) the function is there
> > > > > > > 2) hypervisor can use that function if they want and virtio
> > > > > > > (spec) can't forbid that
> > > > > > >
> > > > > > It is not about forbidding or supporting.
> > > > > > Its about what functionality to use for management plane and
> > > > > > guest
> > > plane.
> > > > > > Both have different needs.
> > > > >
> > > > > People can have different views, there's nothing we can prevent
> > > > > a hypervisor from using it as a transport as far as I can see.
> > > > For device context write command, it can be used (or probably
> > > > abused) to do
> > > write but I fail to see why to use it.
> > >
> > > The function is there, you can't prevent people from doing that.
> > >
> > One can always mess up itself. :)
> > It is not prevented. It is just not right way to use the interface.
> >
> > > > Because member device already has the interface to do config
> > > > read/write and
> > > it is accessible to the hypervisor.
> > >
> > > Well, it looks self-contradictory again. Are you saying another set
> > > of commands that is similar to device context is needed for non-PCI
> transport?
> > >
> > All these non pci transport discussion is just meaning less.
> > Let MMIO bring the concept of member device at that point something make
> sense to discuss.
> 
> It's not necessarily MMIO. For example the SIOV, which I don't think can use the
> existing PCI transport.
> 
> > PCI SIOV is also the PCI device at the end.
> 
> We don't want to end up with two sets of commands to save/load SRIOV and
> SIOV at least.
> 
This proposal ensures that SRIOV and SIOV devices are treated equally.
How a brand new, non-backward compatible SIOV device transports this is outside the scope of this work.

> Thanks
> 
> 
> 
> >
> > > >
> > > > The read as_is using device context cannot be done because the
> > > > caller is not
> > > explicitly asking what to read.
> > > > And the interface does not have it, because member device has it.
> > > >
> > > > So lets find the need if incremental bit is needed in the
> > > > device_Context read
> > > command or not or a bits to ask explicitly what to read optionally.
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > For VF driver it has own direct channel via its own BAR to
> > > > > > > > talk to the
> > > > > device.
> > > > > > > So no need to transport via PF.
> > > > > > > > For SIOV for backward compat vPCI composition, it may be needed.
> > > > > > > > Hard to say, if that can be memory mapped as well on the
> > > > > > > > BAR of the
> > > PF.
> > > > > > > > We have seen one device supporting it outside of the virtio.
> > > > > > > > For scale anyway, one needs to use the device own cvq for
> > > > > > > > complex
> > > > > > > configuration.
> > > > > > >
> > > > > > > That's the idea but I meant your current proposal overlaps
> > > > > > > those
> > > functions.
> > > > > > >
> > > > > > Not really. One can have simple virtio config space access
> > > > > > read/write
> > > > > functionality, in addition to what is done here.
> > > > > > And that is still fine. One is doing proxying for guest.
> > > > > > Management plane is doing more than just register proxy.
> > > > >
> > > > > See above, let's figure out whether it is possible as a transport first then.
> > > > >
> > > > Right. lets figure out.
> > > >
> > > > I would still promote to not mix management command with transport
> > > command.
> > >
> > > It's not a mixing, it's just because they are functional equivalents.
> > >
> > It is not.
> > I clarified the fundamental difference between the two.
> > One is explicit read and write.
> > Other is, return read data on change.
> > For write, it is explicit set and it does not take effect until the mode is changed
> back to active.
> >
> > > > Commands are cheap in nature. For transport if needed, they can be
> > > > explicit
> > > commands.
> > >
> > > It will be a partial duplication of what is being proposed here.
> >
> > There is always some overlap between management plane (hypervisor
> set/get) and control plane (guest driver get/set).
> > >
> > > Thanks
> > >
> > >
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > If for that is some admin commands are missing,
> > > > > > > > > > > > may be one can add
> > > > > > > > > them.
> > > > > > > > > > >
> > > > > > > > > > > I would then build the device context commands on
> > > > > > > > > > > top of the transport commands/q, then it would be complete.
> > > > > > > > > > >
> > > > > > > > > > > > No need to step on toes of use cases as they are different...
> > > > > > > > > > > >
> > > > > > > > > > > > > I've shown you that
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1) you can't easily say you can pass through all
> > > > > > > > > > > > > the virtio facilities
> > > > > > > > > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > > > > > > > > >
> > > > > > > > > > > > It is not, it is well defined in v3, v2.
> > > > > > > > > > > > One can continue to argue and keep defining the
> > > > > > > > > > > > variant and still call it data
> > > > > > > > > > > path acceleration and then claim it as passthrough ...
> > > > > > > > > > > > But I won't debate this anymore as its just
> > > > > > > > > > > > non-technical aspects of least
> > > > > > > > > > > interest.
> > > > > > > > > > >
> > > > > > > > > > > You use this terminology in the spec which is all
> > > > > > > > > > > about technical, and you think how to define it is a
> > > > > > > > > > > matter of non-technical. This is self-contradictory.
> > > > > > > > > > > If you fail, it probably means it's
> > > > > > > ambiguous.
> > > > > > > > > > > Let's don't use that terminology.
> > > > > > > > > > >
> > > > > > > > > > What it means is described in theory of operation.
> > > > > > > > > >
> > > > > > > > > > > > We have technical tasks and more improved specs to
> > > > > > > > > > > > update going
> > > > > > > > > forward.
> > > > > > > > > > >
> > > > > > > > > > > It's a burden to do the synchronization.
> > > > > > > > > > We have discussed this.
> > > > > > > > > > In current proposed the member device is not
> > > > > > > > > > bifurcated,
> > > > > > > > >
> > > > > > > > > It is. Part of the functions were carried via the PCI
> > > > > > > > > interface, some are carried via owner. You end up with
> > > > > > > > > two drivers to drive the
> > > > > > > devices.
> > > > > > > > >
> > > > > > > > Nop.
> > > > > > > > All admin work of device migration is carried out via the
> > > > > > > > owner
> > > device.
> > > > > > > > All guest triggered work is carried out using VF itself.
> > > > > > >
> > > > > > > Guests don't (or can't) care about how the hypervisor is structured.
> > > > > > For passthrough mode, it just cannot be structured inside the VF.
> > > > >
> > > > > Well, again, we are talking about different things.
> > > > >
> > > > > >
> > > > > > > So we're discussing the view of device, member devices needs
> > > > > > > to server for
> > > > > > >
> > > > > > > 1) request from the transport (it's guest in your context)
> > > > > > > 2) request from the owner
> > > > > >
> > > > > > Doing #2 of the owner on the member device functionality do
> > > > > > not work when
> > > > > hypervisor do not have access to the member device.
> > > > >
> > > > > I don't get here, isn't 2) just what we invent for admin commands?
> > > > > Driver sends commands to the owner, owner forward those requests
> > > > > to the member?
> > > > I am most with the term "driver" without notion of guest/hypervisor
> prefix.
> > > >
> > > > In one model,
> > > > Member device does everything through its native interface =
> > > > virtio config
> > > and device space, cvq, data vqs etc.
> > > > Here member device do not forward anything to its owner.
> > > >
> > > > The live migration hypervisor driver who has the knowledge of live
> > > > migration
> > > flow, accesses the owner device and get the side band member's
> > > information to control it.
> > > > So member driver do not forward anything here to owner driver.
> > > >
> >


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  5:28                                                                                         ` Parav Pandit
@ 2023-11-16  6:23                                                                                           ` Michael S. Tsirkin
  2023-11-16  6:34                                                                                             ` Parav Pandit
  2023-11-21  7:24                                                                                           ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16  6:23 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> You continue to want to overload admin commands for dual purpose, does not make sense to me.

dual -> as a transport and for migration? why can't they be used for
this? I was really hoping to cover these two cases when I proposed them.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  6:23                                                                                           ` Michael S. Tsirkin
@ 2023-11-16  6:34                                                                                             ` Parav Pandit
  2023-11-16  6:38                                                                                               ` Michael S. Tsirkin
  2023-11-21  4:22                                                                                               ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-16  6:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 16, 2023 11:53 AM
> 
> On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > You continue to want to overload admin commands for dual purpose, does
> not make sense to me.
> 
> dual -> as a transport and for migration? why can't they be used for this? I was
> really hoping to cover these two cases when I proposed them.
For the following reasons.

1. Migration needs incremental reads that return only the context that changed between two reads.

2. Migration writes cover a large part of the configuration, not just the virtio common config and device config.
For example, configuration that was done through the CVQ. None of this is needed when the guest does it directly via the member's own CVQ.

For a backward compatible SIOV transport, one may need commands that transport without the above two properties.

3. None of this transport is needed for PFs, VFs and non-backward compatible SIOVs.
Each device should have its own transport that is not intercepted by the hypervisor and should follow the equivalency principle uniformly for all 3 device types.
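
As a rough illustration of properties 1 and 2, the following Python sketch models a migration flow in which the hypervisor forwards delta reads from the source as an opaque byte stream and the destination stages the writes until it is made active. Everything here is an assumption for illustration: the classes Source and Destination, the helper names, the type numbers, and the 2-byte type plus 4-byte length TLV header are made up and are not the layout defined by the patches.

```python
import struct

HDR = "<HI"  # hypothetical TLV header: 2-byte type, 4-byte length, little endian


def encode_tlv(ftype: int, value: bytes) -> bytes:
    return struct.pack(HDR, ftype, len(value)) + value


def decode_tlvs(stream: bytes):
    off = 0
    while off < len(stream):
        ftype, length = struct.unpack_from(HDR, stream, off)
        off += struct.calcsize(HDR)
        yield ftype, stream[off:off + length]
        off += length


class Source:
    """Source member device: returns only what changed since the last read."""
    def __init__(self):
        self.fields = {}  # type -> current value (bytes)
        self._sent = {}   # type -> value already returned

    def read_delta(self) -> bytes:
        out = b""
        for ftype, value in sorted(self.fields.items()):
            if self._sent.get(ftype) != value:
                out += encode_tlv(ftype, value)
                self._sent[ftype] = value
        return out


class Destination:
    """Destination member device: writes are staged until activation."""
    def __init__(self):
        self.staged = {}
        self.live = {}

    def write_context(self, stream: bytes):
        for ftype, value in decode_tlvs(stream):
            self.staged[ftype] = value

    def set_active(self):
        # Written context takes effect only when the mode goes back to active.
        self.live = dict(self.staged)


VQ_CFG, VQ_IDX = 1, 2  # made-up type numbers
src, dst = Source(), Destination()
src.fields[VQ_CFG] = b"cfg"
src.fields[VQ_IDX] = b"\x05"
dst.write_context(src.read_delta())  # first pass carries both TLVs
src.fields[VQ_IDX] = b"\x09"
dst.write_context(src.read_delta())  # second pass carries only the index
dst.set_active()
```

The hypervisor in this sketch never parses the stream; it only moves bytes from read_delta() to write_context(). The VQ configuration crosses once, while the index crosses every time it changes.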


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  6:34                                                                                             ` Parav Pandit
@ 2023-11-16  6:38                                                                                               ` Michael S. Tsirkin
  2023-11-16  6:43                                                                                                 ` Parav Pandit
  2023-11-21  4:22                                                                                               ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16  6:38 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Nov 16, 2023 at 06:34:23AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 16, 2023 11:53 AM
> > 
> > On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > > You continue to want to overload admin commands for dual purpose, does
> > not make sense to me.
> > 
> > dual -> as a transport and for migration? why can't they be used for this? I was
> > really hoping to cover these two cases when I proposed them.
> For following reasons.
> 
> 1. migration needs incremental reads of only changed context between two reads
> 
> 2. migration writes covers large part of the configurations not just virtio common config and device config.
> Such as configuration occurred through the CVQ. All of these is not needed when done from guest directly via member's own CVQ.
> 
> For backward compatible SIOV transport, one may need them to transport without above two properties.
> 
> 3. None of this transport is needed for PFs, VFs and non-backward compatible SIOVs.
> Each device to have its own transport that is not intercepted by the hypervisor and follow the equivalency principle uniformly for all 3 device types.
> 

To clarify. Above seems to justify why the admin commands for migration
must be distinct from admin commands for transport. But I don't see why
(e.g. two sets of) admin commands can not be used for both. Do you?

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  6:38                                                                                               ` Michael S. Tsirkin
@ 2023-11-16  6:43                                                                                                 ` Parav Pandit
  2023-11-16  6:56                                                                                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-16  6:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 16, 2023 12:09 PM
> 
> On Thu, Nov 16, 2023 at 06:34:23AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 16, 2023 11:53 AM
> > >
> > > On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > > > You continue to want to overload admin commands for dual purpose,
> > > > does
> > > not make sense to me.
> > >
> > > dual -> as a transport and for migration? why can't they be used for
> > > this? I was really hoping to cover these two cases when I proposed them.
> > For following reasons.
> >
> > 1. migration needs incremental reads of only changed context between
> > two reads
> >
> > 2. migration writes covers large part of the configurations not just virtio
> common config and device config.
> > Such as configuration occurred through the CVQ. All of these is not needed
> when done from guest directly via member's own CVQ.
> >
> > For backward compatible SIOV transport, one may need them to transport
> without above two properties.
> >
> > 3. None of this transport is needed for PFs, VFs and non-backward compatible
> SIOVs.
> > Each device to have its own transport that is not intercepted by the hypervisor
> and follow the equivalency principle uniformly for all 3 device types.
> >
> 
> To clarify. Above seems to justify why the admin commands for migration must
> be distinct from admin commands for transport. But I don't see why (e.g. two
> sets of) admin commands can not be used for both. Do you?

I didn't follow, "used for both".
Can you please explain?
Both meaning, 
a. for device migration and 
b. for transporting configuration by owner device on behalf of member device?




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  6:43                                                                                                 ` Parav Pandit
@ 2023-11-16  6:56                                                                                                   ` Michael S. Tsirkin
  2023-11-16  7:02                                                                                                     ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16  6:56 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Nov 16, 2023 at 06:43:05AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 16, 2023 12:09 PM
> > 
> > On Thu, Nov 16, 2023 at 06:34:23AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 16, 2023 11:53 AM
> > > >
> > > > On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > > > > You continue to want to overload admin commands for dual purpose,
> > > > > does
> > > > not make sense to me.
> > > >
> > > > dual -> as a transport and for migration? why can't they be used for
> > > > this? I was really hoping to cover these two cases when I proposed them.
> > > For following reasons.
> > >
> > > 1. migration needs incremental reads of only changed context between
> > > two reads
> > >
> > > 2. migration writes covers large part of the configurations not just virtio
> > common config and device config.
> > > Such as configuration occurred through the CVQ. All of these is not needed
> > when done from guest directly via member's own CVQ.
> > >
> > > For backward compatible SIOV transport, one may need them to transport
> > without above two properties.
> > >
> > > 3. None of this transport is needed for PFs, VFs and non-backward compatible
> > SIOVs.
> > > Each device to have its own transport that is not intercepted by the hypervisor
> > and follow the equivalency principle uniformly for all 3 device types.
> > >
> > 
> > To clarify. Above seems to justify why the admin commands for migration must
> > be distinct from admin commands for transport. But I don't see why (e.g. two
> > sets of) admin commands can not be used for both. Do you?
> 
> I didn't follow, "used for both".
> Can you please explain?
> Both meaning, 
> a. for device migration and 
> b. for transporting configuration by owner device on behalf of member device?

Yes, so one set of commands for migration another for passing config
space accesses. We do in fact have admin commands as transport for
legacy, do we not? And in this model we can have new group types,
e.g. SIOV's subfunction or even a "self" group.

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  6:56                                                                                                   ` Michael S. Tsirkin
@ 2023-11-16  7:02                                                                                                     ` Parav Pandit
  2023-11-16  7:14                                                                                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-16  7:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 16, 2023 12:27 PM
> 
> On Thu, Nov 16, 2023 at 06:43:05AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 16, 2023 12:09 PM
> > >
> > > On Thu, Nov 16, 2023 at 06:34:23AM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Thursday, November 16, 2023 11:53 AM
> > > > >
> > > > > On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > > > > > You continue to want to overload admin commands for dual
> > > > > > purpose, does
> > > > > not make sense to me.
> > > > >
> > > > > dual -> as a transport and for migration? why can't they be used
> > > > > for this? I was really hoping to cover these two cases when I proposed
> them.
> > > > For following reasons.
> > > >
> > > > 1. migration needs incremental reads of only changed context
> > > > between two reads
> > > >
> > > > 2. migration writes covers large part of the configurations not
> > > > just virtio
> > > common config and device config.
> > > > Such as configuration occurred through the CVQ. All of these is
> > > > not needed
> > > when done from guest directly via member's own CVQ.
> > > >
> > > > For backward compatible SIOV transport, one may need them to
> > > > transport
> > > without above two properties.
> > > >
> > > > 3. None of this transport is needed for PFs, VFs and non-backward
> > > > compatible
> > > SIOVs.
> > > > Each device to have its own transport that is not intercepted by
> > > > the hypervisor
> > > and follow the equivalency principle uniformly for all 3 device types.
> > > >
> > >
> > > To clarify. Above seems to justify why the admin commands for
> > > migration must be distinct from admin commands for transport. But I
> > > don't see why (e.g. two sets of) admin commands can not be used for both.
> Do you?
> >
> > I didn't follow, "used for both".
> > Can you please explain?
> > Both meaning,
> > a. for device migration and
> > b. for transporting configuration by owner device on behalf of member
> device?
> 
> Yes, so one set of commands for migration another for passing config space
> accesses. We do in fact have admin commands as transport for legacy, do we
> not? And in this model we can have new group types, e.g. SIOV's subfunction or
> even a "self" group.

This is only needed for a backward compatible SIOV device.
And I am not sure whether one should create such a device or not.
In some internal tests we see that the device and platform tend to saturate at various levels beyond a certain scale, which makes building a backward compatible SIOV device not very useful.

For non-backward compatible SIOV devices, PFs, and VFs, all configuration must be done directly from the driver to the device without mediation layers, as listed in #3 above.
Hence, there is really no point in doing a transport VQ for the future.
Each device doing its runtime configuration using its own transport method solves both scale and security uniformly across PF, VF, SIOV.



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  7:02                                                                                                     ` Parav Pandit
@ 2023-11-16  7:14                                                                                                       ` Michael S. Tsirkin
  2023-11-16  9:45                                                                                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16  7:14 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Thu, Nov 16, 2023 at 07:02:23AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 16, 2023 12:27 PM
> > 
> > On Thu, Nov 16, 2023 at 06:43:05AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 16, 2023 12:09 PM
> > > >
> > > > On Thu, Nov 16, 2023 at 06:34:23AM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Thursday, November 16, 2023 11:53 AM
> > > > > >
> > > > > > On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > > > > > > You continue to want to overload admin commands for dual
> > > > > > > purpose, does
> > > > > > not make sense to me.
> > > > > >
> > > > > > dual -> as a transport and for migration? why can't they be used
> > > > > > for this? I was really hoping to cover these two cases when I proposed
> > them.
> > > > > For following reasons.
> > > > >
> > > > > 1. migration needs incremental reads of only changed context
> > > > > between two reads
> > > > >
> > > > > 2. migration writes covers large part of the configurations not
> > > > > just virtio
> > > > common config and device config.
> > > > > Such as configuration occurred through the CVQ. All of these is
> > > > > not needed
> > > > when done from guest directly via member's own CVQ.
> > > > >
> > > > > For backward compatible SIOV transport, one may need them to
> > > > > transport
> > > > without above two properties.
> > > > >
> > > > > 3. None of this transport is needed for PFs, VFs and non-backward
> > > > > compatible
> > > > SIOVs.
> > > > > Each device to have its own transport that is not intercepted by
> > > > > the hypervisor
> > > > and follow the equivalency principle uniformly for all 3 device types.
> > > > >
> > > >
> > > > To clarify. Above seems to justify why the admin commands for
> > > > migration must be distinct from admin commands for transport. But I
> > > > don't see why (e.g. two sets of) admin commands can not be used for both.
> > Do you?
> > >
> > > I didn't follow, "used for both".
> > > Can you please explain?
> > > Both meaning,
> > > a. for device migration and
> > > b. for transporting configuration by owner device on behalf of member
> > device?
> > 
> > Yes, so one set of commands for migration another for passing config space
> > accesses. We do in fact have admin commands as transport for legacy, do we
> > not? And in this model we can have new group types, e.g. SIOV's subfunction or
> > even a "self" group.
> 
> This is only need for backward compatible SIOV device.
> And I am not sure if one should create such or not.
> In some internal test we see device and platform tend to saturate at various levels beyond a certain scale, due to which building backward compatible SIOV is not very useful.

Can we see some reports of all this performance testing you are doing
behind the scenes? It would be really beneficial since everyone
is just discussing things theoretically and here finally someone
did some experiments. This is a strong point in your favor but not
if you just obliquely refer to "certain scale".

> For non-backward compatible SIOV device, PFs, and VFs all the
> configurations must be done directly from the driver to device without
> mediation layers as listed in above #3.

/facepalm. You keep bringing this up.  There's no must here because all
hypervisors have some kind of mediation. Whenever you have some solution
in mind you immediately brand whatever it does not do "not mediation" or
"out of scope" and whatever it does "mediation". The distinction does
not matter.


> Hence, there is really no point in doing transport VQ for future.
> Each device doing its runtime configuration using its own transport method solves scale and security both uniformly across PF, VF, SIOV.

Yes, there's a point.
The point is that low end uses of virtio dwarf the high end scalable
things you are so preoccupied with - understandably as a representative of
a hardware vendor who wants high margins. This is why we need to keep
simple things in config space and this means that yes, we will keep
expanding config space. And this in turn means that if you want
to do things over DMA then you need a way to access config space over DMA.
This way needs to use *some kind of command* and maybe that can be admin
command format.
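
To make "some kind of command" concrete, here is a minimal sketch of what
a config space access command could look like. Everything in it is an
assumption made for illustration: the opcode names, the status values, the
struct layout and the cfg_cmd_execute() helper are hypothetical and are not
taken from the virtio specification or from any patch in this series. The
device side is modeled as a plain byte array, and little-endian encoding of
the fields is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcodes; not taken from the virtio specification. */
enum cfg_opcode { CFG_READ = 0, CFG_WRITE = 1 };

/* Hypothetical status values reported back by the device. */
enum cfg_status { CFG_OK = 0, CFG_EINVAL = 1 };

#define CFG_MAX_DATA 8

/*
 * Hypothetical command layout: the driver places one of these in
 * DMA-able memory instead of touching a config register directly.
 */
struct cfg_access_cmd {
    uint16_t opcode;             /* CFG_READ or CFG_WRITE */
    uint16_t offset;             /* byte offset into config space */
    uint16_t length;             /* number of bytes, at most CFG_MAX_DATA */
    uint16_t status;             /* filled in by the device */
    uint8_t  data[CFG_MAX_DATA]; /* write payload or read result */
};

static_assert(sizeof(struct cfg_access_cmd) == 16, "unexpected padding");

/* Device-side handling, with config space modeled as a byte array. */
static void cfg_cmd_execute(uint8_t *cfg, size_t cfg_size,
                            struct cfg_access_cmd *cmd)
{
    /* Reject accesses that fall outside the payload or config space. */
    if (cmd->length > CFG_MAX_DATA || cmd->offset > cfg_size ||
        cmd->length > cfg_size - cmd->offset) {
        cmd->status = CFG_EINVAL;
        return;
    }

    switch (cmd->opcode) {
    case CFG_READ:
        memcpy(cmd->data, cfg + cmd->offset, cmd->length);
        cmd->status = CFG_OK;
        break;
    case CFG_WRITE:
        memcpy(cfg + cmd->offset, cmd->data, cmd->length);
        cmd->status = CFG_OK;
        break;
    default:
        cmd->status = CFG_EINVAL;
        break;
    }
}
```

Whether such a command would be carried in the admin command format or in
something else is exactly the open question; the sketch only shows that
the payload itself is small.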

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  7:14                                                                                                       ` Michael S. Tsirkin
@ 2023-11-16  9:45                                                                                                         ` Parav Pandit
  0 siblings, 0 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-16  9:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Thursday, November 16, 2023 12:44 PM
> 
> On Thu, Nov 16, 2023 at 07:02:23AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 16, 2023 12:27 PM
> > >
> > > On Thu, Nov 16, 2023 at 06:43:05AM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Thursday, November 16, 2023 12:09 PM
> > > > >
> > > > > On Thu, Nov 16, 2023 at 06:34:23AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Thursday, November 16, 2023 11:53 AM
> > > > > > >
> > > > > > > On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > > > > > > > You continue to want to overload admin commands for dual
> > > > > > > > purpose, does
> > > > > > > not make sense to me.
> > > > > > >
> > > > > > > dual -> as a transport and for migration? why can't they be
> > > > > > > used for this? I was really hoping to cover these two cases
> > > > > > > when I proposed
> > > them.
> > > > > > For following reasons.
> > > > > >
> > > > > > 1. migration needs incremental reads of only changed context
> > > > > > between two reads
> > > > > >
> > > > > > 2. migration writes covers large part of the configurations
> > > > > > not just virtio
> > > > > common config and device config.
> > > > > > Such as configuration occurred through the CVQ. All of these
> > > > > > is not needed
> > > > > when done from guest directly via member's own CVQ.
> > > > > >
> > > > > > For backward compatible SIOV transport, one may need them to
> > > > > > transport
> > > > > without above two properties.
> > > > > >
> > > > > > 3. None of this transport is needed for PFs, VFs and
> > > > > > non-backward compatible
> > > > > SIOVs.
> > > > > > Each device to have its own transport that is not intercepted
> > > > > > by the hypervisor
> > > > > and follow the equivalency principle uniformly for all 3 device types.
> > > > > >
> > > > >
> > > > > To clarify. Above seems to justify why the admin commands for
> > > > > migration must be distinct from admin commands for transport.
> > > > > But I don't see why (e.g. two sets of) admin commands can not be used
> for both.
> > > Do you?
> > > >
> > > > I didn't follow, "used for both".
> > > > Can you please explain?
> > > > Both meaning,
> > > > a. for device migration and
> > > > b. for transporting configuration by owner device on behalf of
> > > > member
> > > device?
> > >
> > > Yes, so one set of commands for migration another for passing config
> > > space accesses. We do in fact have admin commands as transport for
> > > legacy, do we not? And in this model we can have new group types,
> > > e.g. SIOV's subfunction or even a "self" group.
> >
> > This is only need for backward compatible SIOV device.
> > And I am not sure if one should create such or not.
> > In some internal test we see device and platform tend to saturate at various
> levels beyond a certain scale, due to which building backward compatible SIOV
> is not very useful.
> 
> > For non-backward compatible SIOV device, PFs, and VFs all the
> > configurations must be done directly from the driver to device without
> > mediation layers as listed in above #3.
> 
> /facepalm. You keep bringing this up.  There's no must here because all
> hypervisor have some kind of mediation. Whenever you have some solution in
> mind you immediately brand whatever it does not do "not mediation" or "out of
> scope" and whatever it does "mediation". The distinction does not matter.
> 
> 
> > Hence, there is really no point in doing transport VQ for future.
> > Each device doing its runtime configuration using its own transport method
> solves scale and security both uniformly across PF, VF, SIOV.
> 
> Yes, there's a point.
> The point is that low end uses of virtio dwarf the high end scalable things you
> are so preoccupied with - understandably as representative of a hardware
> vendor who wants high margins. This is why we need to keep simple things in
> config space and this means that yes, we will keep expanding config space. And
> this in turn means that if you want to do things over DMA then you need a way
> to access config space over DMA.
> This way needs to use *some kind of command* and maybe that can be admin
> command format.

The low-end uses can always use the command interface anyway, as they do not care about power, speed, hypercalls, scale efficiency and security constructs.
And the sw maintenance cost of this config work is so negligible compared to the rest of the problems that it is less interesting to do config space anymore.

We have debated this many times, Michael.
I also explained in the past that config space over DMA has a provisioning problem: the cloud admin does not know if the VM is old or new, so it ends up reserving things that may never be used.

Hence, endless growth of config space does not help.

The fact is no device is growing the config space anymore.
I will put all the points in a Google doc and share it once, so we don't need to keep debating it over and over.



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-15 17:39                                                                                     ` Parav Pandit
  2023-11-16  4:20                                                                                       ` Jason Wang
@ 2023-11-17 10:08                                                                                       ` Michael S. Tsirkin
  2023-11-17 10:20                                                                                         ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 10:08 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > >
> > > Additionally, if hypervisor has put the trap on virtio config, and
> > > because the memory device already has the interface for virtio config,
> > >
> > > Hypervisor can directly write/read from the virtual config to the member's
> > config space, without going through the device context, right?
> > 
> > If it can do it or it can choose to not. I don't see how it is related to the
> > discussion here.
> >
> It is. I don’t see a point of hypervisor not using the native interface provided by the member device.

So for example, it seems reasonable to have a member supporting both the
existing pci register interface for compatibility and the future
DMA based one for scale. In such a case, it seems possible that
DMA will expose more features than pci. And then a hypervisor
might decide to use that in preference to pci registers.
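
The choice described here could be sketched as below. This is only an
illustration under stated assumptions: the cfg_path enum, the member_caps
struct and the choose_cfg_path() helper are hypothetical names, and the
policy shown (prefer DMA when it covers the wanted features, otherwise
fall back to pci registers) is just one possible hypervisor policy.

```c
#include <stdint.h>

/* Hypothetical identifiers for the two access paths. */
enum cfg_path { CFG_PATH_NONE, CFG_PATH_PCI_REGS, CFG_PATH_DMA };

/* Hypothetical description of what a member offers on each path. */
struct member_caps {
    uint64_t pci_features; /* features reachable through pci registers */
    uint64_t dma_features; /* features reachable over DMA; 0 if absent */
};

/*
 * Pick the path that covers all wanted feature bits. DMA is preferred
 * when the member offers it and it covers the request; pci registers
 * remain the compatibility fallback.
 */
static enum cfg_path choose_cfg_path(const struct member_caps *caps,
                                     uint64_t wanted)
{
    if (caps->dma_features != 0 &&
        (caps->dma_features & wanted) == wanted)
        return CFG_PATH_DMA;

    if ((caps->pci_features & wanted) == wanted)
        return CFG_PATH_PCI_REGS;

    return CFG_PATH_NONE;
}
```

A member that only has the pci register interface keeps working with this
policy, which is the compatibility half of the argument.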



>  > >
> > > >
> > > > >
> > > > > >
> > > > > > > it is not good idea to overload management commands with
> > > > > > > actual run time
> > > > > > guest commands.
> > > > > > > The device context read writes are largely for incremental updates.
> > > > > >
> > > > > > It doesn't matter if it is incremental or not but
> > > > > >
> > > > > It does because you want different functionality only for purpose
> > > > > of backward
> > > > compatibility.
> > > > > That also if the device does not offer them as portion of MMIO BAR.
> > > >
> > > > I don't see how it is related to the "incremental part".
> > > >
> > > > >
> > > > > > 1) the function is there
> > > > > > 2) hypervisor can use that function if they want and virtio
> > > > > > (spec) can't forbid that
> > > > > >
> > > > > It is not about forbidding or supporting.
> > > > > Its about what functionality to use for management plane and guest
> > plane.
> > > > > Both have different needs.
> > > >
> > > > People can have different views, there's nothing we can prevent a
> > > > hypervisor from using it as a transport as far as I can see.
> > > For device context write command, it can be used (or probably abused) to do
> > write but I fail to see why to use it.
> > 
> > The function is there, you can't prevent people from doing that.
> >
> One can always mess up itself. :)
> It is not prevented. It is just not right way to use the interface.
>  
> > > Because member device already has the interface to do config read/write and
> > it is accessible to the hypervisor.
> > 
> > Well, it looks self-contradictory again. Are you saying another set of commands
> > that is similar to device context is needed for non-PCI transport?
> >
> All these non pci transport discussion is just meaning less.
> Let MMIO bring the concept of member device at that point something make sense to discuss.
> PCI SIOV is also the PCI device at the end.
>  
> > >
> > > The read as_is using device context cannot be done because the caller is not
> > explicitly asking what to read.
> > > And the interface does not have it, because member device has it.
> > >
> > > So lets find the need if incremental bit is needed in the device_Context read
> > command or not or a bits to ask explicitly what to read optionally.
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > For VF driver it has own direct channel via its own BAR to
> > > > > > > talk to the
> > > > device.
> > > > > > So no need to transport via PF.
> > > > > > > For SIOV for backward compat vPCI composition, it may be needed.
> > > > > > > Hard to say, if that can be memory mapped as well on the BAR of the
> > PF.
> > > > > > > We have seen one device supporting it outside of the virtio.
> > > > > > > For scale anyway, one needs to use the device own cvq for
> > > > > > > complex
> > > > > > configuration.
> > > > > >
> > > > > > That's the idea but I meant your current proposal overlaps those
> > functions.
> > > > > >
> > > > > Not really. One can have simple virtio config space access
> > > > > read/write
> > > > functionality, in addition to what is done here.
> > > > > And that is still fine. One is doing proxying for guest.
> > > > > Management plane is doing more than just register proxy.
> > > >
> > > > See above, let's figure out whether it is possible as a transport first then.
> > > >
> > > Right. lets figure out.
> > >
> > > I would still promote to not mix management command with transport
> > command.
> > 
> > It's not a mixing, it's just because they are functional equivalents.
> > 
> It is not.
> I clarified the fundamental difference between the two.
> One is explicit read and write.
> Other is, return read data on change.
> For write, it is explicit set and it does not take effect until the mode is changed back to active.
> 
> > > Commands are cheap in nature. For transport if needed, they can be explicit
> > commands.
> > 
> > It will be a partial duplication of what is being proposed here.
> 
> There is always some overlap between management plane (hypervisor set/get) and control plane (guest driver get/set).
> > 
> > Thanks
> > 
> > 
> > 
> > >
> > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > If for that is some admin commands are missing, may be
> > > > > > > > > > > one can add
> > > > > > > > them.
> > > > > > > > > >
> > > > > > > > > > I would then build the device context commands on top of
> > > > > > > > > > the transport commands/q, then it would be complete.
> > > > > > > > > >
> > > > > > > > > > > No need to step on toes of use cases as they are different...
> > > > > > > > > > >
> > > > > > > > > > > > I've shown you that
> > > > > > > > > > > >
> > > > > > > > > > > > 1) you can't easily say you can pass through all the
> > > > > > > > > > > > virtio facilities
> > > > > > > > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > > > > > > > >
> > > > > > > > > > > It is not, it is well defined in v3, v2.
> > > > > > > > > > > One can continue to argue and keep defining the
> > > > > > > > > > > variant and still call it data
> > > > > > > > > > path acceleration and then claim it as passthrough ...
> > > > > > > > > > > But I won't debate this anymore as its just
> > > > > > > > > > > non-technical aspects of least
> > > > > > > > > > interest.
> > > > > > > > > >
> > > > > > > > > > You use this terminology in the spec which is all about
> > > > > > > > > > technical, and you think how to define it is a matter of
> > > > > > > > > > non-technical. This is self-contradictory. If you fail,
> > > > > > > > > > it probably means it's
> > > > > > ambiguous.
> > > > > > > > > > Let's don't use that terminology.
> > > > > > > > > >
> > > > > > > > > What it means is described in theory of operation.
> > > > > > > > >
> > > > > > > > > > > We have technical tasks and more improved specs to
> > > > > > > > > > > update going
> > > > > > > > forward.
> > > > > > > > > >
> > > > > > > > > > It's a burden to do the synchronization.
> > > > > > > > > We have discussed this.
> > > > > > > > > In current proposed the member device is not bifurcated,
> > > > > > > >
> > > > > > > > It is. Part of the functions were carried via the PCI
> > > > > > > > interface, some are carried via owner. You end up with two
> > > > > > > > drivers to drive the
> > > > > > devices.
> > > > > > > >
> > > > > > > Nop.
> > > > > > > All admin work of device migration is carried out via the owner
> > device.
> > > > > > > All guest triggered work is carried out using VF itself.
> > > > > >
> > > > > > Guests don't (or can't) care about how the hypervisor is structured.
> > > > > For passthrough mode, it just cannot be structured inside the VF.
> > > >
> > > > Well, again, we are talking about different things.
> > > >
> > > > >
> > > > > > So we're discussing the view of device, member devices needs to
> > > > > > server for
> > > > > >
> > > > > > 1) request from the transport (it's guest in your context)
> > > > > > 2) request from the owner
> > > > >
> > > > > Doing #2 of the owner on the member device functionality do not
> > > > > work when
> > > > hypervisor do not have access to the member device.
> > > >
> > > > I don't get here, isn't 2) just what we invent for admin commands?
> > > > Driver sends commands to the owner, owner forward those requests to
> > > > the member?
> > > I am most with the term "driver" without notion of guest/hypervisor prefix.
> > >
> > > In one model,
> > > Member device does everything through its native interface = virtio config
> > and device space, cvq, data vqs etc.
> > > Here member device do not forward anything to its owner.
> > >
> > > The live migration hypervisor driver who has the knowledge of live migration
> > flow, accesses the owner device and get the side band member's information to
> > control it.
> > > So member driver do not forward anything here to owner driver.
> > >
> 




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 10:08                                                                                       ` Michael S. Tsirkin
@ 2023-11-17 10:20                                                                                         ` Parav Pandit
  2023-11-17 11:11                                                                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-17 10:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 3:38 PM
> 
> On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > >
> > > > Additionally, if hypervisor has put the trap on virtio config, and
> > > > because the memory device already has the interface for virtio
> > > > config,
> > > >
> > > > Hypervisor can directly write/read from the virtual config to the
> > > > member's
> > > config space, without going through the device context, right?
> > >
> > > If it can do it or it can choose to not. I don't see how it is
> > > related to the discussion here.
> > >
> > It is. I don’t see a point of hypervisor not using the native interface provided
> by the member device.
> 
> So for example, it seems reasonable to a member supporting both existing pci
> register interface for compatibility and the future DMA based one for scale. In
> such a case, it seems possible that DMA will expose more features than pci. And
> then a hypervisor might decide to use that in preference to pci registers.

We don’t find it right to involve the owner device for mediation at the current scale, and we do not want such a design to break the upcoming TDISP efforts.
And for future scale, having a new SIOV interface, which has its own direct interface to the device, makes more sense.

I finally captured all past discussions in the form of a FAQ at [1].

[1] https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 10:20                                                                                         ` Parav Pandit
@ 2023-11-17 11:11                                                                                           ` Michael S. Tsirkin
  2023-11-17 11:20                                                                                             ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 11:11 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 3:38 PM
> > 
> > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > >
> > > > > Additionally, if hypervisor has put the trap on virtio config, and
> > > > > because the memory device already has the interface for virtio
> > > > > config,
> > > > >
> > > > > Hypervisor can directly write/read from the virtual config to the
> > > > > member's
> > > > config space, without going through the device context, right?
> > > >
> > > > If it can do it or it can choose to not. I don't see how it is
> > > > related to the discussion here.
> > > >
> > > It is. I don’t see a point of hypervisor not using the native interface provided
> > by the member device.
> > 
> > So for example, it seems reasonable to a member supporting both existing pci
> > register interface for compatibility and the future DMA based one for scale. In
> > such a case, it seems possible that DMA will expose more features than pci. And
> > then a hypervisor might decide to use that in preference to pci registers.
> 
> We don’t find it right to involve owner device for mediating at
> current scale

In this model, device will be its own owner. Should not be a problem.

> and to not break TDISP efforts in upcoming time by such
> design.

Look you either stop mentioning TDISP as motivation or actually
try to address it. Safe migration with TDISP is really hard.
For example, your current patches are clearly broken for TDISP:
owner can control queue state at any time making device modify
memory in any way it wants.

> And for future scale, having new SIOV interface makes more sense which has its own direct interface to device.
> 
> I finally captured all past discussions in form of a FAQ at [1].
> 
> [1] https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing

Yea skimmed that, "Cons: None". Are you 100% sure? Anyway, discussion
will take place on the mailing list please. 

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 11:11                                                                                           ` Michael S. Tsirkin
@ 2023-11-17 11:20                                                                                             ` Parav Pandit
  2023-11-17 11:43                                                                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-17 11:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 4:41 PM
> 
> On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 3:38 PM
> > >
> > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > Additionally, if hypervisor has put the trap on virtio config,
> > > > > > and because the memory device already has the interface for
> > > > > > virtio config,
> > > > > >
> > > > > > Hypervisor can directly write/read from the virtual config to
> > > > > > the member's
> > > > > config space, without going through the device context, right?
> > > > >
> > > > > If it can do it or it can choose to not. I don't see how it is
> > > > > related to the discussion here.
> > > > >
> > > > It is. I don’t see a point of hypervisor not using the native
> > > > interface provided
> > > by the member device.
> > >
> > > So for example, it seems reasonable to a member supporting both
> > > existing pci register interface for compatibility and the future DMA
> > > based one for scale. In such a case, it seems possible that DMA will
> > > expose more features than pci. And then a hypervisor might decide to use
> that in preference to pci registers.
> >
> > We don’t find it right to involve owner device for mediating at
> > current scale
> 
> In this model, device will be its own owner. Should not be a problem.
>
I didn’t understand the above comment.
 
> > and to not break TDISP efforts in upcoming time by such design.
> 
> Look you either stop mentioning TDISP as motivation or actually try to address
> it. Safe migration with TDISP is really hard.
But the absence of TDISP migration today is not an excuse to involve the owner device for config space access.
That adds another hurdle that moves us further away from TDISP.
Hence, we don’t want to take the route of involving the owner device for any config access.

> For example, your current patches are clearly broken for TDISP:
> owner can control queue state at any time making device modify memory in
> any way it wants.
>
When TDISP migration is needed, the admin device can be another TVM outside the HV scope.
An alternative would be to have the device context encrypted so that it is not visible to the HV at all.
Such encryption is not possible with the trap+emulation method, where the HV has to decrypt the data coming over MMIO writes.
 
> > And for future scale, having new SIOV interface makes more sense which has
> its own direct interface to device.
> >
> > I finally captured all past discussions in form of a FAQ at [1].
> >
> > [1]
> > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> 
> Yea skimmed that, "Cons: None". Are you 100% sure? Anyway, discussion will
> take place on the mailing list please.

We cannot keep discussing the register interface every week.
I remember we have discussed this many times already in following series.

1. legacy series
2. tvq v4 series
3. dynamic vq creation series
4. again during suspend series under tvq head
5. right now
6. Maybe more that I forgot.

I captured all the directions and options in the doc. One can refer to it when those questions arise.
If we don’t work cohesively, repeating the same reasoning does not help.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 11:20                                                                                             ` Parav Pandit
@ 2023-11-17 11:43                                                                                               ` Michael S. Tsirkin
  2023-11-17 12:02                                                                                                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 11:43 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 4:41 PM
> > 
> > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 3:38 PM
> > > >
> > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > > Additionally, if hypervisor has put the trap on virtio config,
> > > > > > > and because the memory device already has the interface for
> > > > > > > virtio config,
> > > > > > >
> > > > > > > Hypervisor can directly write/read from the virtual config to
> > > > > > > the member's
> > > > > > config space, without going through the device context, right?
> > > > > >
> > > > > > If it can do it or it can choose to not. I don't see how it is
> > > > > > related to the discussion here.
> > > > > >
> > > > > It is. I don’t see a point of hypervisor not using the native
> > > > > interface provided
> > > > by the member device.
> > > >
> > > > So for example, it seems reasonable to a member supporting both
> > > > existing pci register interface for compatibility and the future DMA
> > > > based one for scale. In such a case, it seems possible that DMA will
> > > > expose more features than pci. And then a hypervisor might decide to use
> > that in preference to pci registers.
> > >
> > > We don’t find it right to involve owner device for mediating at
> > > current scale
> > 
> > In this model, device will be its own owner. Should not be a problem.
> >
> I didn’t understand above comment.

We'd add a new group type "self". You can then send admin commands
through VF itself not through PF.
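
To make the "self" group idea concrete, here is a minimal sketch in C. It
assumes a simplified admin command header modeled on the existing
opcode/group_type/group_member_id layout. The "self" group type value and
both helper functions are hypothetical names invented for illustration;
nothing here is defined by the spec or by these patches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Existing group type from the virtio admin command definitions. */
#define VIRTIO_ADMIN_GROUP_TYPE_SRIOV 0x1
/* HYPOTHETICAL value for the "self" group discussed above; not in the spec. */
#define HYP_ADMIN_GROUP_TYPE_SELF 0xfffe

/* Simplified admin command header; real fields are little-endian on the wire. */
struct admin_cmd_hdr {
    uint16_t opcode;
    uint16_t group_type;
    uint8_t  reserved1[12];
    uint64_t group_member_id;
};

/* Which function's admin queue would carry the command. */
enum admin_target { TARGET_OWNER_PF, TARGET_VF_ITSELF };

/* Build a header; the opcode is passed through untouched (hypothetical helper). */
static struct admin_cmd_hdr make_admin_cmd(uint16_t opcode, uint16_t group_type,
                                           uint64_t vf_number)
{
    struct admin_cmd_hdr hdr;

    /* 2 + 2 + 12 + 8 bytes, no padding expected on common ABIs. */
    assert(sizeof(hdr) == 24);
    memset(&hdr, 0, sizeof(hdr));
    hdr.opcode = opcode;
    hdr.group_type = group_type;
    /* With "self" there is no other member to name, so the id stays 0. */
    if (group_type != HYP_ADMIN_GROUP_TYPE_SELF)
        hdr.group_member_id = vf_number;
    return hdr;
}

/* SR-IOV group: command goes to the owner PF. "self": to the VF itself. */
static enum admin_target admin_cmd_target(const struct admin_cmd_hdr *hdr)
{
    return hdr->group_type == HYP_ADMIN_GROUP_TYPE_SELF ? TARGET_VF_ITSELF
                                                        : TARGET_OWNER_PF;
}
```

The only difference between the two models in this sketch is which
function receives the command and whether a member id is needed.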


> > > and to not break TDISP efforts in upcoming time by such design.
> > 
> > Look you either stop mentioning TDISP as motivation or actually try to address
> > it. Safe migration with TDISP is really hard.
> But that is not an excuse to say that TDISP migration is not present, hence involve the owner device for config space access.
> This is another hurdle added that further blocks us away from TDISP.
> Hence, we don’t want to take the route of involving owner device for any config access.

This "blocks" is all just wild hunches. hypervisor controls some
aspects of TDISP devices for sure - maybe we actually should use
pci config space as that is generally hypervisor controlled.

> > For example, your current patches are clearly broken for TDISP:
> > owner can control queue state at any time making device modify memory in
> > any way it wants.
> >
> When TDISP migration is needed, the admin device can be another TVM outside the HV scope.
> Or an alternative would have device context encrypted not visible to HV at all.

Maybe. Fact remains your patches do conflict with TDISP and you seem to
be fine with it because you have a hunch you can fix it. But we can't
do development based on your hunches.


> Such encryption is not possible, with the trap+emulation method, where HV will have to decrypt the data coming over MMIO writes.

I don't know what trap+emulation has to do with it. Do you refer to the
shadow vq thing? I am guessing modern platforms with TDISP support are
likely to also support dirty bit in the IOMMU.


> > > And for future scale, having new SIOV interface makes more sense which has
> > its own direct interface to device.
> > >
> > > I finally captured all past discussions in form of a FAQ at [1].
> > >
> > > [1]
> > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > 
> > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway, discussion will
> > take place on the mailing list please.
> 
> We cannot keep discussing the register interface every week.
> I remember we have discussed this many times already in following series.
> 
> 1. legacy series
> 2. tvq v4 series
> 3. dynamic vq creation series
> 4. again during suspend series under tvq head
> 5. right now
> 6. May be more that I forgot.
> 
> I captured all the direction and options in the doc. One can refer when those questions arise there.
> If we don’t work cohesively same reasoning repetition does not help.

It's still the same too, doc or no doc. You want to build a device
without registers fine but don't force it down everyone's throat. And
now with 8MBytes of on-device memory that's needed for migration and
that's apparently fine I am even less interested in saving 256 bytes for
config space.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 11:43                                                                                               ` Michael S. Tsirkin
@ 2023-11-17 12:02                                                                                                 ` Parav Pandit
  2023-11-17 12:30                                                                                                   ` Michael S. Tsirkin
  2023-11-21  5:25                                                                                                   ` Jason Wang
  0 siblings, 2 replies; 341+ messages in thread
From: Parav Pandit @ 2023-11-17 12:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 5:13 PM
> 
> On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 4:41 PM
> > >
> > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > >
> > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > > Additionally, if hypervisor has put the trap on virtio
> > > > > > > > config, and because the memory device already has the
> > > > > > > > interface for virtio config,
> > > > > > > >
> > > > > > > > Hypervisor can directly write/read from the virtual config
> > > > > > > > to the member's
> > > > > > > config space, without going through the device context, right?
> > > > > > >
> > > > > > > If it can do it or it can choose to not. I don't see how it
> > > > > > > is related to the discussion here.
> > > > > > >
> > > > > > It is. I don’t see a point of hypervisor not using the native
> > > > > > interface provided
> > > > > by the member device.
> > > > >
> > > > > So for example, it seems reasonable to a member supporting both
> > > > > existing pci register interface for compatibility and the future
> > > > > DMA based one for scale. In such a case, it seems possible that
> > > > > DMA will expose more features than pci. And then a hypervisor
> > > > > might decide to use
> > > that in preference to pci registers.
> > > >
> > > > We don’t find it right to involve owner device for mediating at
> > > > current scale
> > >
> > > In this model, device will be its own owner. Should not be a problem.
> > >
> > I didn’t understand above comment.
> 
> We'd add a new group type "self". You can then send admin commands through
> VF itself not through PF.
>
How? The device is owned by the guest. FLR and device reset cannot send the admin command reliably.
 
> 
> > > > and to not break TDISP efforts in upcoming time by such design.
> > >
> > > Look you either stop mentioning TDISP as motivation or actually try
> > > to address it. Safe migration with TDISP is really hard.
> > But that is not an excuse to say that TDISP migration is not present, hence
> involve the owner device for config space access.
> > This is another hurdle added that further blocks us away from TDISP.
> > Hence, we don’t want to take the route of involving owner device for any
> config access.
> 
> This "blocks" is all just wild hunches. hypervisor controls some aspects of TDISP
> devices for sure - maybe we actually should use pci config space as that is
> generally hypervisor controlled.
Even bad to do hypercalls.
I showed you last time the role of the PCI config space snippet from the spec.
Do you see we are repeating the discussion again?

> 
> > > For example, your current patches are clearly broken for TDISP:
> > > owner can control queue state at any time making device modify
> > > memory in any way it wants.
> > >
> > When TDISP migration is needed, the admin device can be another TVM
> outside the HV scope.
> > Or an alternative would have device context encrypted not visible to HV at all.
> 
> Maybe. Fact remains your patches do conflict with TDISP and you seem to be
> fine with it because you have a hunch you can fix it. But we can't do
> development based on your hunches.
> 
We have a different view.
My patches do not conflict with TDISP because TDISP has a clear definition of not involving the hypervisor for transport.
And that part is still preserved.
Delegating the migration to another TDISP or encrypting is yet to be defined.
And the current patches will align with both approaches in the future.

So you need to re-evaluate your judgment.

> 
> > Such encryption is not possible, with the trap+emulation method, where HV
> will have to decrypt the data coming over MMIO writes.
> 
> I don't know what trap+emulation has to do with it. Do you refer to the shadow
> vq thing? 

The method proposed here does not hinder any TDISP direction.

Without my proposal, do you have a method that does not involve hypervisor intervention for virtio common and device config space, cvq and shadow vq?
If so, I would like to hear that as well because that will align with TDISP.

> I am guessing modern platforms with TDISP support are likely to also
> support dirty bit in the IOMMU.
> 
It will be some day.

> 
> > > > And for future scale, having new SIOV interface makes more sense
> > > > which has
> > > its own direct interface to device.
> > > >
> > > > I finally captured all past discussions in form of a FAQ at [1].
> > > >
> > > > [1]
> > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > >
> > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > discussion will take place on the mailing list please.
> >
> > We cannot keep discussing the register interface every week.
> > I remember we have discussed this many times already in following series.
> >
> > 1. legacy series
> > 2. tvq v4 series
> > 3. dynamic vq creation series
> > 4. again during suspend series under tvq head
> > 5. right now
> > 6. May be more that I forgot.
> >
> > I captured all the direction and options in the doc. One can refer when those
> questions arise there.
> > If we don’t work cohesively same reasoning repetition does not help.
> 
> It's still the same too, doc or no doc. You want to build a device without
> registers fine but don't force it down everyone's throat. 
I don’t see any compelling reason for inventing a new method, really.
Nor for continuing in register mode.
Virtio already has VQ.
If CVQ is so problematic, one should put everything on registers and not run on double standards.

I captured all the reasoning and thoughts. I don’t have much to say in support of infinite register scale.

People who want to push SIOV do not show a single performance reason for why SIOV should be done.
I have upstreamed SIOVs in Linux as SFs without PASID, and in all our scale tests, before the device chokes, the system chokes.

So when someone pushes the SIOV series, I will be the first one interested in reading the performance numbers to proceed with patches.

> And now with 8MBytes
> of on-device memory that's needed for migration and that's apparently fine I
> am even less interested in saving 256 bytes for config space.

Again, not the right comparison. When and how to use 256 matters.
I haven’t come across any device that prefers infinite register scale.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 12:02                                                                                                 ` Parav Pandit
@ 2023-11-17 12:30                                                                                                   ` Michael S. Tsirkin
  2023-11-17 12:46                                                                                                     ` Parav Pandit
  2023-11-21  5:25                                                                                                   ` Jason Wang
  1 sibling, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 12:30 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 5:13 PM
> > 
> > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 4:41 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > >
> > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > > Additionally, if hypervisor has put the trap on virtio
> > > > > > > > > config, and because the memory device already has the
> > > > > > > > > interface for virtio config,
> > > > > > > > >
> > > > > > > > > Hypervisor can directly write/read from the virtual config
> > > > > > > > > to the member's
> > > > > > > > config space, without going through the device context, right?
> > > > > > > >
> > > > > > > > If it can do it or it can choose to not. I don't see how it
> > > > > > > > is related to the discussion here.
> > > > > > > >
> > > > > > > It is. I don’t see a point of hypervisor not using the native
> > > > > > > interface provided
> > > > > > by the member device.
> > > > > >
> > > > > > So for example, it seems reasonable to a member supporting both
> > > > > > existing pci register interface for compatibility and the future
> > > > > > DMA based one for scale. In such a case, it seems possible that
> > > > > > DMA will expose more features than pci. And then a hypervisor
> > > > > > might decide to use
> > > > that in preference to pci registers.
> > > > >
> > > > > We don’t find it right to involve owner device for mediating at
> > > > > current scale
> > > >
> > > > In this model, device will be its own owner. Should not be a problem.
> > > >
> > > I didn’t understand above comment.
> > 
> > We'd add a new group type "self". You can then send admin commands through
> > VF itself not through PF.
> >
> How? The device is owned by the guest. FLR and device reset cannot send the admin command reliably.

It's of the "it hurts when I do this - don't do this then" category.


> > 
> > > > > and to not break TDISP efforts in upcoming time by such design.
> > > >
> > > > Look you either stop mentioning TDISP as motivation or actually try
> > > > to address it. Safe migration with TDISP is really hard.
> > > But that is not an excuse to say that TDISP migration is not present, hence
> > involve the owner device for config space access.
> > > This is another hurdle added that further blocks us away from TDISP.
> > > Hence, we don’t want to take the route of involving owner device for any
> > config access.
> > 
> > This "blocks" is all just wild hunches. hypervisor controls some aspects of TDISP
> > devices for sure - maybe we actually should use pci config space as that is
> > generally hypervisor controlled.
> Even bad to do hypercalls.
> I showed you last time the role of the PCI config space snippet from the spec.

Yes I remember. This is just an example though. My point is maybe it is
solvable maybe it is not.

> Do you see we are repeating the discussion again?

One of the reasons is that people bring up irrelevances. TDISP is
important but has to be addressed or deferred not vaguely referred to.

> > 
> > > > For example, your current patches are clearly broken for TDISP:
> > > > owner can control queue state at any time making device modify
> > > > memory in any way it wants.
> > > >
> > > When TDISP migration is needed, the admin device can be another TVM
> > outside the HV scope.
> > > Or an alternative would have device context encrypted not visible to HV at all.
> > 
> > Maybe. Fact remains your patches do conflict with TDISP and you seem to be
> > fine with it because you have a hunch you can fix it. But we can't do
> > development based on your hunches.
> > 
> We have different view.
> My patches do not conflict with TDISP because TDISP has clear definition of not involving hypervisor for transport.
> And that part is still preserved.
> Delegating the migration to another TDISP or encrypting is yet to be defined.
> And current patches will align to both the approaches in future.
> 
> So you need to re-evaluate your judgment.

If you like they do not "conflict".  But if used with TDISP they just
make it insecure and thus completely worthless.  If hypervisor can
change ring state to make device poke at random guest memory then it
is game over and all the effort spent was security theater.
But you know this, don't you? This is why you mentioned encrypting
device. Maybe that works. It just does not work *as is*.


> > 
> > > Such encryption is not possible, with the trap+emulation method, where HV
> > will have to decrypt the data coming over MMIO writes.
> > 
> > I don't know what trap+emulation has to do with it. Do you refer to the shadow
> > vq thing? 
> 
> The method proposed here does not hinder any TDISP direction.

direction? No, why would it. we can always add more commands that are
safe for TDISP. commands you propose here are unsafe for TDISP.

> Without my proposal, do you have a method that does not involve hypervisor intervention for virtio common and device config space, cvq and shadow vq?
> If so, I would like to hear that as well because that will align with TDISP.

I really did not give it much thought.  I suspect for TDISP it just
might be cleaner to have guest agent migrate device. Certainly removes
all the messy questions. That, to me, implies there needs to be a way to
send migration commands through VF itself. Does this
"involve hypervisor intervention"? No one should care I think.


> > I am guessing modern platforms with TDISP support are likely to also
> > support dirty bit in the IOMMU.
> > 
> It will be some day.

What does this mean? Which platforms support TDISP and which IOMMUs
do they use?

> > 
> > > > > And for future scale, having new SIOV interface makes more sense
> > > > > which has
> > > > its own direct interface to device.
> > > > >
> > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > >
> > > > > [1]
> > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > > >
> > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > discussion will take place on the mailing list please.
> > >
> > > We cannot keep discussing the register interface every week.
> > > I remember we have discussed this many times already in following series.
> > >
> > > 1. legacy series
> > > 2. tvq v4 series
> > > 3. dynamic vq creation series
> > > 4. again during suspend series under tvq head
> > > 5. right now
> > > 6. May be more that I forgot.
> > >
> > > I captured all the direction and options in the doc. One can refer when those
> > questions arise there.
> > > If we don’t work cohesively same reasoning repetition does not help.
> > 
> > It's still the same too, doc or no doc. You want to build a device without
> > registers fine but don't force it down everyone's throat. 
> I don’t see any compelling reason for inventing new method really.
> Nor continuing in register mode.
> Virtio already has VQ.
> If CVQ is so problematic, one should put everything on registers and not run on double standards.

We should not and neither should we put everything behind a VQ.


> I captured all the reasoning and thoughts. I don’t have much to say in support of infinite register scale.
> 
> People who want to push SIOV do not show a single performance reason for why SIOV should be done.
> I have upstreamed SIOVs in Linux as SFs without PASID, and in all our scale tests, before the device chokes, the system chokes.
> 
> So when someone pushes the SIOV series, I will be the first one interested in reading the performance numbers to proceed with patches.
> 
> > And now with 8MBytes
> > of on-device memory that's needed for migration and that's apparently fine I
> > am even less interested in saving 256 bytes for config space.
> 
> Again, not the right comparison. When and how to use 256 matters.
> I haven’t come across any device that prefers infinite register scale.

Why resort to hyperbole? 256 bytes is pretty far from infinite.  But
again, if you don't want it in registers just add an option to move
*all* of config space out of registers. cheaper devices will require
newer guests.  Or, 10 years will pass and you will be able to drop
compat with old guests. I know it's too long a game for you to care but
I've been virtio spec editor for more than 10 years so to me it seems
reasonable to plan like that.

-- 
MST


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 12:30                                                                                                   ` Michael S. Tsirkin
@ 2023-11-17 12:46                                                                                                     ` Parav Pandit
  2023-11-17 13:54                                                                                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-17 12:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 6:00 PM
> To: Parav Pandit <parav@nvidia.com>
> 
> On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 5:13 PM
> > >
> > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > >
> > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > > > >
> > > > > > > > > > Additionally, if hypervisor has put the trap on virtio
> > > > > > > > > > config, and because the memory device already has the
> > > > > > > > > > interface for virtio config,
> > > > > > > > > >
> > > > > > > > > > Hypervisor can directly write/read from the virtual
> > > > > > > > > > config to the member's
> > > > > > > > > config space, without going through the device context, right?
> > > > > > > > >
> > > > > > > > > If it can do it or it can choose to not. I don't see how
> > > > > > > > > it is related to the discussion here.
> > > > > > > > >
> > > > > > > > It is. I don’t see a point of hypervisor not using the
> > > > > > > > native interface provided
> > > > > > > by the member device.
> > > > > > >
> > > > > > > So for example, it seems reasonable to a member supporting
> > > > > > > both existing pci register interface for compatibility and
> > > > > > > the future DMA based one for scale. In such a case, it seems
> > > > > > > possible that DMA will expose more features than pci. And
> > > > > > > then a hypervisor might decide to use
> > > > > that in preference to pci registers.
> > > > > >
> > > > > > We don’t find it right to involve owner device for mediating
> > > > > > at current scale
> > > > >
> > > > > In this model, device will be its own owner. Should not be a problem.
> > > > >
> > > > I didn’t understand above comment.
> > >
> > > We'd add a new group type "self". You can then send admin commands
> > > through VF itself not through PF.
> > >
> > How? The device is owned by the guest. FLR and device reset cannot send the
> admin command reliably.
> 
> It's of the "it hurts when I do this - don't do this then" category.
>
It is the "don’t do mediation" category, yes, due to all this weirdness that has been asked.
 
> 
> > >
> > > > > > and to not break TDISP efforts in upcoming time by such design.
> > > > >
> > > > > Look you either stop mentioning TDISP as motivation or actually
> > > > > try to address it. Safe migration with TDISP is really hard.
> > > > But that is not an excuse to say that TDISP migration is not
> > > > present, hence
> > > involve the owner device for config space access.
> > > > This is another hurdle added that further blocks us away from TDISP.
> > > > Hence, we don’t want to take the route of involving owner device
> > > > for any
> > > config access.
> > >
> > > This "blocks" is all just wild hunches. hypervisor controls some
> > > aspects of TDISP devices for sure - maybe we actually should use pci
> > > config space as that is generally hypervisor controlled.
> > Even bad to do hypercalls.
> > I showed you last time the role of the PCI config space snippet from the spec.
> 
> Yes I remember. This is just an example though. My point is maybe it is solvable
> maybe it is not.
> 
> > Do you see we are repeating the discussion again?
> 
> One of the reasons is that people bring up irrelevances. TDISP is important but
> has to be addressed or deferred not vaguely referred to.

So let's continue to follow the current TDISP direction of not involving the hypervisor for virtio common and device config.

> 
> > >
> > > > > For example, your current patches are clearly broken for TDISP:
> > > > > owner can control queue state at any time making device modify
> > > > > memory in any way it wants.
> > > > >
> > > > When TDISP migration is needed, the admin device can be another
> > > > TVM
> > > outside the HV scope.
> > > > Or an alternative would have device context encrypted not visible to HV at
> all.
> > >
> > > Maybe. Fact remains your patches do conflict with TDISP and you seem
> > > to be fine with it because you have a hunch you can fix it. But we
> > > can't do development based on your hunches.
> > >
> > We have different view.
> > My patches do not conflict with TDISP because TDISP has clear definition of
> not involving hypervisor for transport.
> > And that part is still preserved.
> > Delegating the migration to another TDISP or encrypting is yet to be defined.
> > And current patches will align to both the approaches in future.
> >
> > So you need to re-evaluate your judgment.
> 
> If you like they do not "conflict".  But if used with TDISP they just make it
> insecure and thus completely worthless.  If hypervisor can change ring state to
> make device poke at random guest memory then it is game over and all the
> effort spent was security theater.
Not really, I proposed two options.
1. delegate the task of LM to the TVM. (proposed by two cpu vendors).
In this case all the infra we build here, just works fine.
It also does not require any hypervisor mediation for control plane.

2. Encrypt the owner device workload to be not seen by hypervisor

Neither method affects the current direction.

But if we force trap+emulation, it is 100% broken for TDISP.
And I would not promote that.

> But you know this, don't you? This is why you mentioned encrypting device.
> Maybe that works. It just does not work *as is*.
It works as_is. But current infrastructure does not block the future work.

> 
> 
> > >
> > > > Such encryption is not possible, with the trap+emulation method,
> > > > where HV
> > > will have to decrypt the data coming over MMIO writes.
> > >
> > > I don't know what trap+emulation has to do with it. Do you refer to
> > > the shadow vq thing?
> >
> > The method proposed here does not hinder any TDISP direction.
> 
> direction? No, why would it. we can always add more commands that are safe
> for TDISP. commands you propose here are unsafe for TDISP.
> 
> > Without my proposal, do you have a method that does not involve hypervisor
> > intervention for virtio common and device config space, cvq and shadow vq?
> > If so, I would like to hear that as well because that will align with TDISP.
> 	
> I really did not give it much thought.  I suspect for TDISP it just might be cleaner
> to have guest agent migrate device. Certainly removes all the messy questions.
> That, to me impliest there needs to be a way to send migration commands
> through VF itself. Does this "involve hypervisor intervention"? No one should
> care I think.
That is too far in the future to envision. Maybe yes. When such a platform is built, whoever migrates will surely need to migrate its device side too.
Some knowledge of the migration driver is needed.

> 
> 
> > > I am guessing modern platforms with TDISP support are likely to also
> > > support dirty bit in the IOMMU.
> > >
> > It will be some day.
> 
> What does this mean? Which platforms support TDISP and which IOMMUs do
> they use?
I said it will be some day, not right now.

> 
> > >
> > > > > > And for future scale, having new SIOV interface makes more
> > > > > > sense which has
> > > > > its own direct interface to device.
> > > > > >
> > > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > > >
> > > > > > [1]
> > > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > > > >
> > > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > > discussion will take place on the mailing list please.
> > > >
> > > > We cannot keep discussing the register interface every week.
> > > > I remember we have discussed this many times already in following series.
> > > >
> > > > 1. legacy series
> > > > 2. tvq v4 series
> > > > 3. dynamic vq creation series
> > > > 4. again during suspend series under tvq head
> > > > 5. right now
> > > > 6. May be more that I forgot.
> > > >
> > > > I captured all the direction and options in the doc. One can refer
> > > > when those
> > > questions arise there.
> > > > If we don’t work cohesively same reasoning repetition does not help.
> > >
> > > It's still the same too, doc or no doc. You want to build a device
> > > without registers fine but don't force it down everyone's throat.
> > I don’t see any compelling reason for inventing new method really.
> > Nor continuing in register mode.
> > Virtio already has VQ.
> > If CVQ is so problematic, one should put everything on registers and not run
> > on double standards.
> 
> We should not and neither should we put everything behind a VQ.
>
Why?
 
> 
> > I captured all the reasoning and thoughts. I don’t have much to say in support
> > of infinite register scale.
> >
> > People who wants to push SIOV does not show single performance reason on
> > why SIOV to be done.
> > I have upstreamed SIOVs in Linux as SFs without PASID, and in all our scale
> > tests, before the device chocks, the system chocks.
> >
> > So when someone pushes the SIOV series, I will be the first one interested in
> > reading the performance numbers to proceed with patches.
> >
> > > And now with 8MBytes
> > > of on-device memory that's needed for migration and that's
> > > apparently fine I am even less interested in saving 256 bytes for config
> > > space.
> >
> > Again, not the right comparison. When and how to use 256 matters.
> > I haven’t come across any device that prefers infinite register scale.
> 
> Why resort to hyperbole? 256 bytes is pretty far from infinite.  But again, if you
> don't want it in registers just add an option to move
> *all* of config space out of registers. 
This does not work in a backward-compatible way, nor does it bring predictability.
The proposed method and current direction are just fine. No changes needed.
No one is extending the config space or config registers in virtio anymore anyway.

> cheaper devices will require newer guests.
> Or, 10 years will pass and you will be able to drop compat with old guests. I
> know it's too long a game for you to care but I've been virtio spec editor for
> more than 10 years so to me it seems reasonable to plan like that.

The current proposal of using CVQ works fine for most* virtio spec contributors.
I don’t see a technical reason to change it.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 12:46                                                                                                     ` Parav Pandit
@ 2023-11-17 13:54                                                                                                       ` Michael S. Tsirkin
  2023-11-17 14:51                                                                                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 13:54 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 12:46:21PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 6:00 PM
> > To: Parav Pandit <parav@nvidia.com>
> > 
> > On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 5:13 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > >
> > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > > > > >
> > > > > > > > > > > Additionally, if hypervisor has put the trap on virtio
> > > > > > > > > > > config, and because the memory device already has the
> > > > > > > > > > > interface for virtio config,
> > > > > > > > > > >
> > > > > > > > > > > Hypervisor can directly write/read from the virtual
> > > > > > > > > > > config to the member's
> > > > > > > > > > config space, without going through the device context, right?
> > > > > > > > > >
> > > > > > > > > > If it can do it or it can choose to not. I don't see how
> > > > > > > > > > it is related to the discussion here.
> > > > > > > > > >
> > > > > > > > > It is. I don’t see a point of hypervisor not using the
> > > > > > > > > native interface provided
> > > > > > > > by the member device.
> > > > > > > >
> > > > > > > > So for example, it seems reasonable to a member supporting
> > > > > > > > both existing pci register interface for compatibility and
> > > > > > > > the future DMA based one for scale. In such a case, it seems
> > > > > > > > possible that DMA will expose more features than pci. And
> > > > > > > > then a hypervisor might decide to use
> > > > > > that in preference to pci registers.
> > > > > > >
> > > > > > > We don’t find it right to involve owner device for mediating
> > > > > > > at current scale
> > > > > >
> > > > > > In this model, device will be its own owner. Should not be a problem.
> > > > > >
> > > > > I didn’t understand above comment.
> > > >
> > > > We'd add a new group type "self". You can then send admin commands
> > > > through VF itself not through PF.
> > > >
> > > How? The device is owned by the guest. FLR and device reset cannot send the
> > admin command reliably.
> > 
> > It's of the "it hurts when I do this - don't do this then" category.
> >
> it is don’t do medication category, yes due all this weirdness that has been asked.
>  
> > 
> > > >
> > > > > > > and to not break TDISP efforts in upcoming time by such design.
> > > > > >
> > > > > > Look you either stop mentioning TDISP as motivation or actually
> > > > > > try to address it. Safe migration with TDISP is really hard.
> > > > > But that is not an excuse to say that TDISP migration is not
> > > > > present, hence
> > > > involve the owner device for config space access.
> > > > > This is another hurdle added that further blocks us away from TDISP.
> > > > > Hence, we don’t want to take the route of involving owner device
> > > > > for any
> > > > config access.
> > > >
> > > > This "blocks" is all just wild hunches. hypervisor controls some
> > > > aspects of TDISP devices for sure - maybe we actually should use pci
> > > > config space as that is generally hypervisor controlled.
> > > Even bad to do hypercalls.
> > > I showed you last time the role of the PCI config space snippet from the spec.
> > 
> > Yes I remember. This is just an example though. My point is maybe it is solvable
> > maybe it is not.
> > 
> > > Do you see we are repeating the discussion again?
> > 
> > One of the reasons is that people bring up irrelevances. TDISP is important but
> > has to be addressed or deferred not vaguely referred to.
> 
> So lets continue to follow the current TDISP direction of not involving hypervisor for virtio common and device config.
> 
> > 
> > > >
> > > > > > For example, your current patches are clearly broken for TDISP:
> > > > > > owner can control queue state at any time making device modify
> > > > > > memory in any way it wants.
> > > > > >
> > > > > When TDISP migration is needed, the admin device can be another
> > > > > TVM
> > > > outside the HV scope.
> > > > > Or an alternative would have device context encrypted not visible to HV at
> > all.
> > > >
> > > > Maybe. Fact remains your patches do conflict with TDISP and you seem
> > > > to be fine with it because you have a hunch you can fix it. But we
> > > > can't do development based on your hunches.
> > > >
> > > We have different view.
> > > My patches do not conflict with TDISP because TDISP has clear definition of
> > not involving hypervisor for transport.
> > > And that part is still preserved.
> > > Delegating the migration to another TDISP or encrypting is yet to be defined.
> > > And current patches will align to both the approaches in future.
> > >
> > > So you need to re-evaluate your judgment.
> > 
> > If you like they do not "conflict".  But if used with TDISP they just make it
> > insecure and thus completely worthless.  If hypervisor can change ring state to
> > make device poke at random guest memory then it is game over and all the
> > effort spent was security theater.
> Not really, I proposed two options.
> 1. delegate the task of LM to the TVM. (proposed by two cpu vendors).
> In this case all the infra we build here, just works fine.

I think a modification will be needed: currently commands are sent
through the PF, and that is under hypervisor control.
You should not assign the PF to a TVM.

> It also does not require any hypervisor mediation for control plane.
> 
> 2. Encrypt the owner device workload to be not seen by hypervisor
> 
> Both methods does not affect the current direction.
> 
> But if we force trap+emulation, it is 100% broken for TDISP.
> And I would not promote that.
> 
> > But you know this, don't you? This is why you mentioned encrypting device.
> > Maybe that works. It just does not work *as is*.
> It works as_is. But current infrastructure does not block the future work.
> 
> > 
> > 
> > > >
> > > > > Such encryption is not possible, with the trap+emulation method,
> > > > > where HV
> > > > will have to decrypt the data coming over MMIO writes.
> > > >
> > > > I don't how what trap+emulation has to do with it. Do you refer to
> > > > the shadow vq thing?
> > >
> > > The method proposed here does not hinder any TDISP direction.
> > 
> > direction? No, why would it. we can always add more commands that are safe
> > for TDISP. commands you propose here are unsafe for TDISP.
> > 
> > > Without my proposal, do you have a method that does not involve hypervisor
> > intervention for virtio common and device config space, cvq and shadow vq?
> > > If so, I would like to hear that as well because that will align with TDISP.
> > 	
> > I really did not give it much thought.  I suspect for TDISP it just might be cleaner
> > to have guest agent migrate device. Certainly removes all the messy questions.
> > That, to me impliest there needs to be a way to send migration commands
> > through VF itself. Does this "involve hypervisor intervention"? No one should
> > care I think.
> Too far of the future to envision. May be yes. When such platform is
> built, for sure whoever migrates need migrate its device side too.
> Some knowledge of migration driver is needed.

So TDISP migration is so far in the future you do not need to bother
about it. Fine. Then don't bring it up pls.

> > 
> > 
> > > > I am guessing modern platforms with TDISP support are likely to also
> > > > support dirty bit in the IOMMU.
> > > >
> > > It will be some day.
> > 
> > What does this mean? Which platforms support TDISP and which IOMMUs do
> > they use?
> I said it will be some day, not right now.

That would be a stupid decision by a platform vendor. Let's just hope
they don't do it.

> > 
> > > >
> > > > > > > And for future scale, having new SIOV interface makes more
> > > > > > > sense which has
> > > > > > its own direct interface to device.
> > > > > > >
> > > > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > > > >
> > > > > > > [1]
> > > > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > > > > >
> > > > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > > > discussion will take place on the mailing list please.
> > > > >
> > > > > We cannot keep discussing the register interface every week.
> > > > > I remember we have discussed this many times already in following series.
> > > > >
> > > > > 1. legacy series
> > > > > 2. tvq v4 series
> > > > > 3. dynamic vq creation series
> > > > > 4. again during suspend series under tvq head
> > > > > 5. right now
> > > > > 6. May be more that I forgot.
> > > > >
> > > > > I captured all the direction and options in the doc. One can refer
> > > > > when those
> > > > questions arise there.
> > > > > If we don’t work cohesively same reasoning repetition does not help.
> > > >
> > > > It's still the same too, doc or no doc. You want to build a device
> > > > without registers fine but don't force it down everyone's throat.
> > > I don’t see any compelling reason for inventing new method really.
> > > Nor continuing in register mode.
> > > Virtio already has VQ.
> > > If CVQ is so problematic, one should put everything on registers and not run
> > on double standards.
> > 
> > We should not and neither should we put everything behind a VQ.
> >
> Why?

Because simple things should be simple, complex things should be
possible.

> > 
> > > I captured all the reasoning and thoughts. I don’t have much to say in support
> > of infinite register scale.
> > >
> > > People who wants to push SIOV does not show single performance reason on
> > why SIOV to be done.
> > > I have upstreamed SIOVs in Linux as SFs without PASID, and in all our scale
> > tests, before the device chocks, the system chocks.
> > >
> > > So when someone pushes the SIOV series, I will be the first one interested in
> > reading the performance numbers to proceed with patches.
> > >
> > > > And now with 8MBytes
> > > > of on-device memory that's needed for migration and that's
> > > > apparently fine I am even less interested in saving 256 bytes for config
> > space.
> > >
> > > Again, not the right comparison. When and how to use 256 matters.
> > > I haven’t come across any device that prefers infinite register scale.
> > 
> > Why resort to hyperbole? 256 bytes is pretty far from infinite.  But again, if you
> > don't want it in registers just add an option to move
> > *all* of config space out of registers. 
> This does not work in backward compatible way, not brings the predictability.

It will work if we want it to.

> The proposed method and current direction is just fine. No changes needed.

So one of the things that for example I care about is cross-vendor
compatibility. And if that compatibility is expressed through
config space then that is a single consistent interface.
How do you provision? Supply config space context.
This is what I am looking for, a single place where init time values can
be mapped, and for all device types. CVQ is not that.


> No one is extending the config space or config registers in virtio anymore anyway.

Then you tell me.  How much on-device memory per VF is fine?
If the answer is "as little as possible" then we can do better than
the current pci transport. If the answer is "a couple hundred bytes is not a
problem, as long as it's only init time stuff" then that is exactly
what we always had.

> > cheaper devices will require newer guests.
> > Or, 10 years will pass and you will be able to drop compat with old guests. I
> > know it's too long a game for you to care but I've been virtio spec editor for
> > more than 10 years so to me it seems reasonable to plan like that.
> 
> Current proposal of using CVQ is just working fine across most* virtio spec contributors.
> I don’t see technical reason to change it.

We have flamewars re-erupting all the time. What I want us to do is
to draw a line in the sand on what is and what is not reasonable in config
space. It has to have a model behind it too that a reasonable developer
can understand.

-- 
MST



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 13:54                                                                                                       ` Michael S. Tsirkin
@ 2023-11-17 14:51                                                                                                         ` Parav Pandit
  2023-11-17 15:09                                                                                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-17 14:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 17, 2023 7:24 PM
> 
> On Fri, Nov 17, 2023 at 12:46:21PM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 6:00 PM
> > > To: Parav Pandit <parav@nvidia.com>
> > >
> > > On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 5:13 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > > >
> > > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Additionally, if hypervisor has put the trap on
> > > > > > > > > > > > virtio config, and because the memory device
> > > > > > > > > > > > already has the interface for virtio config,
> > > > > > > > > > > >
> > > > > > > > > > > > Hypervisor can directly write/read from the
> > > > > > > > > > > > virtual config to the member's
> > > > > > > > > > > config space, without going through the device context, right?
> > > > > > > > > > >
> > > > > > > > > > > If it can do it or it can choose to not. I don't see
> > > > > > > > > > > how it is related to the discussion here.
> > > > > > > > > > >
> > > > > > > > > > It is. I don’t see a point of hypervisor not using the
> > > > > > > > > > native interface provided
> > > > > > > > > by the member device.
> > > > > > > > >
> > > > > > > > > So for example, it seems reasonable to a member
> > > > > > > > > supporting both existing pci register interface for
> > > > > > > > > compatibility and the future DMA based one for scale. In
> > > > > > > > > such a case, it seems possible that DMA will expose more
> > > > > > > > > features than pci. And then a hypervisor might decide to
> > > > > > > > > use
> > > > > > > that in preference to pci registers.
> > > > > > > >
> > > > > > > > We don’t find it right to involve owner device for
> > > > > > > > mediating at current scale
> > > > > > >
> > > > > > > In this model, device will be its own owner. Should not be a problem.
> > > > > > >
> > > > > > I didn’t understand above comment.
> > > > >
> > > > > We'd add a new group type "self". You can then send admin
> > > > > commands through VF itself not through PF.
> > > > >
> > > > How? The device is owned by the guest. FLR and device reset cannot
> > > > send the
> > > admin command reliably.
> > >
> > > It's of the "it hurts when I do this - don't do this then" category.
> > >
> > it is don’t do medication category, yes due all this weirdness that has been
> > asked.
> >
> > >
> > > > >
> > > > > > > > and to not break TDISP efforts in upcoming time by such design.
> > > > > > >
> > > > > > > Look you either stop mentioning TDISP as motivation or
> > > > > > > actually try to address it. Safe migration with TDISP is really hard.
> > > > > > But that is not an excuse to say that TDISP migration is not
> > > > > > present, hence
> > > > > involve the owner device for config space access.
> > > > > > This is another hurdle added that further blocks us away from TDISP.
> > > > > > Hence, we don’t want to take the route of involving owner
> > > > > > device for any
> > > > > config access.
> > > > >
> > > > > This "blocks" is all just wild hunches. hypervisor controls some
> > > > > aspects of TDISP devices for sure - maybe we actually should use
> > > > > pci config space as that is generally hypervisor controlled.
> > > > Even bad to do hypercalls.
> > > > I showed you last time the role of the PCI config space snippet from the
> spec.
> > >
> > > Yes I remember. This is just an example though. My point is maybe it
> > > is solvable maybe it is not.
> > >
> > > > Do you see we are repeating the discussion again?
> > >
> > > One of the reasons is that people bring up irrelevances. TDISP is
> > > important but has to be addressed or deferred not vaguely referred to.
> >
> > So lets continue to follow the current TDISP direction of not involving
> > hypervisor for virtio common and device config.

If you disagree with it, please speak now, so that we don’t debate this again in the next 3 days.
This is the fundamental design consideration the proposal relies on.
There is no point going forward if you disagree with it.
Other variants are fine, but they cannot be the only choice.

> >
> > >
> > > > >
> > > > > > > For example, your current patches are clearly broken for TDISP:
> > > > > > > owner can control queue state at any time making device
> > > > > > > modify memory in any way it wants.
> > > > > > >
> > > > > > When TDISP migration is needed, the admin device can be
> > > > > > another TVM
> > > > > outside the HV scope.
> > > > > > Or an alternative would have device context encrypted not
> > > > > > visible to HV at
> > > all.
> > > > >
> > > > > Maybe. Fact remains your patches do conflict with TDISP and you
> > > > > seem to be fine with it because you have a hunch you can fix it.
> > > > > But we can't do development based on your hunches.
> > > > >
> > > > We have different view.
> > > > My patches do not conflict with TDISP because TDISP has clear
> > > > definition of
> > > not involving hypervisor for transport.
> > > > And that part is still preserved.
> > > > Delegating the migration to another TDISP or encrypting is yet to be
> defined.
> > > > And current patches will align to both the approaches in future.
> > > >
> > > > So you need to re-evaluate your judgment.
> > >
> > > If you like they do not "conflict".  But if used with TDISP they
> > > just make it insecure and thus completely worthless.  If hypervisor
> > > can change ring state to make device poke at random guest memory
> > > then it is game over and all the effort spent was security theater.
> > Not really, I proposed two options.
> > 1. delegate the task of LM to the TVM. (proposed by two cpu vendors).
> > In this case all the infra we build here, just works fine.
> 
> I think modification will be needed: currently commands are sent through the
> PF, and that is under hypervisor control.
> You should not assign PF to TVM.
Yes, there will be an admin virtio function that executes the listed admin commands.

> 
> > It also does not require any hypervisor mediation for control plane.
> >
> > 2. Encrypt the owner device workload to be not seen by hypervisor
> >
> > Both methods does not affect the current direction.
> >
> > But if we force trap+emulation, it is 100% broken for TDISP.
> > And I would not promote that.
> >
> > > But you know this, don't you? This is why you mentioned encrypting device.
> > > Maybe that works. It just does not work *as is*.
> > It works as_is. But current infrastructure does not block the future work.
> >
> > >
> > >
> > > > >
> > > > > > Such encryption is not possible, with the trap+emulation
> > > > > > method, where HV
> > > > > will have to decrypt the data coming over MMIO writes.
> > > > >
> > > > > I don't how what trap+emulation has to do with it. Do you refer
> > > > > to the shadow vq thing?
> > > >
> > > > The method proposed here does not hinder any TDISP direction.
> > >
> > > direction? No, why would it. we can always add more commands that
> > > are safe for TDISP. commands you propose here are unsafe for TDISP.
> > >
> > > > Without my proposal, do you have a method that does not involve
> > > > hypervisor
> > > intervention for virtio common and device config space, cvq and shadow
> vq?
> > > > If so, I would like to hear that as well because that will align with TDISP.
> > >
> > > I really did not give it much thought.  I suspect for TDISP it just
> > > might be cleaner to have guest agent migrate device. Certainly removes all
> the messy questions.
> > > That, to me impliest there needs to be a way to send migration
> > > commands through VF itself. Does this "involve hypervisor
> > > intervention"? No one should care I think.
> > Too far of the future to envision. May be yes. When such platform is
> > built, for sure whoever migrates need migrate its device side too.
> > Some knowledge of migration driver is needed.
> 
> So TDISP migration is so far in the future you do not need to bother about it.
> Fine. Then don't bring it up pls.
> 
As long as we are aligned on the requirement that a virtio member device is mapped to the guest VM without mediating the virtio interface, I am good.
Again, other variants are fine, but the mapped variant described above is the minimum needed.

> > >
> > >
> > > > > I am guessing modern platforms with TDISP support are likely to
> > > > > also support dirty bit in the IOMMU.
> > > > >
> > > > It will be some day.
> > >
> > > What does this mean? Which platforms support TDISP and which IOMMUs
> > > do they use?
> > I said it will be some day, not right now.
> 
> That would be a stupid decision by a platform vendor. Let's just hope they don't
> do it.
> 
> > >
> > > > >
> > > > > > > > And for future scale, having new SIOV interface makes more
> > > > > > > > sense which has
> > > > > > > its own direct interface to device.
> > > > > > > >
> > > > > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > > > > > >
> > > > > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > > > > discussion will take place on the mailing list please.
> > > > > >
> > > > > > We cannot keep discussing the register interface every week.
> > > > > > I remember we have discussed this many times already in following
> series.
> > > > > >
> > > > > > 1. legacy series
> > > > > > 2. tvq v4 series
> > > > > > 3. dynamic vq creation series
> > > > > > 4. again during suspend series under tvq head
> > > > > > 5. right now
> > > > > > 6. May be more that I forgot.
> > > > > >
> > > > > > I captured all the direction and options in the doc. One can
> > > > > > refer when those
> > > > > questions arise there.
> > > > > > If we don’t work cohesively same reasoning repetition does not help.
> > > > >
> > > > > It's still the same too, doc or no doc. You want to build a
> > > > > device without registers fine but don't force it down everyone's throat.
> > > > I don’t see any compelling reason for inventing new method really.
> > > > Nor continuing in register mode.
> > > > Virtio already has VQ.
> > > > If CVQ is so problematic, one should put everything on registers
> > > > and not run
> > > on double standards.
> > >
> > > We should not and neither should we put everything behind a VQ.
> > >
> > Why?
> 
> Because simple things should be simple, complex things should be possible.
CVQ != complex.
And if it is, all things placed on the CVQ must be moved to registers first.

> 
> > >
> > > > I captured all the reasoning and thoughts. I don’t have much to
> > > > say in support
> > > of infinite register scale.
> > > >
> > > > People who wants to push SIOV does not show single performance
> > > > reason on
> > > why SIOV to be done.
> > > > I have upstreamed SIOVs in Linux as SFs without PASID, and in all
> > > > our scale
> > > tests, before the device chocks, the system chocks.
> > > >
> > > > So when someone pushes the SIOV series, I will be the first one
> > > > interested in
> > > reading the performance numbers to proceed with patches.
> > > >
> > > > > And now with 8MBytes
> > > > > of on-device memory that's needed for migration and that's
> > > > > apparently fine I am even less interested in saving 256 bytes
> > > > > for config
> > > space.
> > > >
> > > > Again, not the right comparison. When and how to use 256 matters.
> > > > I haven’t come across any device that prefers infinite register scale.
> > >
> > > Why resort to hyperbole? 256 bytes is pretty far from infinite.  But
> > > again, if you don't want it in registers just add an option to move
> > > *all* of config space out of registers.
> > This does not work in backward compatible way, not brings the predictability.
> 
> It will work if we want it to.
> 
> > The proposed method and current direction is just fine. No changes needed.
> 
> So one of the things that for example I care about is cross-vendor compatibility.
> And if that compatibility is expressed through config space then that is a single
> consistent interface.
It just does not make sense to express it as config space by burning more registers.
There is already a command in v4 that shows the member function capabilities.

> How do you provision? Supply config space context.
> This is what I am looking for, a single place where init time values can be
> mapped, and for all device type. CVQ is not that.
Provision the device for the resources using an admin command, and query capabilities via an admin command.

There is no need to place a once-in-a-lifetime structure in such registers.
This is not simplicity.

> 
> 
> > No one is extending the config space or config registers in virtio anymore
> anyway.
> 
> Then you tell me.  How much on-device memory per VF is fine?
As I listed in the document, a few bytes to bootstrap the device.
I anticipate less than 64B is really enough.

> If the answer is "as little as possible" then we can do better than the current pci
> transport. If the answer is "a couple 100 bytes is not a problem, as long as it's
> only be init time stuff" then that is exactly what we always had.
Right. So let's not add giant cross vendor capabilities there. It is surely going to cross 100 bytes.

> 
> > > cheaper devices will require newer guests.
> > > Or, 10 years will pass and you will be able to drop compat with old
> > > guests. I know it's too long a game for you to care but I've been
> > > virtio spec editor for more than 10 years so to me it seems reasonable to
> plan like that.
> >
> > Current proposal of using CVQ is just working fine across most* virtio spec
> contributors.
> > I don’t see technical reason to change it.
> 
> We have flamewars re-erupting all the time. What I want us to do is to draw a
> line in the sand what is and what is not reasonable in config space. It has to
> have a model behind it too that a reasonable developer can understand.
This is why I wrote all the points and trade-offs in the document.

The current model defined in virtio spec Appendix B B.2 is really good enough and practical.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 14:51                                                                                                         ` Parav Pandit
@ 2023-11-17 15:09                                                                                                           ` Michael S. Tsirkin
  2023-11-21  4:44                                                                                                             ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 15:09 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 02:51:04PM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 7:24 PM
> > 
> > On Fri, Nov 17, 2023 at 12:46:21PM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 6:00 PM
> > > > To: Parav Pandit <parav@nvidia.com>
> > > >
> > > > On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 5:13 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Additionally, if hypervisor has put the trap on
> > > > > > > > > > > > > virtio config, and because the memory device
> > > > > > > > > > > > > already has the interface for virtio config,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hypervisor can directly write/read from the
> > > > > > > > > > > > > virtual config to the member's
> > > > > > > > > > > > config space, without going through the device context, right?
> > > > > > > > > > > >
> > > > > > > > > > > > If it can do it or it can choose to not. I don't see
> > > > > > > > > > > > how it is related to the discussion here.
> > > > > > > > > > > >
> > > > > > > > > > > It is. I don’t see a point of hypervisor not using the
> > > > > > > > > > > native interface provided
> > > > > > > > > > by the member device.
> > > > > > > > > >
> > > > > > > > > > So for example, it seems reasonable to a member
> > > > > > > > > > supporting both existing pci register interface for
> > > > > > > > > > compatibility and the future DMA based one for scale. In
> > > > > > > > > > such a case, it seems possible that DMA will expose more
> > > > > > > > > > features than pci. And then a hypervisor might decide to
> > > > > > > > > > use
> > > > > > > > that in preference to pci registers.
> > > > > > > > >
> > > > > > > > > We don’t find it right to involve owner device for
> > > > > > > > > mediating at current scale
> > > > > > > >
> > > > > > > > In this model, device will be its own owner. Should not be a problem.
> > > > > > > >
> > > > > > > I didn’t understand above comment.
> > > > > >
> > > > > > We'd add a new group type "self". You can then send admin
> > > > > > commands through VF itself not through PF.
> > > > > >
> > > > > How? The device is owned by the guest. FLR and device reset cannot
> > > > > send the
> > > > admin command reliably.
> > > >
> > > > It's of the "it hurts when I do this - don't do this then" category.
> > > >
> > > it is don’t do medication category, yes due all this weirdness that has been
> > asked.
> > >
> > > >
> > > > > >
> > > > > > > > > and to not break TDISP efforts in upcoming time by such design.
> > > > > > > >
> > > > > > > > Look you either stop mentioning TDISP as motivation or
> > > > > > > > actually try to address it. Safe migration with TDISP is really hard.
> > > > > > > But that is not an excuse to say that TDISP migration is not
> > > > > > > present, hence
> > > > > > involve the owner device for config space access.
> > > > > > > This is another hurdle added that further blocks us away from TDISP.
> > > > > > > Hence, we don’t want to take the route of involving owner
> > > > > > > device for any
> > > > > > config access.
> > > > > >
> > > > > > This "blocks" is all just wild hunches. hypervisor controls some
> > > > > > aspects of TDISP devices for sure - maybe we actually should use
> > > > > > pci config space as that is generally hypervisor controlled.
> > > > > Even bad to do hypercalls.
> > > > > I showed you last time the role of the PCI config space snippet from the
> > spec.
> > > >
> > > > Yes I remember. This is just an example though. My point is maybe it
> > > > is solvable maybe it is not.
> > > >
> > > > > Do you see we are repeating the discussion again?
> > > >
> > > > One of the reasons is that people bring up irrelevances. TDISP is
> > > > important but has to be addressed or deferred not vaguely referred to.
> > >
> > > So lets continue to follow the current TDISP direction of not involving
> > hypervisor for virtio common and device config.
> 
> If you disagree to it, please speak now, so that we don’t debate on this again in next 3 days.
> Because this is the fundamental design considerations it relied on.
> There is no point going forward if you want to disagree to it.
> Other variants are fine, but other variants cannot be the only choice.
> 
> > >
> > > >
> > > > > >
> > > > > > > > For example, your current patches are clearly broken for TDISP:
> > > > > > > > owner can control queue state at any time making device
> > > > > > > > modify memory in any way it wants.
> > > > > > > >
> > > > > > > When TDISP migration is needed, the admin device can be
> > > > > > > another TVM
> > > > > > outside the HV scope.
> > > > > > > Or an alternative would have device context encrypted not
> > > > > > > visible to HV at
> > > > all.
> > > > > >
> > > > > > Maybe. Fact remains your patches do conflict with TDISP and you
> > > > > > seem to be fine with it because you have a hunch you can fix it.
> > > > > > But we can't do development based on your hunches.
> > > > > >
> > > > > We have different view.
> > > > > My patches do not conflict with TDISP because TDISP has clear
> > > > > definition of
> > > > not involving hypervisor for transport.
> > > > > And that part is still preserved.
> > > > > Delegating the migration to another TDISP or encrypting is yet to be
> > defined.
> > > > > And current patches will align to both the approaches in future.
> > > > >
> > > > > So you need to re-evaluate your judgment.
> > > >
> > > > If you like they do not "conflict".  But if used with TDISP they
> > > > just make it insecure and thus completely worthless.  If hypervisor
> > > > can change ring state to make device poke at random guest memory
> > > > then it is game over and all the effort spent was security theater.
> > > Not really, I proposed two options.
> > > 1. delegate the task of LM to the TVM. (proposed by two cpu vendors).
> > > In this case all the infra we build here, just works fine.
> > 
> > I think modification will be needed: currently commands are sent through the
> > PF, and that is under hypervisor control.
> > You should not assign PF to TVM.
> Yes, an admin virtio function will be there which will do the admin commands listed.

So it can't be PF, so at least we need a new group type.
I am inclined to then say, operate it through VF itself.


> > 
> > > It also does not require any hypervisor mediation for control plane.
> > >
> > > 2. Encrypt the owner device workload to be not seen by hypervisor
> > >
> > > Both methods does not affect the current direction.
> > >
> > > But if we force trap+emulation, it is 100% broken for TDISP.
> > > And I would not promote that.
> > >
> > > > But you know this, don't you? This is why you mentioned encrypting device.
> > > > Maybe that works. It just does not work *as is*.
> > > It works as_is. But current infrastructure does not block the future work.
> > >
> > > >
> > > >
> > > > > >
> > > > > > > Such encryption is not possible, with the trap+emulation
> > > > > > > method, where HV
> > > > > > will have to decrypt the data coming over MMIO writes.
> > > > > >
> > > > > > I don't know what trap+emulation has to do with it. Do you refer
> > > > > > to the shadow vq thing?
> > > > >
> > > > > The method proposed here does not hinder any TDISP direction.
> > > >
> > > > direction? No, why would it. we can always add more commands that
> > > > are safe for TDISP. commands you propose here are unsafe for TDISP.
> > > >
> > > > > Without my proposal, do you have a method that does not involve
> > > > > hypervisor
> > > > intervention for virtio common and device config space, cvq and shadow
> > vq?
> > > > > If so, I would like to hear that as well because that will align with TDISP.
> > > >
> > > > I really did not give it much thought.  I suspect for TDISP it just
> > > > might be cleaner to have guest agent migrate device. Certainly removes all
> > the messy questions.
> > > > That, to me, implies there needs to be a way to send migration
> > > > commands through VF itself. Does this "involve hypervisor
> > > > intervention"? No one should care I think.
> > > Too far of the future to envision. May be yes. When such platform is
> > > built, for sure whoever migrates need migrate its device side too.
> > > Some knowledge of migration driver is needed.
> > 
> > So TDISP migration is so far in the future you do not need to bother about it.
> > Fine. Then don't bring it up pls.
> > 
> As long as we are aligned to the requirement that a virtio member device is mapped to the guest VM without mediating the virtio interface, I am good.
> Again, other variants are fine, but above listed mapped variant is the minimum variant needed.

I think it's worth supporting this. I wouldn't call this the minimum; there
are other approaches. And I am not so sure it's worth trying
to support this in all kinds of systems, such as IOMMUs without
dirty bit support. If some old systems need mediation,
this is kind of like the legacy interface. Not a big deal.



> > > >
> > > >
> > > > > > I am guessing modern platforms with TDISP support are likely to
> > > > > > also support dirty bit in the IOMMU.
> > > > > >
> > > > > It will be some day.
> > > >
> > > > What does this mean? Which platforms support TDISP and which IOMMUs
> > > > do they use?
> > > I said it will be some day, not right now.
> > 
> > That would be a stupid decision by a platform vendor. Let's just hope they don't
> > do it.
> > 
> > > >
> > > > > >
> > > > > > > > > And for future scale, having new SIOV interface makes more
> > > > > > > > > sense which has
> > > > > > > > its own direct interface to device.
> > > > > > > > >
> > > > > > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > > > > > > >
> > > > > > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > > > > > discussion will take place on the mailing list please.
> > > > > > >
> > > > > > > We cannot keep discussing the register interface every week.
> > > > > > > I remember we have discussed this many times already in following
> > series.
> > > > > > >
> > > > > > > 1. legacy series
> > > > > > > 2. tvq v4 series
> > > > > > > 3. dynamic vq creation series
> > > > > > > 4. again during suspend series under tvq head 5. right now 6.
> > > > > > > May be more that I forgot.
> > > > > > >
> > > > > > > I captured all the direction and options in the doc. One can
> > > > > > > refer when those
> > > > > > questions arise there.
> > > > > > > If we don’t work cohesively same reasoning repetition does not help.
> > > > > >
> > > > > > It's still the same too, doc or no doc. You want to build a
> > > > > > device without registers fine but don't force it down everyone's throat.
> > > > > I don’t see any compelling reason for inventing new method really.
> > > > > Nor continuing in register mode.
> > > > > Virtio already has VQ.
> > > > > If CVQ is so problematic, one should put everything on registers
> > > > > and not run
> > > > on double standards.
> > > >
> > > > We should not and neither should we put everything behind a VQ.
> > > >
> > > Why?
> > 
> > Because simple things should be simple, complex things should be possible.
> CVQ != complex.
> And if it is, all things placed on the CVQ must be moved to registers first.

It is more complex than a register.

> > 
> > > >
> > > > > I captured all the reasoning and thoughts. I don’t have much to
> > > > > say in support
> > > > of infinite register scale.
> > > > >
> > > > > People who wants to push SIOV does not show single performance
> > > > > reason on
> > > > why SIOV to be done.
> > > > > I have upstreamed SIOVs in Linux as SFs without PASID, and in all
> > > > > our scale tests, before the device chokes, the system chokes.
> > > > >
> > > > > So when someone pushes the SIOV series, I will be the first one
> > > > > interested in
> > > > reading the performance numbers to proceed with patches.
> > > > >
> > > > > > And now with 8MBytes
> > > > > > of on-device memory that's needed for migration and that's
> > > > > > apparently fine I am even less interested in saving 256 bytes
> > > > > > for config
> > > > space.
> > > > >
> > > > > Again, not the right comparison. When and how to use 256 matters.
> > > > > I haven’t come across any device that prefers infinite register scale.
> > > >
> > > > Why resort to hyperbole? 256 bytes is pretty far from infinite.  But
> > > > again, if you don't want it in registers just add an option to move
> > > > *all* of config space out of registers.
> > > This does not work in backward compatible way, not brings the predictability.
> > 
> > It will work if we want it to.
> > 
> > > The proposed method and current direction is just fine. No changes needed.
> > 
> > So one of the things that for example I care about is cross-vendor compatibility.
> > And if that compatibility is expressed through config space then that is a single
> > consistent interface.
> It just does not make sense to be expressed as config space by burning more registers.
>
>
> There is already command in v4 that shows the member function capabilities.
> 
> > How do you provision? Supply config space context.
> > This is what I am looking for, a single place where init time values can be
> > mapped, and for all device type. CVQ is not that.
> Provision the device for the resources using admin command and query capabilities via admin command.
> 
> No need to place once in lifetime structure in such registers.
> This is not simplicity.
> 
> > 
> > 
> > > No one is extending the config space or config registers in virtio anymore
> > anyway.
> > 
> > Then you tell me.  How much on-device memory per VF is fine?
> As I listed in the document, few bytes to boot strap the device.
> I anticipate less than 64B is really enough.
> 
> > If the answer is "as little as possible" then we can do better than the current pci
> > transport. If the answer is "a couple 100 bytes is not a problem, as long as it's
> > only be init time stuff" then that is exactly what we always had.
> Right. So let's not add giant cross vendor capabilities there. It is surely going to cross 100 bytes.
> 
> > 
> > > > cheaper devices will require newer guests.
> > > > Or, 10 years will pass and you will be able to drop compat with old
> > > > guests. I know it's too long a game for you to care but I've been
> > > > virtio spec editor for more than 10 years so to me it seems reasonable to
> > plan like that.
> > >
> > > Current proposal of using CVQ is just working fine across most* virtio spec
> > contributors.
> > > I don’t see technical reason to change it.
> > 
> > We have flamewars re-erupting all the time. What I want us to do is to draw a
> > line in the sand what is and what is not reasonable in config space. It has to
> > have a model behind it too that a reasonable developer can understand.
> This is why I wrote all the points and trade off in the document.
> 
> The current model defined in virtio spec Appendix B B.2 is really good enough and practical.

It says:
	Device configuration space should only be used for initialization-time
	parameters.

So for example, you want to know whether device supports offloads
during driver init, thus offloads are fine as config space or
feature bits. Makes sense? As another example, you need to know
# of admin vqs during init time so again, config.
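
As a rough illustration of that rule, here is a minimal C sketch in which a
driver reads its init-time parameters once and never touches the config
region again. Everything in it is an assumption made for the example:
struct demo_init_cfg, DEMO_OFFLOAD_CSUM and the demo_* helpers are
hypothetical names, and the layout is not the one defined by the virtio
spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical init-time config layout, for illustration only.
 * This is NOT the layout defined by the virtio specification. */
struct demo_init_cfg {
    uint32_t offload_bits;   /* offloads the device can perform */
    uint16_t admin_vq_index; /* index of the first admin virtqueue */
    uint16_t admin_vq_num;   /* number of admin virtqueues */
};

#define DEMO_OFFLOAD_CSUM (1u << 0) /* hypothetical offload bit */

/* Driver-side snapshot, filled once during driver init. */
struct demo_driver {
    uint32_t offloads;
    uint16_t admin_vq_first;
    uint16_t admin_vq_count;
    bool initialized;
};

/* Read every init-time parameter exactly once. A real driver would also
 * convert from little-endian; that step is skipped here for brevity. */
static void demo_driver_init(struct demo_driver *drv,
                             const volatile struct demo_init_cfg *cfg)
{
    assert(drv != NULL && cfg != NULL);
    drv->offloads = cfg->offload_bits;
    drv->admin_vq_first = cfg->admin_vq_index;
    drv->admin_vq_count = cfg->admin_vq_num;
    drv->initialized = true;
}

/* Runtime paths consult the snapshot, not the config region. */
static bool demo_has_offload(const struct demo_driver *drv, uint32_t bit)
{
    return drv->initialized && (drv->offloads & bit) != 0;
}
```

The point of the sketch is only the access pattern: values that are needed
at init and do not change afterwards are the ones that fit this model.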


-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  6:34                                                                                             ` Parav Pandit
  2023-11-16  6:38                                                                                               ` Michael S. Tsirkin
@ 2023-11-21  4:22                                                                                               ` Jason Wang
  2023-11-21 16:25                                                                                                 ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-21  4:22 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 16, 2023 at 2:34 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 16, 2023 11:53 AM
> >
> > On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > > You continue to want to overload admin commands for dual purpose, does
> > not make sense to me.
> >
> > dual -> as a transport and for migration? why can't they be used for this? I was
> > really hoping to cover these two cases when I proposed them.
> For following reasons.
>
> 1. migration needs incremental reads of only changed context between two reads

This is wrong. We need to invent general facilities. I've pointed out
sufficient issues, and what's more, deltas don't work for the
following cases:

1) a migration that fails, followed by another migration attempt
2) saving the VM state twice

It requires a lot of tricks in the hypervisor to do that (e.g. cache the last state?).
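
To make the caching concern concrete, here is a minimal C sketch, assuming
a hypothetical device that reports only the context fields changed since
the previous read. All names (demo_device, demo_delta, demo_hv_cache and
the demo_* helpers) are invented for the example and are not part of the
proposed commands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CTX_FIELDS 4

/* Hypothetical device context with per-field dirty tracking. */
struct demo_device {
    uint32_t field[DEMO_CTX_FIELDS];
    uint8_t dirty[DEMO_CTX_FIELDS];
};

/* One entry of an incremental (delta) read. */
struct demo_delta {
    size_t idx;
    uint32_t value;
};

/* Hypervisor-side copy of the full context, rebuilt from deltas. */
struct demo_hv_cache {
    uint32_t field[DEMO_CTX_FIELDS];
};

/* Device state changes mark the field dirty. */
static void demo_dev_write(struct demo_device *d, size_t idx, uint32_t v)
{
    assert(idx < DEMO_CTX_FIELDS);
    d->field[idx] = v;
    d->dirty[idx] = 1;
}

/* Return only fields changed since the last read and clear the dirty
 * flags. out must have room for DEMO_CTX_FIELDS entries. */
static size_t demo_dev_read_delta(struct demo_device *d,
                                  struct demo_delta *out)
{
    size_t n = 0;
    for (size_t i = 0; i < DEMO_CTX_FIELDS; i++) {
        if (d->dirty[i]) {
            out[n].idx = i;
            out[n].value = d->field[i];
            n++;
            d->dirty[i] = 0;
        }
    }
    return n;
}

/* Without this cache, a second save (or a retry after a failed
 * migration) would see an empty delta and lose the earlier state. */
static void demo_hv_apply(struct demo_hv_cache *c,
                          const struct demo_delta *delta, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c->field[delta[i].idx] = delta[i].value;
}
```

In this sketch a second demo_dev_read_delta() right after the first returns
zero entries, so the full state for a second save can only come from
demo_hv_cache.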

>
> 2. migration writes covers large part of the configurations not just virtio common config and device config.

You are inventing a duplicate of the common_cfg structure, no?

What's wrong if we just allow them to be R/W over adminq/commands?

> Such as configuration occurred through the CVQ. All of these is not needed when done from guest directly via member's own CVQ.

That's the device-type-specific state, which requires new commands
for sure. I don't see any connection. The SIOV device needs to be
migrated as well.

>
> For backward compatible SIOV transport, one may need them to transport without above two properties.

Why? Just mediate between virtual PCI and adminq.

>
> 3. None of this transport is needed for PFs, VFs and non-backward compatible SIOVs.
> Each device to have its own transport that is not intercepted by the hypervisor and follow the equivalency principle uniformly for all 3 device types.

You can have per VF transport q, what's wrong with that?

Thanks

>
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 15:09                                                                                                           ` Michael S. Tsirkin
@ 2023-11-21  4:44                                                                                                             ` Jason Wang
  2023-11-21 16:27                                                                                                               ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-21  4:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 11:09 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Nov 17, 2023 at 02:51:04PM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 7:24 PM
> > >
> > > On Fri, Nov 17, 2023 at 12:46:21PM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 6:00 PM
> > > > > To: Parav Pandit <parav@nvidia.com>
> > > > >
> > > > > On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 5:13 PM
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > > > > >
> > > > > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Additionally, if hypervisor has put the trap on
> > > > > > > > > > > > > > virtio config, and because the memory device
> > > > > > > > > > > > > > already has the interface for virtio config,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hypervisor can directly write/read from the
> > > > > > > > > > > > > > virtual config to the member's
> > > > > > > > > > > > > config space, without going through the device context, right?
> > > > > > > > > > > > >
> > > > > > > > > > > > > If it can do it or it can choose to not. I don't see
> > > > > > > > > > > > > how it is related to the discussion here.
> > > > > > > > > > > > >
> > > > > > > > > > > > It is. I don’t see a point of hypervisor not using the
> > > > > > > > > > > > native interface provided
> > > > > > > > > > > by the member device.
> > > > > > > > > > >
> > > > > > > > > > > So for example, it seems reasonable to a member
> > > > > > > > > > > supporting both existing pci register interface for
> > > > > > > > > > > compatibility and the future DMA based one for scale. In
> > > > > > > > > > > such a case, it seems possible that DMA will expose more
> > > > > > > > > > > features than pci. And then a hypervisor might decide to
> > > > > > > > > > > use
> > > > > > > > > that in preference to pci registers.
> > > > > > > > > >
> > > > > > > > > > We don’t find it right to involve owner device for
> > > > > > > > > > mediating at current scale
> > > > > > > > >
> > > > > > > > > In this model, device will be its own owner. Should not be a problem.
> > > > > > > > >
> > > > > > > > I didn’t understand above comment.
> > > > > > >
> > > > > > > We'd add a new group type "self". You can then send admin
> > > > > > > commands through VF itself not through PF.
> > > > > > >
> > > > > > How? The device is owned by the guest. FLR and device reset cannot
> > > > > > send the
> > > > > admin command reliably.
> > > > >
> > > > > It's of the "it hurts when I do this - don't do this then" category.
> > > > >
> > > > it is don’t do medication category, yes due all this weirdness that has been
> > > asked.
> > > >
> > > > >
> > > > > > >
> > > > > > > > > > and to not break TDISP efforts in upcoming time by such design.
> > > > > > > > >
> > > > > > > > > Look you either stop mentioning TDISP as motivation or
> > > > > > > > > actually try to address it. Safe migration with TDISP is really hard.
> > > > > > > > But that is not an excuse to say that TDISP migration is not
> > > > > > > > present, hence
> > > > > > > involve the owner device for config space access.
> > > > > > > > This is another hurdle added that further blocks us away from TDISP.
> > > > > > > > Hence, we don’t want to take the route of involving owner
> > > > > > > > device for any
> > > > > > > config access.
> > > > > > >
> > > > > > > This "blocks" is all just wild hunches. hypervisor controls some
> > > > > > > aspects of TDISP devices for sure - maybe we actually should use
> > > > > > > pci config space as that is generally hypervisor controlled.
> > > > > > Even bad to do hypercalls.
> > > > > > I showed you last time the role of the PCI config space snippet from the
> > > spec.
> > > > >
> > > > > Yes I remember. This is just an example though. My point is maybe it
> > > > > is solvable maybe it is not.
> > > > >
> > > > > > Do you see we are repeating the discussion again?
> > > > >
> > > > > One of the reasons is that people bring up irrelevances. TDISP is
> > > > > important but has to be addressed or deferred not vaguely referred to.
> > > >
> > > > So lets continue to follow the current TDISP direction of not involving
> > > hypervisor for virtio common and device config.
> >
> > If you disagree to it, please speak now, so that we don’t debate on this again in next 3 days.
> > Because this is the fundamental design considerations it relied on.
> > There is no point going forward if you want to disagree to it.
> > Other variants are fine, but other variants cannot be the only choice.
> >
> > > >
> > > > >
> > > > > > >
> > > > > > > > > For example, your current patches are clearly broken for TDISP:
> > > > > > > > > owner can control queue state at any time making device
> > > > > > > > > modify memory in any way it wants.
> > > > > > > > >
> > > > > > > > When TDISP migration is needed, the admin device can be
> > > > > > > > another TVM
> > > > > > > outside the HV scope.
> > > > > > > > Or an alternative would have device context encrypted not
> > > > > > > > visible to HV at
> > > > > all.
> > > > > > >
> > > > > > > Maybe. Fact remains your patches do conflict with TDISP and you
> > > > > > > seem to be fine with it because you have a hunch you can fix it.
> > > > > > > But we can't do development based on your hunches.
> > > > > > >
> > > > > > We have different view.
> > > > > > My patches do not conflict with TDISP because TDISP has clear
> > > > > > definition of
> > > > > not involving hypervisor for transport.
> > > > > > And that part is still preserved.
> > > > > > Delegating the migration to another TDISP or encrypting is yet to be
> > > defined.
> > > > > > And current patches will align to both the approaches in future.
> > > > > >
> > > > > > So you need to re-evaluate your judgment.
> > > > >
> > > > > If you like they do not "conflict".  But if used with TDISP they
> > > > > just make it insecure and thus completely worthless.  If hypervisor
> > > > > can change ring state to make device poke at random guest memory
> > > > > then it is game over and all the effort spent was security theater.
> > > > Not really, I proposed two options.
> > > > 1. delegate the task of LM to the TVM. (proposed by two cpu vendors).
> > > > In this case all the infra we build here, just works fine.
> > >
> > > I think modification will be needed: currently commands are sent through the
> > > PF, and that is under hypervisor control.
> > > You should not assign PF to TVM.

That's the point. And that's why people keep getting confused into believing
that the current PF/adminq can work with TDISP.

> > Yes, an admin virtio function will be there which will do the admin commands listed.
>
> So it can't be PF, so at least we need a new group type.
> I am inclined to then say, operate it through VF itself.

So it exactly matches the idea of transport virtqueue (a per VF/SF one).

But it still requires a PCI part to bootstrap.

>
>
> > >
> > > > It also does not require any hypervisor mediation for control plane.
> > > >
> > > > 2. Encrypt the owner device workload to be not seen by hypervisor
> > > >
> > > > Neither method affects the current direction.
> > > >
> > > > But if we force trap+emulation, it is 100% broken for TDISP.
> > > > And I would not promote that.
> > > >
> > > > > But you know this, don't you? This is why you mentioned encrypting device.
> > > > > Maybe that works. It just does not work *as is*.
> > > > It works as_is. But current infrastructure does not block the future work.
> > > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > > > Such encryption is not possible, with the trap+emulation
> > > > > > > > > method, where HV will have to decrypt the data coming over
> > > > > > > > > MMIO writes.
> > > > > > > >
> > > > > > > > I don't know what trap+emulation has to do with it. Do you refer
> > > > > > > > to the shadow vq thing?
> > > > > >
> > > > > > The method proposed here does not hinder any TDISP direction.
> > > > >
> > > > > direction? No, why would it. we can always add more commands that
> > > > > are safe for TDISP. commands you propose here are unsafe for TDISP.
> > > > >
> > > > > > > Without my proposal, do you have a method that does not involve
> > > > > > > hypervisor intervention for virtio common and device config space,
> > > > > > > cvq and shadow vq?
> > > > > > If so, I would like to hear that as well because that will align with TDISP.
> > > > >
> > > > > > I really did not give it much thought.  I suspect for TDISP it just
> > > > > > might be cleaner to have guest agent migrate device. Certainly
> > > > > > removes all the messy questions.
> > > > > > That, to me, implies there needs to be a way to send migration
> > > > > > commands through VF itself. Does this "involve hypervisor
> > > > > > intervention"? No one should care I think.
> > > > > Too far in the future to envision. Maybe yes. When such a platform is
> > > > > built, for sure whoever migrates needs to migrate its device side too.
> > > > Some knowledge of migration driver is needed.
> > >
> > > So TDISP migration is so far in the future you do not need to bother about it.
> > > Fine. Then don't bring it up pls.
> > >
> > As long as we are aligned to the requirement that a virtio member device is mapped to the guest VM without mediating the virtio interface, I am good.
> > Again, other variants are fine, but above listed mapped variant is the minimum variant needed.
>
> I think it's worth supporting this. I wouldn't call this the minimum; there
> are other approaches.  And I am not so sure it's worth trying
> to support this in all kind of systems such as IOMMU without
> dirty bit support. If some old systems will need mediation,
> this is kind of like legacy interface. Not a big deal.
>

+1

Thanks




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-17 12:02                                                                                                 ` Parav Pandit
  2023-11-17 12:30                                                                                                   ` Michael S. Tsirkin
@ 2023-11-21  5:25                                                                                                   ` Jason Wang
  2023-11-21 16:30                                                                                                     ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-21  5:25 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Fri, Nov 17, 2023 at 8:03 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 17, 2023 5:13 PM
> >
> > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 4:41 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > >
> > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > > Additionally, if hypervisor has put the trap on virtio
> > > > > > > > > config, and because the memory device already has the
> > > > > > > > > interface for virtio config,
> > > > > > > > >
> > > > > > > > > > Hypervisor can directly write/read from the virtual config
> > > > > > > > > > to the member's config space, without going through the
> > > > > > > > > > device context, right?
> > > > > > > > >
> > > > > > > > > It can do it or it can choose not to. I don't see how it
> > > > > > > > > is related to the discussion here.
> > > > > > > > >
> > > > > > > > It is. I don’t see a point of hypervisor not using the native
> > > > > > > > interface provided by the member device.
> > > > > > >
> > > > > > > So for example, it seems reasonable for a member to support both
> > > > > > > existing pci register interface for compatibility and the future
> > > > > > > DMA based one for scale. In such a case, it seems possible that
> > > > > > > DMA will expose more features than pci. And then a hypervisor
> > > > > > > might decide to use that in preference to pci registers.
> > > > >
> > > > > We don’t find it right to involve owner device for mediating at
> > > > > current scale
> > > >
> > > > In this model, device will be its own owner. Should not be a problem.
> > > >
> > > I didn’t understand above comment.
> >
> > We'd add a new group type "self". You can then send admin commands through
> > VF itself not through PF.
> >
> How? The device is owned by the guest. FLR and device reset cannot send the admin command reliably.
>
> >
> > > > > and to not break TDISP efforts in upcoming time by such design.
> > > >
> > > > Look you either stop mentioning TDISP as motivation or actually try
> > > > to address it. Safe migration with TDISP is really hard.
> > > > But that is not an excuse to say that TDISP migration is not present, hence
> > > > involve the owner device for config space access.
> > > > This is another hurdle added that further blocks us away from TDISP.
> > > > Hence, we don’t want to take the route of involving owner device for any
> > > > config access.
> >
> > This "blocks" is all just wild hunches. hypervisor controls some aspects of TDISP
> > devices for sure - maybe we actually should use pci config space as that is
> > generally hypervisor controlled.
> Even bad to do hypercalls.
> I showed you last time the role of the PCI config space snippet from the spec.
> Do you see we are repeating the discussion again?
>
> >
> > > > For example, your current patches are clearly broken for TDISP:
> > > > owner can control queue state at any time making device modify
> > > > memory in any way it wants.
> > > >
> > > When TDISP migration is needed, the admin device can be another TVM
> > > outside the HV scope.
> > > Or an alternative would have device context encrypted not visible to HV at all.
> >
> > Maybe. Fact remains your patches do conflict with TDISP and you seem to be
> > fine with it because you have a hunch you can fix it. But we can't do
> > development based on your hunches.
> >
> We have different view.
> My patches do not conflict with TDISP because TDISP has clear definition of not involving hypervisor for transport.
> And that part is still preserved.
> Delegating the migration to another TDISP or encrypting is yet to be defined.
> And current patches will align to both the approaches in future.
>
> So you need to re-evaluate your judgment.
>
> >
> > > Such encryption is not possible, with the trap+emulation method, where HV
> > > will have to decrypt the data coming over MMIO writes.
> >
> > I don't know what trap+emulation has to do with it. Do you refer to the shadow
> > vq thing?
>
> The method proposed here does not hinder any TDISP direction.
>
> Without my proposal, do you have a method that does not involve hypervisor intervention for virtio common and device config space, cvq and shadow vq?
> If so, I would like to hear that as well because that will align with TDISP.

So this is what you said:

1) TDISP would not do mediation
2) registers don't scale

This is exactly what transport virtqueue did. Isn't it?

>
> > I am guessing modern platforms with TDISP support are likely to also
> > support dirty bit in the IOMMU.
> >
> It will be some day.

Dirty bit is far more realistic than TDISP in the short term.

>
> >
> > > > > > And for future scale, having new SIOV interface makes more sense
> > > > > > which has its own direct interface to device.
> > > > > >
> > > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > > >
> > > > > > [1]
> > > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > > >
> > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > discussion will take place on the mailing list please.
> > >
> > > We cannot keep discussing the register interface every week.
> > > I remember we have discussed this many times already in following series.
> > >
> > > 1. legacy series

How can this be supported in TDISP then?

> > > 2. tvq v4 series
> > > 3. dynamic vq creation series
> > > 4. again during suspend series under tvq head
> > > 5. right now
> > > 6. May be more that I forgot.
> > >
> > > I captured all the direction and options in the doc. One can refer when those
> > > questions arise there.
> > > If we don’t work cohesively same reasoning repetition does not help.
> >
> > It's still the same too, doc or no doc. You want to build a device without
> > registers fine but don't force it down everyone's throat.
> I don’t see any compelling reason for inventing new method really.

New requests/platforms come for sure, and virtio supports various transports.

For example, there's a request to support PCI endpoint devices.

> Nor continuing in register mode.

Most virtio devices are implemented in software. And we have pure MMIO
based transport now which is implemented in registers only.

> Virtio already has VQ.
> If CVQ is so problematic, one should put everything on registers and not run on double standards.

I don't think there's anyone who says CVQ is problematic.

>
> I captured all the reasoning and thoughts. I don’t have much to say in support of infinite register scale.
>
> People who want to push SIOV do not show a single performance reason for why SIOV should be done.
> I have upstreamed SIOVs in Linux as SFs without PASID, and in all our scale tests, before the device chokes, the system chokes.
>
> So when someone pushes the SIOV series, I will be the first one interested in reading the performance numbers to proceed with patches.
>
> > And now with 8MBytes
> > of on-device memory that's needed for migration and that's apparently fine I
> > am even less interested in saving 256 bytes for config space.
>
> Again, not the right comparison.
> When and how to use 256 matters.

Do you know how much the config has grown in the past years since 1.0?

Virtio should be easy to implement across the whole range:

1) from software devices to hardware devices
2) from embedded to server

You can't say that, e.g., migration is needed in all of these environments.

Thanks

> I haven’t come across any device that prefers infinite register scale.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-16  5:28                                                                                         ` Parav Pandit
  2023-11-16  6:23                                                                                           ` Michael S. Tsirkin
@ 2023-11-21  7:24                                                                                           ` Jason Wang
  2023-11-21 16:32                                                                                             ` Parav Pandit
  1 sibling, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-21  7:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Thu, Nov 16, 2023 at 1:28 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, November 16, 2023 9:50 AM
> >
> > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, November 13, 2023 9:03 AM
> > > >
> > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Tuesday, November 7, 2023 9:35 AM
> > > > > >
> > > > > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > > > > >
> > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit
> > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit
> > > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > > > > <virtio-comment@lists.oasis- open.org> On
> > > > > > > > > > > > > > > > Behalf Of Jason Wang
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav
> > > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav
> > > > > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > From: Jason Wang
> > > > > > > > > > > > > > > > > > > > <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > > > > Sent: Wednesday, October 25, 2023
> > > > > > > > > > > > > > > > > > > > 6:59 AM
> > > > > > > > > > > > > > > > > > > > > > For passthrough PASID assignment
> > > > > > > > > > > > > > > > > > > > > > vq is not needed.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > > > > > > > > > Because for passthrough, the
> > > > > > > > > > > > > > > > > > > > > hypervisor is not involved in dealing
> > > > > > > > > > > > > > > > > > > > > with VQ at all.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Ok, so if I understand correctly, you
> > > > > > > > > > > > > > > > > > are saying your design can't work for
> > > > > > > > > > > > > > > > > > the case of PASID
> > > > assignment.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > No. PASID assignment will happen from the
> > > > > > > > > > > > > > > > > guest for its own use and device
> > > > > > > > > > > > > > > > migration will just work fine because device
> > > > > > > > > > > > > > > > context will capture
> > > > > > > > this.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It's not about device context. We're
> > > > > > > > > > > > > > > > discussing "passthrough",
> > > > > > > > no?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Not sure, we are discussing same.
> > > > > > > > > > > > > > > A member device is passthrough to the guest,
> > > > > > > > > > > > > > > dealing with its own PASIDs and
> > > > > > > > > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > > > > > > > > So VQ context captured by the hypervisor, will
> > > > > > > > > > > > > > > have some PASID attached to
> > > > > > > > > > > > > > this VQ.
> > > > > > > > > > > > > > > Device context will be updated.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > You want all virtio stuff to be
> > > > > > > > > > > > > > > > "passthrough", but assigning a PASID to a
> > > > > > > > > > > > > > > > specific virtqueue in the guest must be
> > > > > > > > trapped.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > No. PASID assignment to a specific virtqueue
> > > > > > > > > > > > > > > in the guest must go directly
> > > > > > > > > > > > > > from guest to device.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This works like setting CR3, you can't simply
> > > > > > > > > > > > > > let it go from guest to
> > > > > > > > host.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Host IOMMU driver needs to know the PASID to
> > > > > > > > > > > > > > program the IO page tables correctly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > This will be done by the IOMMU.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > When guest iommu may need to communicate
> > > > > > > > > > > > > > > anything for this PASID, it will
> > > > > > > > > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Let's say using PASID X for queue 0, this
> > > > > > > > > > > > > > knowledge is beyond the IOMMU scope but belongs
> > > > > > > > > > > > > > to virtio. Or please explain how it can work
> > > > > > > > > > > > > > when it goes directly from guest to
> > > > > > device.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > > > > > > > > >
> > > > > > > > > > > > It has one.
> > > > > > > > > > > >
> > > > > > > > > > > > > > But ok, for theory's sake, say it is there.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Virtio driver will assign the PASID directly from
> > > > > > > > > > > > > > guest driver to device using a create_vq(pasid=X) command.
> > > > > > > > > > > > > > The same process is somehow attached to the PASID by the guest OS.
> > > > > > > > > > > > > > The whole PASID range is known to the hypervisor
> > > > > > > > > > > > > > when the device is handed over to the guest VM.
> > > > > > > > > > > >
> > > > > > > > > > > > How can it know?
> > > > > > > > > > > >
> > > > > > > > > > > > > So PASID mapping is setup by the hypervisor IOMMU
> > > > > > > > > > > > > at this
> > > > point.
> > > > > > > > > > > >
> > > > > > > > > > > > You disallow the PASID to be virtualized here.
> > > > > > > > > > > > What's more, such a PASID passthrough has security
> > implications.
> > > > > > > > > > > >
> > > > > > > > > > > No. virtio spec is not disallowing. At least for sure,
> > > > > > > > > > > this series is not the
> > > > > > > > one.
> > > > > > > > > > > My main point is, virtio device interface will not be
> > > > > > > > > > > the source of hypercall to
> > > > > > > > > > program IOMMU in the hypervisor.
> > > > > > > > > > > It is something to be done by IOMMU side.
> > > > > > > > > >
> > > > > > > > > > So unless vPASID can be used by the hardware you need to
> > > > > > > > > > trap the mapping from a PASID to a virtqueue. Then you
> > > > > > > > > > need virtio specific
> > > > > > > > knowledge.
> > > > > > > > > >
> > > > > > > > > vPASID by hardware is unlikely to be used by hw PCI EP
> > > > > > > > > devices at least in any
> > > > > > > > near term future.
> > > > > > > > > This requires either vPASID to pPASID table in device or in IOMMU.
> > > > > > > >
> > > > > > > > So we are on the same page.
> > > > > > > >
> > > > > > > > Claiming a method that can only work for passthrough or
> > > > > > > > emulation is not
> > > > > > good.
> > > > > > > > We all know virtualization is passthrough + emulation.
> > > > > > > Again, I agree but I wont generalize it here.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Again, we are talking about different things, I've
> > > > > > > > > > > > tried to show you that there are cases that
> > > > > > > > > > > > passthrough can't work but if you think the only way
> > > > > > > > > > > > for migration is to use passthrough in every case,
> > > > > > > > > > > > you will
> > > > > > > > > > probably fail.
> > > > > > > > > > > >
> > > > > > > > > > > I didn't say only way for migration is passthrough.
> > > > > > > > > > > Passthrough is clearly one way.
> > > > > > > > > > > Other ways may be possible.
> > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > There are works ongoing to make
> > > > > > > > > > > > > > > > > > > > vPASID work for the guest like
> > > > > > > > > > > > > > vSVA.
> > > > > > > > > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Great, you find another limitation of
> > > > > > > > > > > > > > > > > > "passthrough" by
> > > > > > > > yourself.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > No. it is not the limitation it is just
> > > > > > > > > > > > > > > > > the way it does not need complex SVA to
> > > > > > > > > > > > > > > > split the device for unrelated usage.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > How can you limit the user in the guest to not use
> > vSVA?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > He he, I am not limiting, again
> > > > > > > > > > > > > > > misunderstanding or wrong
> > > > > > > > attribution.
> > > > > > > > > > > > > > > I explained that hypervisor for passthrough
> > > > > > > > > > > > > > > does not need
> > > > SVA.
> > > > > > > > > > > > > > > Guest can do anything it wants from the guest
> > > > > > > > > > > > > > > OS with the member
> > > > > > > > > > > > device.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ok, so the point stills, see above.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I don’t think so. The guest owns its PASID space
> > > > > > > > > > > >
> > > > > > > > > > > > > Again, vPASID to PASID can't be done in hardware unless
> > > > > > > > > > > > I miss some recent features of IOMMUs.
> > > > > > > > > > > >
> > > > > > > > > > > Cpu vendors have different way of doing vPASID to pPASID.
> > > > > > > > > >
> > > > > > > > > > At least for the current version of major IOMMU vendors,
> > > > > > > > > > such translation (aka PASID remapping) is not
> > > > > > > > > > implemented in the hardware so it needs to be trapped first.
> > > > > > > > > >
> > > > > > > > > > Right. So it is really far in the future, at least a few years away.
> > > > > > > > >
> > > > > > > > > > > It is still an early space for virtio.
> > > > > > > > > > >
> > > > > > > > > > > > > and directly communicates like any other device attribute.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Each passthrough device has PASID from
> > > > > > > > > > > > > > > > > > > its own space fully managed by the
> > > > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > > > > > > Some cpu required vPASID and SIOV is
> > > > > > > > > > > > > > > > > > > > > not going this way anymore.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Then how to migrate? Invent a full set
> > > > > > > > > > > > > > > > > > of something else through another giant
> > > > > > > > > > > > > > > > > > series like this to migrate to the SIOV
> > > > > > > > > > thing?
> > > > > > > > > > > > > > > > > > That's a mess for
> > > > > > > > > > > > > > > > sure.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > SIOV will for sure reuse most or all parts
> > > > > > > > > > > > > > > > > of this work, almost entirely
> > > > > > > > > > > > as_is.
> > > > > > > > > > > > > > > > > vPASID is cpu/platform specific things not
> > > > > > > > > > > > > > > > > part of the SIOV
> > > > > > > > devices.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > If at all it is done, it will be
> > > > > > > > > > > > > > > > > > > > > done from the guest by the driver
> > > > > > > > > > > > > > > > > > > > > using virtio
> > > > > > > > > > > > > > > > > > > > interface.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Then you need to trap. Such things
> > > > > > > > > > > > > > > > > > > > couldn't be passed through to guests
> > > > > > > > > > > > > > > > > > directly.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Only PASID capability is trapped.
> > > > > > > > > > > > > > > > > > > PASID allocation and usage is directly
> > > > > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > How can you achieve this? Assigning a
> > > > > > > > > > > > > > > > > > > > PASID to a device is completely
> > > > > > > > > > > > > > > > > > > > device(virtio) specific. How can you use
> > > > > > > > > > > > > > > > > > > > a general layer without the knowledge of
> > > > > > > > > > > > > > > > > > > > virtio to trap that?
> > > > > > > > > > > > > > > > > When one wants to map vPASID to pPASID a
> > > > > > > > > > > > > > > > > platform needs to be
> > > > > > > > > > > > > > involved.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'm not talking about how to map vPASID to
> > > > > > > > > > > > > > > > pPASID, it's out of the scope of virtio. I'm
> > > > > > > > > > > > > > > > talking about assigning a vPASID to a
> > > > > > > > > > > > > > > > specific virtqueue or other virtio function
> > > > > > > > > > > > > > > > in the
> > > > > > > > guest.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That can be done in the guest. The key is
> > > > > > > > > > > > > > > guest wont know that it is dealing
> > > > > > > > > > > > > > with vPASID.
> > > > > > > > > > > > > > > It will follow the same principle from your
> > > > > > > > > > > > > > > paper of equivalency, where virtio
> > > > > > > > > > > > > > software layer will assign PASID to VQ and
> > > > > > > > > > > > > > communicate to
> > > > > > device.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Anyway, all of this just digression from current series.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It's not, as you mention that only MSI-X is
> > > > > > > > > > > > > > trapped, I give you another
> > > > > > > > > > one.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > PASID access from the guest to be done fully by
> > > > > > > > > > > > > the guest
> > > > IOMMU.
> > > > > > > > > > > > > Not by virtio devices.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > You need a virtio specific queue or
> > > > > > > > > > > > > > > > > > capability to assign a PASID to a specific
> > > > > > > > > > > > > > > > > > virtqueue, and that can't be done without
> > > > > > > > > > > > > > > > > > trapping and without virtio specific
> > > > > > > > > > > > > > > > > > knowledge.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I disagree. PASID assignment to a virtqueue in
> > > > > > > > > > > > > > > > > future from guest virtio driver to
> > > > > > > > > > > > > > > > > device is a uniform method.
> > > > > > > > > > > > > > > Whether its PF assigning PASID to VQ of self,
> > > > > > > > > > > > > > > Or VF driver in the guest assigning PASID to VQ.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All same.
> > > > > > > > > > > > > > > Only IOMMU layer hypercalls will know how to
> > > > > > > > > > > > > > > deal with PASID assignment at
> > > > > > > > > > > > > > platform layer to setup the domain etc table.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And this is way beyond our device migration discussion.
> > > > > > > > > > > > > > > > > By any means, if you were implying that
> > > > > > > > > > > > > > > > > somehow vq to PASID assignment _may_ need
> > > > > > > > > > > > > > > > > trap+emulation, hence whole device migration
> > > > > > > > > > > > > > > > > to depend on some trap+emulation, then surely,
> > > > > > > > > > > > > > > > > I do not agree to it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > See above.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yeah, I disagree with such an implication.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD
> > > > > > > > > > > > > > > isolating the guest process and
> > > > > > > > > > > > > > all of that just works on efficiency and
> > > > > > > > > > > > > > equivalence principle already for a decade now
> > > > > > > > > > > > > > without any
> > > > trap+emulation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > When virtio passthrough device is in
> > > > > > > > > > > > > > > > > guest, it has all its PASID
> > > > > > > > > > > > accessible.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > All these is large deviation from current
> > > > > > > > > > > > > > > > > discussion of this series, so I will keep
> > > > > > > > > > > > > > > > it short.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Regardless it is not relevant to
> > > > > > > > > > > > > > > > > > > passthrough mode as PASID is yet
> > > > > > > > > > > > > > > > > > > another
> > > > > > > > > > > > > > > > > > resource.
> > > > > > > > > > > > > > > > > > > And for some cpu if it is trapped, it
> > > > > > > > > > > > > > > > > > > is generic layer, that does not
> > > > > > > > > > > > > > > > > > > require virtio
> > > > > > > > > > > > > > > > > > involvement.
> > > > > > > > > > > > > > > > > > > So virtio interface asking to trap
> > > > > > > > > > > > > > > > > > > something because generic facility has
> > > > > > > > > > > > > > > > > > > done
> > > > > > > > > > > > > > > > > > in not the approach.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This misses the point of PASID. How to
> > > > > > > > > > > > > > > > > > use PASID is totally device
> > > > > > > > > > > > > > specific.
> > > > > > > > > > > > > > > > > Sure, and how to virtualize vPASID/pPASID
> > > > > > > > > > > > > > > > > is platform specific as single PASID
> > > > > > > > > > > > > > > > can be used by multiple devices and process.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Capabilities of #2 is generic
> > > > > > > > > > > > > > > > > > > > > across all pci devices, so it will
> > > > > > > > > > > > > > > > > > > > > be handled by the
> > > > > > > > > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > > > > > > > > ATS/PRI cap is also generic manner
> > > > > > > > > > > > > > > > > > > > > handled by the HV and PCI
> > > > > > > > > > > > > > device.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > No, ATS/PRI requires the cooperation
> > > > > > > > > > > > > > > > > > > > from the
> > > > > > vIOMMU.
> > > > > > > > > > > > > > > > > > > > You can simply do ATS/PRI
> > > > > > > > > > > > > > > > > > > > passthrough but with an emulated
> > > > > > > > > > > > vIOMMU.
> > > > > > > > > > > > > > > > > > > And that is not the reason for virtio
> > > > > > > > > > > > > > > > > > > device to build
> > > > > > > > > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > > > > > > > > passthrough member devices.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > vIOMMU is emulated by hypervisor with a
> > > > > > > > > > > > > > > > > > PRI queue,
> > > > > > > > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Shouldn't it arrive at platform IOMMU first?
> > > > > > > > > > > > > > > > The path should be PRI
> > > > > > > > > > > > > > > > -> RC -> IOMMU -> host -> Hypervisor ->
> > > > > > > > > > > > > > > > -> vIOMMU PRI
> > > > > > > > > > > > > > > > -> -> guest
> > > > > > > > > > IOMMU.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Above sequence seems right.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > And things will be more complicated when
> > > > > > > > > > > > > > > > (v)PASID is
> > > > used.
> > > > > > > > > > > > > > > > So you can't simply let PRI go directly to
> > > > > > > > > > > > > > > > the guest with the current
> > > > > > > > > > > > architecture.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In current architecture of the pci VF, PRI
> > > > > > > > > > > > > > > does not go directly to the
> > > > > > > > > > guest.
> > > > > > > > > > > > > > > (and that is not reason to trap and emulate other things).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we
> > > > > > > > > > > > > > will probably trap other things in the future
> > > > > > > > > > > > > > like PASID
> > > > assignment.
> > > > > > > > > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > > > > > > > > >
> > > > > > > > > > > > It's not about the capability, it's about the whole
> > > > > > > > > > > > process of PRI request handling. We've agreed that
> > > > > > > > > > > > the PRI request needs to be trapped by the
> > > > > > > > > > > > hypervisor and then delivered to the
> > > > > > vIOMMU.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > Trap+emulation done in generic manner without
> > > > > > > > > > > > > Trap+involving virtio or other
> > > > > > > > > > > > device types.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > how can you pass through a hardware PRI
> > > > > > > > > > > > > > > > > > request to a guest directly without
> > > > > > > > > > > > > > > > > > trapping it
> > > > > > > > > > > > > > then?
> > > > > > > > > > > > > > > > > > What's more, PCIE allows the PRI to be
> > > > > > > > > > > > > > > > > > done in a vendor
> > > > > > > > > > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > > > > > > > > for virtio?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > > > > > > > > Do you have a reference to the ECN that
> > > > > > > > > > > > > > > > > enables vendor specific way of PRI? I
> > > > > > > > > > > > > > > > would like to read it.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I mean it doesn't forbid us to build a
> > > > > > > > > > > > > > > > virtio specific interface for I/O page fault report and
> > recovery.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So PRI of PCI does not allow. It is ODP kind
> > > > > > > > > > > > > > > of technique you meant
> > > > > > > > > > above.
> > > > > > > > > > > > > > > Yes one can build.
> > > > > > > > > > > > > > > Ok. unrelated to device migration, so I will
> > > > > > > > > > > > > > > park this good discussion for
> > > > > > > > > > > > later.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > That's fine.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This will be very good to eliminate IOMMU
> > > > > > > > > > > > > > > > > PRI
> > > > limitations.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Probably.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > PRI will directly go to the guest driver,
> > > > > > > > > > > > > > > > > and guest would interact with IOMMU
> > > > > > > > > > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > > > > > > > > But again these PCI-SIG extensions of PASID
> > > > > > > > > > > > > > > are not related to device
> > > > > > > > > > > > > > > migration, so I am deferring it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > For PRI in vendor specific way needs a
> > > > > > > > > > > > > > > > > separate discussion. It is not related to
> > > > > > > > > > > > > > > > live migration.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > PRI itself is not related. But the point is,
> > > > > > > > > > > > > > > > you can't simply pass through ATS/PRI now.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Ah ok. the whole 4K PCI config space where
> > > > > > > > > > > > > > > ATS/PRI capabilities are located
> > > > > > > > > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > > > > > > > > So?
> > > > > > > > > > > > > > > > So do we start emulating virtio interfaces too
> > > > > > > > > > > > > > > for
> > > > passthrough?
> > > > > > > > > > > > > > > No.
> > > > > > > > > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > > > > > > > > Sure why not?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Then let's not limit your proposal to be used by
> > "passthrough"
> > > > > > only?
> > > > > > > > > > > > > One can possibly build some variant of the
> > > > > > > > > > > > > existing virtio member device
> > > > > > > > > > > > using same owner and member scheme.
> > > > > > > > > > > >
> > > > > > > > > > > > It's not about the member/owner, it's about e.g
> > > > > > > > > > > > whether the hypervisor can trap and emulate.
> > > > > > > > > > > >
> > > > > > > > > > > > I've pointed out that what you invent here is
> > > > > > > > > > > > actually a partial new transport, for example, a
> > > > > > > > > > > > hypervisor can trap and use things like device
> > > > > > > > > > > > context in PF to bypass the registers in VF. This is
> > > > > > > > > > > > the idea of
> > > > > > > > > > transport commands/q.
> > > > > > > > > > > >
> > > > > > > > > > > I will not mix transport commands which are mainly
> > > > > > > > > > > useful for actual device
> > > > > > > > > > operation for SIOV only for backward compatibility that
> > > > > > > > > > too
> > > > optionally.
> > > > > > > > > > > One may still choose to have virtio common and device
> > > > > > > > > > > config in MMIO
> > > > > > > > > > ofcourse at lower scale.
> > > > > > > > > > >
> > > > > > > > > > > Anyway, mixing migration context with actual SIOV
> > > > > > > > > > > specific thing is not correct
> > > > > > > > > > as device context is read/write incremental values.
> > > > > > > > > >
> > > > > > > > > > SIOV is transport level stuff, the transport virtqueue
> > > > > > > > > > is designed in a way that is general enough to cover it.
> > > > > > > > > > Let's not shift
> > > > > > concepts.
> > > > > > > > > >
> > > > > > > > > Such TVQ is only for backward compatible vPCI composition.
> > > > > > > > > For ground up work such TVQ must not be done through the
> > > > > > > > > owner
> > > > > > device.
> > > > > > > >
> > > > > > > > That's the idea actually.
> > > > > > > >
> > > > > > > > > Each SIOV device to have its own channel to communicate
> > > > > > > > > directly to the
> > > > > > > > device.
> > > > > > > > >
> > > > > > > > > > One thing that you ignore is that, hypervisor can use
> > > > > > > > > > what you invented as a transport for VF, no?
> > > > > > > > > >
> > > > > > > > > No. by design,
> > > > > > > >
> > > > > > > > It works like hypervisor traps the virtio config and
> > > > > > > > forwards it to admin virtqueue and starts the device via device
> > context.
> > > > > > > It needs more granular support than the management framework
> > > > > > > of device
> > > > > > context.
> > > > > >
> > > > > > It doesn't otherwise it is a design defect as you can't recover
> > > > > > the device context in the destination.
> > > > > >
> > > > > > Let me give you an example:
> > > > > >
> > > > > > 1) in the case of live migration, dst receive migration byte
> > > > > > flows and convert them into device context
> > > > > > 2) in the case of transporting, hypervisor traps virtio config
> > > > > > and convert them into the device context
> > > > > >
> > > > > > I don't see anything different in this case. Or can you give me an
> > example?
> > > > > In #1 dst received byte flows one or multiple times.
> > > >
> > > > How can this be different?
> > > >
> > > > Transport can also receive initial state incrementally.
> > > >
> > > Transport is just a simple register RW interface without any caching layer
> > > in between.
> > > More below.
> > > > > And byte flows can be large.
> > > >
> > > > So when doing transport, it is not that large, that's it. If it can
> > > > work with large byte flow, why can't it work for small?
> > > Write context can be used (abused) for a different purpose.
> > > Read cannot because it is meant to be incremental.
> >
> > Well hypervisor can just cache what it reads since the last, what's wrong with it?
> >
> But hypervisor does not know what changed, so it has to do guesswork to find out what to query.
>
> > > One can invent a cheap command to read it.
> >
> > For sure, but it's not the context here.
> >
> It is.
> > >
> > >
> > > >
> > > > > So it does not always contain everything. It only contains the new
> > > > > delta of the
> > > > device context.
> > > >
> > > > Isn't it just how current PCI transport does?
> > > >
> > > No. PCI transport has an explicit API between device and driver to read or write
> > > at a specific offset and value.
> >
> > The point is that they are functional equivalents.
> >
> I disagree.
> There are two different functionalities.
>
> Functionality_1: explicit ask for read or write
> Functionality_2: read what has changed

This needs to be justified. I won't repeat the questions again here.

>
> Should one merge 1 and 2 and complicate the command?
> I prefer not to.

Again there're functional duplications. E.g your command duplicates
common_cfg for sure.

>
> Now having two different commands help for debugging to differentiate between mgmt. commands and guest initiated commands. :)
>
> > >
> > > > Guest configure the following one by one:
> > > >
> > > > 1) vq size
> > > > 2) vq addresses
> > > > 3) MSI-X
> > > >
> > > > etc?
> > > >
> > > I think you interpreted "incremental" differently than I described.
> > > In the device context read, the incremental is:
> > >
> > > If the hypervisor driver has read the device context twice, the second read
> > > won't return any new data if nothing changed.
> >
> > See above.
> >
> Yeah, two separate commands needed.
>
> > > For example, if RSS configuration didn’t change between two reads, the
> > > second read won't return the TLV for RSS Context.
> > >
> > > While for transport the need is that, when the guest asks, the device must read it
> > > regardless of the change.
> > >
> > > So notion of incremental is not by address, but by the value.
> > >
> > > > > For example, VQ configuration is exchanged once between src and dst.
> > > > > But VQ avail and used index may be updated multiple times.
> > > >
> > > > If it can work with multiple times of updating, why can't it work if
> > > > we just update it once?
> > > Functionally it can work.
> >
> > I think you answer yourself.
> >
> Yes, I don’t like abuse of command.

How do you define abuse, and would the spec ever need to define that?

>
> > > Performance wise, one does not want to update multiple times, unless there
> > > is a change.
> > >
> > > Read as explained above is not meant to return same content again.
> > >
> > > >
> > > > > So here hypervisor do not want to read any specific set of fields
> > > > > and
> > > > hypervisor is not parsing them either.
> > > > > It is just a byte stream for it.
> > > >
> > > > Firstly, spec must define the device context format, so hypervisor
> > > > can understand which byte is what otherwise you can't maintain
> > > > migration compatibility.
> > > Device context is defined already in the latest version.
> > >
> > > > Secondly, you can't mandate how the hypervisor is written.
> > > >
> > > > >
> > > > > As opposed to that, in case of transport, the guest explicitly
> > > > > asks to read or
> > > > write specific bytes.
> > > > > Therefore, it is not incremental.
> > > >
> > > > I'm totally lost. Which part of the transport is not incremental?
> > > >
> > > > >
> > > > > Additionally, if hypervisor has put the trap on virtio config, and
> > > > > because the memory device already has the interface for virtio
> > > > > config,
> > > > >
> > > > > Hypervisor can directly write/read from the virtual config to the
> > > > > member's
> > > > config space, without going through the device context, right?
> > > >
> > > > If it can do it or it can choose to not. I don't see how it is
> > > > related to the discussion here.
> > > >
> > > It is. I don’t see a point of hypervisor not using the native interface provided
> > by the member device.
> >
> > It really depends on the case, and I see how it duplicates with the functionality
> > that is provided by both:
> >
> > 1) The existing PCI transport
> >
> > or
> >
> > 2) The transport virtqueue
> >
> I would like to conclude that we disagree in our approaches.
> PCI transport is for member device to directly communicate from guest driver to the device.
> This is uniform across PF, VFs, SIOV.

For "PCI transport" did you mean the one defined in spec? If yes, how
can it work with SIOV with what you're saying here (a direct
communication channel)?

>
> Admin commands are transport independent and their task is device migration.
> One is not replacing the other.
>
> Transport virtqueue will never transport driver notifications, hence it does not qualify as "transport".

Another double standard.

MMIO will never transport device notification, hence it does not
qualify as "transport"?

>
> For the vdpa case, there is no need for extra admin commands as the mediation layer can directly use the interface available from the member device itself.
>
> You continue to want to overload admin commands for dual purpose, does not make sense to me.
>
> > >
> > >  > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > it is not good idea to overload management commands with
> > > > > > > > > actual run time
> > > > > > > > guest commands.
> > > > > > > > > The device context read writes are largely for incremental updates.
> > > > > > > >
> > > > > > > > It doesn't matter if it is incremental or not but
> > > > > > > >
> > > > > > > It does because you want different functionality only for
> > > > > > > purpose of backward
> > > > > > compatibility.
> > > > > > > That also if the device does not offer them as portion of MMIO BAR.
> > > > > >
> > > > > > I don't see how it is related to the "incremental part".
> > > > > >
> > > > > > >
> > > > > > > > 1) the function is there
> > > > > > > > 2) hypervisor can use that function if they want and virtio
> > > > > > > > (spec) can't forbid that
> > > > > > > >
> > > > > > > It is not about forbidding or supporting.
> > > > > > > Its about what functionality to use for management plane and
> > > > > > > guest
> > > > plane.
> > > > > > > Both have different needs.
> > > > > >
> > > > > > People can have different views, there's nothing we can prevent
> > > > > > a hypervisor from using it as a transport as far as I can see.
> > > > > For device context write command, it can be used (or probably
> > > > > abused) to do
> > > > write but I fail to see why to use it.
> > > >
> > > > The function is there, you can't prevent people from doing that.
> > > >
> > > One can always mess up itself. :)
> > > It is not prevented. It is just not right way to use the interface.
> > >
> > > > > Because member device already has the interface to do config
> > > > > read/write and
> > > > it is accessible to the hypervisor.
> > > >
> > > > Well, it looks self-contradictory again. Are you saying another set
> > > > of commands that is similar to device context is needed for non-PCI
> > transport?
> > > >
> > > All this non-PCI transport discussion is just meaningless.
> > > Let MMIO bring the concept of a member device; at that point it makes
> > > sense to discuss.
> >
> > It's not necessarily MMIO. For example the SIOV, which I don't think can use the
> > existing PCI transport.
> >
> > > PCI SIOV is also the PCI device at the end.
> >
> > We don't want to end up with two sets of commands to save/load SRIOV and
> > SIOV at least.
> >
> This proposal ensures that SRIOV and SIOV devices are treated equally.

How? Did you mean your proposal can work for SIOV? What's the transport then?

> How a brand new non-compatible SIOV device transports this is outside the scope of this work.

You invented one that can be used for doing this. If you disagree, how
can we know your proposal can work for SIOV without a transport then?

Thanks



^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-21  4:22                                                                                               ` Jason Wang
@ 2023-11-21 16:25                                                                                                 ` Parav Pandit
  2023-11-22  4:13                                                                                                   ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-21 16:25 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 9:53 AM
> On Thu, Nov 16, 2023 at 2:34 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 16, 2023 11:53 AM
> > >
> > > On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > > > You continue to want to overload admin commands for dual purpose,
> > > > does
> > > not make sense to me.
> > >
> > > dual -> as a transport and for migration? why can't they be used for
> > > this? I was really hoping to cover these two cases when I proposed them.
> > For following reasons.
> >
> > 1. migration needs incremental reads of only changed context between
> > two reads
> 
> This is wrong. We need to invent general facilities. I've pointed out sufficient
> issues, and what's more delta doesn't work for the following cases:
> 
I disagree with your comments above. Please read below.
It works.

> 1) migration fails and another try for migration is made

When hypervisor wants to retry the migration, it invokes the DISCARD command present in v4.
And starts reading the device context again.

> 2) save vm state twice
> 
This can be saved twice when needed.

> It requires a lot of tricks in hypervisor to do that (e.g. cache the last state?).
Not really.
Hypervisor can always discard what it read and re-read it again as fresh device context.
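
To make the read semantics described in this thread concrete, here is a toy sketch in Python. It only illustrates the idea that an incremental read returns what changed by value since the previous read, that a discard makes the next read start fresh, and that a transport-style access always returns the current value. The class, method and field names are made up for illustration; they are not the admin commands or the TLV layout defined by the series.

```python
# Toy model only. Class, method and field names are illustrative assumptions,
# not the admin commands or TLV layout defined by this series.
_UNREPORTED = object()  # sentinel: field has never been returned to the reader


class ToyDeviceContext:
    def __init__(self, fields):
        self._fields = dict(fields)  # current device state
        self._reported = {}          # values already returned by an incremental read

    def device_update(self, name, value):
        # State changes on the device side, e.g. a VQ index advances.
        self._fields[name] = value

    def read_explicit(self, name):
        # Transport-style access: always returns the current value.
        return self._fields[name]

    def read_incremental(self):
        # Migration-style read: return only values that differ from what
        # was last reported, then remember them as reported.
        delta = {
            name: value
            for name, value in self._fields.items()
            if self._reported.get(name, _UNREPORTED) != value
        }
        self._reported.update(delta)
        return delta

    def discard(self):
        # Forget everything reported so far; the next read starts fresh.
        self._reported.clear()


ctx = ToyDeviceContext({"vq_size": 256, "avail_idx": 0})
first = ctx.read_incremental()    # everything, nothing was reported yet
second = ctx.read_incremental()   # empty, nothing changed
ctx.device_update("avail_idx", 5)
third = ctx.read_incremental()    # only the changed field
ctx.discard()
fresh = ctx.read_incremental()    # full context again after discard
```

In this sketch the comparison is by value, not by address: a field that is rewritten with the same value between two reads does not show up in the second delta.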

> 
> >
> > 2. migration writes cover a large part of the configuration, not just virtio
> > common config and device config.
> 
> You invent a duplication of common_cfg structure no?
> 
No.
Common configuration is written by the guest directly via MMIO, at byte/word etc. boundaries, in the guest-owned area.

> What's wrong if we just allow them to be R/W over adminq/commands?
> 
As explained before,
Each guest has its own dedicated non mediated interface as defined in virtio spec to not involve hypervisor.
This is uniform for PF and SR-IOV VFs.
And it will be uniform for backward compatible SIOV tomorrow, once the performance numbers for SIOV are available.
For non-backward compatible SIOV, there is better way to not have such a large config space anyway.

> > Such as configuration that occurred through the CVQ. None of this is needed
> > when done from the guest directly via the member's own CVQ.
> 
> That's the device type specific state which requires new commands for sure. I
> don't see any connection. The SIOV device needs to be migrated as well.
> 
And they will use the majority of the device context as presented here.

> >
> > For backward compatible SIOV transport, one may need them to transport
> > without above two properties.
> 
> Why, just mediate between virtual PCI and adminq.
> 
I don’t understand this.

> >
> > 3. None of this transport is needed for PFs, VFs and non-backward
> > compatible SIOVs.
> > Each device has its own transport that is not intercepted by the
> > hypervisor and follows the equivalency principle uniformly for all 3 device types.
> 
> You can have per VF transport q, what's wrong with that?
As explained in the doc and in multiple emails, it is inefficient. CVQ is just enough.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-21  4:44                                                                                                             ` Jason Wang
@ 2023-11-21 16:27                                                                                                               ` Parav Pandit
  2023-11-22  4:16                                                                                                                 ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-21 16:27 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 10:15 AM
> 
> On Fri, Nov 17, 2023 at 11:09 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> >
> > On Fri, Nov 17, 2023 at 02:51:04PM +0000, Parav Pandit wrote:
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 7:24 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 12:46:21PM +0000, Parav Pandit wrote:
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 6:00 PM
> > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 5:13 PM
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit
> wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit
> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Additionally, if hypervisor has put the trap
> > > > > > > > > > > > > > > on virtio config, and because the memory
> > > > > > > > > > > > > > > device already has the interface for virtio
> > > > > > > > > > > > > > > config,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hypervisor can directly write/read from the
> > > > > > > > > > > > > > > virtual config to the member's
> > > > > > > > > > > > > > config space, without going through the device context,
> right?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > If it can do it or it can choose to not. I
> > > > > > > > > > > > > > don't see how it is related to the discussion here.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > It is. I don’t see a point of hypervisor not
> > > > > > > > > > > > > using the native interface provided
> > > > > > > > > > > > by the member device.
> > > > > > > > > > > >
> > > > > > > > > > > > So for example, it seems reasonable to a member
> > > > > > > > > > > > supporting both existing pci register interface
> > > > > > > > > > > > for compatibility and the future DMA based one for
> > > > > > > > > > > > scale. In such a case, it seems possible that DMA
> > > > > > > > > > > > will expose more features than pci. And then a
> > > > > > > > > > > > hypervisor might decide to use
> > > > > > > > > > that in preference to pci registers.
> > > > > > > > > > >
> > > > > > > > > > > We don’t find it right to involve owner device for
> > > > > > > > > > > mediating at current scale
> > > > > > > > > >
> > > > > > > > > > In this model, device will be its own owner. Should not be a
> problem.
> > > > > > > > > >
> > > > > > > > > I didn’t understand above comment.
> > > > > > > >
> > > > > > > > We'd add a new group type "self". You can then send admin
> > > > > > > > commands through VF itself not through PF.
> > > > > > > >
> > > > > > > How? The device is owned by the guest. FLR and device reset
> > > > > > > cannot send the
> > > > > > admin command reliably.
> > > > > >
> > > > > > It's of the "it hurts when I do this - don't do this then" category.
> > > > > >
> > > > > it is don’t do medication category, yes due all this weirdness
> > > > > that has been
> > > > asked.
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > > and to not break TDISP efforts in upcoming time by such
> design.
> > > > > > > > > >
> > > > > > > > > > Look you either stop mentioning TDISP as motivation or
> > > > > > > > > > actually try to address it. Safe migration with TDISP is really
> hard.
> > > > > > > > > But that is not an excuse to say that TDISP migration is
> > > > > > > > > not present, hence
> > > > > > > > involve the owner device for config space access.
> > > > > > > > > This is another hurdle added that further blocks us away from
> TDISP.
> > > > > > > > > Hence, we don’t want to take the route of involving
> > > > > > > > > owner device for any
> > > > > > > > config access.
> > > > > > > >
> > > > > > > > This "blocks" is all just wild hunches. hypervisor
> > > > > > > > controls some aspects of TDISP devices for sure - maybe we
> > > > > > > > actually should use pci config space as that is generally hypervisor
> controlled.
> > > > > > > Even bad to do hypercalls.
> > > > > > > I showed you last time the role of the PCI config space
> > > > > > > snippet from the
> > > > spec.
> > > > > >
> > > > > > Yes I remember. This is just an example though. My point is
> > > > > > maybe it is solvable maybe it is not.
> > > > > >
> > > > > > > Do you see we are repeating the discussion again?
> > > > > >
> > > > > > One of the reasons is that people bring up irrelevances. TDISP
> > > > > > is important but has to be addressed or deferred not vaguely referred
> to.
> > > > >
> > > > > So lets continue to follow the current TDISP direction of not
> > > > > involving
> > > > hypervisor for virtio common and device config.
> > >
> > > If you disagree to it, please speak now, so that we don’t debate on this
> again in next 3 days.
> > > Because this is the fundamental design considerations it relied on.
> > > There is no point going forward if you want to disagree to it.
> > > Other variants are fine, but other variants cannot be the only choice.
> > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > For example, your current patches are clearly broken for TDISP:
> > > > > > > > > > owner can control queue state at any time making
> > > > > > > > > > device modify memory in any way it wants.
> > > > > > > > > >
> > > > > > > > > When TDISP migration is needed, the admin device can be
> > > > > > > > > another TVM
> > > > > > > > outside the HV scope.
> > > > > > > > > Or an alternative would have device context encrypted
> > > > > > > > > not visible to HV at
> > > > > > all.
> > > > > > > >
> > > > > > > > Maybe. Fact remains your patches do conflict with TDISP
> > > > > > > > and you seem to be fine with it because you have a hunch you can
> fix it.
> > > > > > > > But we can't do development based on your hunches.
> > > > > > > >
> > > > > > > We have different view.
> > > > > > > My patches do not conflict with TDISP because TDISP has
> > > > > > > clear definition of
> > > > > > not involving hypervisor for transport.
> > > > > > > And that part is still preserved.
> > > > > > > Delegating the migration to another TDISP or encrypting is
> > > > > > > yet to be
> > > > defined.
> > > > > > > And current patches will align to both the approaches in future.
> > > > > > >
> > > > > > > So you need to re-evaluate your judgment.
> > > > > >
> > > > > > If you like they do not "conflict".  But if used with TDISP
> > > > > > they just make it insecure and thus completely worthless.  If
> > > > > > hypervisor can change ring state to make device poke at random
> > > > > > guest memory then it is game over and all the effort spent was
> security theater.
> > > > > Not really, I proposed two options.
> > > > > 1. delegate the task of LM to the TVM. (proposed by two cpu vendors).
> > > > > In this case all the infra we build here, just works fine.
> > > >
> > > > I think modification will be needed: currently commands are sent
> > > > through the PF, and that is under hypervisor control.
> > > > You should not assign PF to TVM.
> 
> That's the point. And that's why it keeps people confused to believe the
> current PF/adminq can work in the TDISP.
> 
There is no confusion.
The admin queue interface ensures, as a first step, that the TDISP interface remains dedicated to the guest as it is today.
There is no bifurcation added on the VF that needs extra mediation.

> > > Yes, an admin virtio function will be there which will do the admin
> commands listed.
> >
> > So it can't be PF, so at least we need a new group type.
> > I am inclined to then say, operate it through VF itself.
> 
> So it exactly matches the idea of transport virtqueue (a per VF/SF one).
> 
There is no need for a transport virtqueue for the VF, as the VF device follows the same uniform principle as the PF.
If you want transport vq, please have it on the PF too.
And that also is not needed because there is already CVQ.

> But it still requires a PCI part to bootstrap.
> 
> >
> >
> > > >
> > > > > It also does not require any hypervisor mediation for control plane.
> > > > >
> > > > > 2. Encrypt the owner device workload to be not seen by
> > > > > hypervisor
> > > > >
> > > > > > Both methods do not affect the current direction.
> > > > >
> > > > > But if we force trap+emulation, it is 100% broken for TDISP.
> > > > > And I would not promote that.
> > > > >
> > > > > > But you know this, don't you? This is why you mentioned encrypting
> device.
> > > > > > Maybe that works. It just does not work *as is*.
> > > > > It works as_is. But current infrastructure does not block the future
> work.
> > > > >
> > > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > Such encryption is not possible, with the trap+emulation
> > > > > > > > > method, where HV
> > > > > > > > will have to decrypt the data coming over MMIO writes.
> > > > > > > >
> > > > > > > > > I don't know what trap+emulation has to do with it. Do you
> > > > > > > > refer to the shadow vq thing?
> > > > > > >
> > > > > > > The method proposed here does not hinder any TDISP direction.
> > > > > >
> > > > > > direction? No, why would it. we can always add more commands
> > > > > > that are safe for TDISP. commands you propose here are unsafe for
> TDISP.
> > > > > >
> > > > > > > Without my proposal, do you have a method that does not
> > > > > > > involve hypervisor
> > > > > > intervention for virtio common and device config space, cvq
> > > > > > and shadow
> > > > vq?
> > > > > > > If so, I would like to hear that as well because that will align with
> TDISP.
> > > > > >
> > > > > > I really did not give it much thought.  I suspect for TDISP it
> > > > > > just might be cleaner to have guest agent migrate device.
> > > > > > Certainly removes all
> > > > the messy questions.
> > > > > > > That, to me, implies there needs to be a way to send migration
> > > > > > commands through VF itself. Does this "involve hypervisor
> > > > > > intervention"? No one should care I think.
> > > > > Too far of the future to envision. May be yes. When such
> > > > > platform is built, for sure whoever migrates need migrate its device side
> too.
> > > > > Some knowledge of migration driver is needed.
> > > >
> > > > So TDISP migration is so far in the future you do not need to bother
> about it.
> > > > Fine. Then don't bring it up pls.
> > > >
> > > As long as we are aligned to the requirement that a virtio member device is
> mapped to the guest VM without mediating the virtio interface, I am good.
> > > Again, other variants are fine, but above listed mapped variant is the
> minimum variant needed.
> >
> > I think it's worth supporting this. I wouldn't call this minimum there
> > are other approaches.  And I am not so sure it's worth trying to
> > support this in all kind of systems such as IOMMU without dirty bit
> > support. If some old systems will need mediation, this is kind of like
> > legacy interface. Not a big deal.
> >
> 
> +1

There are users with the recent cpus that may not have the IOMMU dirty page tracking support.
So I don’t fully agree.


^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-21  5:25                                                                                                   ` Jason Wang
@ 2023-11-21 16:30                                                                                                     ` Parav Pandit
  2023-11-22  4:18                                                                                                       ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-21 16:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 10:55 AM
> 
> On Fri, Nov 17, 2023 at 8:03 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, November 17, 2023 5:13 PM
> > >
> > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > >
> > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > > > >
> > > > > > > > > > Additionally, if hypervisor has put the trap on virtio
> > > > > > > > > > config, and because the memory device already has the
> > > > > > > > > > interface for virtio config,
> > > > > > > > > >
> > > > > > > > > > Hypervisor can directly write/read from the virtual
> > > > > > > > > > config to the member's
> > > > > > > > > config space, without going through the device context, right?
> > > > > > > > >
> > > > > > > > > If it can do it or it can choose to not. I don't see how
> > > > > > > > > it is related to the discussion here.
> > > > > > > > >
> > > > > > > > It is. I don’t see a point of hypervisor not using the
> > > > > > > > native interface provided
> > > > > > > by the member device.
> > > > > > >
> > > > > > > So for example, it seems reasonable to a member supporting
> > > > > > > both existing pci register interface for compatibility and
> > > > > > > the future DMA based one for scale. In such a case, it seems
> > > > > > > possible that DMA will expose more features than pci. And
> > > > > > > then a hypervisor might decide to use
> > > > > that in preference to pci registers.
> > > > > >
> > > > > > We don’t find it right to involve owner device for mediating
> > > > > > at current scale
> > > > >
> > > > > In this model, device will be its own owner. Should not be a problem.
> > > > >
> > > > I didn’t understand above comment.
> > >
> > > We'd add a new group type "self". You can then send admin commands
> > > through VF itself not through PF.
> > >
> > How? The device is owned by the guest. FLR and device reset cannot send
> the admin command reliably.
> >
> > >
> > > > > > and to not break TDISP efforts in upcoming time by such design.
> > > > >
> > > > > Look you either stop mentioning TDISP as motivation or actually
> > > > > try to address it. Safe migration with TDISP is really hard.
> > > > But that is not an excuse to say that TDISP migration is not
> > > > present, hence
> > > involve the owner device for config space access.
> > > > This is another hurdle added that further blocks us away from TDISP.
> > > > Hence, we don’t want to take the route of involving owner device
> > > > for any
> > > config access.
> > >
> > > This "blocks" is all just wild hunches. hypervisor controls some
> > > aspects of TDISP devices for sure - maybe we actually should use pci
> > > config space as that is generally hypervisor controlled.
> > Even bad to do hypercalls.
> > I showed you last time the role of the PCI config space snippet from the
> spec.
> > Do you see we are repeating the discussion again?
> >
> > >
> > > > > For example, your current patches are clearly broken for TDISP:
> > > > > owner can control queue state at any time making device modify
> > > > > memory in any way it wants.
> > > > >
> > > > When TDISP migration is needed, the admin device can be another
> > > > TVM
> > > outside the HV scope.
> > > > Or an alternative would have device context encrypted not visible to HV
> at all.
> > >
> > > Maybe. Fact remains your patches do conflict with TDISP and you seem
> > > to be fine with it because you have a hunch you can fix it. But we
> > > can't do development based on your hunches.
> > >
> > We have different view.
> > My patches do not conflict with TDISP because TDISP has clear definition of
> not involving hypervisor for transport.
> > And that part is still preserved.
> > Delegating the migration to another TDISP or encrypting is yet to be defined.
> > And current patches will align to both the approaches in future.
> >
> > So you need to re-evaluate your judgment.
> >
> > >
> > > > Such encryption is not possible, with the trap+emulation method,
> > > > where HV
> > > will have to decrypt the data coming over MMIO writes.
> > >
> > > > I don't know what trap+emulation has to do with it. Do you refer to
> > > the shadow vq thing?
> >
> > The method proposed here does not hinder any TDISP direction.
> >
> > Without my proposal, do you have a method that does not involve
> hypervisor intervention for virtio common and device config space, cvq and
> shadow vq?
> > If so, I would like to hear that as well because that will align with TDISP.
> 
> So this is what you said:
> 
> 1) TDISP would not do mediation
> 2) registers don't scale
> 
> This is exactly what transport virtqueue did. Isn't it?
> 
No. 
CVQ is doing both of them currently uniformly across PF, VF.
Future SIOV will be able to do this also.

> >
> > > I am guessing modern platforms with TDISP support are likely to also
> > > support dirty bit in the IOMMU.
> > >
> > It will be some day.
> 
> Dirty bit is far more realistic than TDISP in the short term.
> 
> >
> > >
> > > > > > And for future scale, having new SIOV interface makes more
> > > > > > sense which has
> > > > > its own direct interface to device.
> > > > > >
> > > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > > >
> > > > > > [1]
> > > > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > > > >
> > > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > > discussion will take place on the mailing list please.
> > > >
> > > > We cannot keep discussing the register interface every week.
> > > > I remember we have discussed this many times already in following
> series.
> > > >
> > > > 1. legacy series
> 
> How can this be supported in TDISP then?
> 
> > > > 2. tvq v4 series
> > > > 3. dynamic vq creation series
> > > > > 4. again during suspend series under tvq head
> > > > > 5. right now
> > > > > 6. Maybe more that I forgot.
> > > >
> > > > I captured all the direction and options in the doc. One can refer
> > > > when those
> > > questions arise there.
> > > > If we don’t work cohesively same reasoning repetition does not help.
> > >
> > > It's still the same too, doc or no doc. You want to build a device
> > > without registers fine but don't force it down everyone's throat.
> > I don’t see any compelling reason for inventing new method really.
> 
> New requests/platforms come for sure, and virtio supports various transports.
> 
> For example, there's a request to support PCI endpoint devices.
What are PCI endpoint devices? Do they have a new device type?

> 
> > Nor continuing in register mode.
> 
> Most virtio devices are implemented in software. 
We see it differently in the field and in the virtio charter since 2021.

> And we have pure MMIO
> based transport now which is implemented in registers only.
> 
> > Virtio already has VQ.
> > If CVQ is so problematic, one should put everything on registers and not run
> on double standards.
> 
> I don't think there's anyone who says CVQ is problematic.
> 
Ok, then let's stop this endless debate.
Everyone is using CVQ; let's continue to use it.

> >
> > I captured all the reasoning and thoughts. I don’t have much to say in
> support of infinite register scale.
> >
> > People who want to push SIOV do not show a single performance reason
> on why SIOV should be done.
> > I have upstreamed SIOVs in Linux as SFs without PASID, and in all our scale
> tests, before the device chokes, the system chokes.
> >
> > So when someone pushes the SIOV series, I will be the first one interested in
> reading the performance numbers to proceed with patches.
> >
> > > And now with 8MBytes
> > > of on-device memory that's needed for migration and that's
> > > apparently fine I am even less interested in saving 256 bytes for config
> space.
> >
> > Again, not the right comparison.
> > When and how to use 256 matters.
> 
> Do you know how much the config has grown in the past years since 1.0?
> 
Very little, and there is no point in deviating from the design now anyway, for device migration or otherwise.

> Virtio should be implemented easily from:
> 
And it is already there.

> 1) software device to hardware device
> 2) embedded to server
> 
> You can't say e.g migration is needed in all of the environments.
Which line in the patch said this?
I specifically asked to not build transport vq because efficiency is needed on the PFs too.

> 
> Thanks
> 
> > I haven’t come across any device that prefers infinite register scale.

^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-21  7:24                                                                                           ` Jason Wang
@ 2023-11-21 16:32                                                                                             ` Parav Pandit
  2023-11-22  5:27                                                                                               ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-21 16:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 12:54 PM
> 
> On Thu, Nov 16, 2023 at 1:28 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Thursday, November 16, 2023 9:50 AM
> > >
> > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Monday, November 13, 2023 9:03 AM
> > > > >
> > > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Tuesday, November 7, 2023 9:35 AM
> > > > > > >
> > > > > > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > > > > > >
> > > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit
> > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav
> > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > From:
> > > > > > > > > > > > > > > > > virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > > > > > > <virtio-comment@lists.oasis-open.org>
> > > > > > > > > > > > > > > > > On Behalf Of Jason Wang
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav
> > > > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > From: Jason Wang
> > > > > > > > > > > > > > > > > > > <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > > > Sent: Thursday, October 26, 2023
> > > > > > > > > > > > > > > > > > > 6:16 AM
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM
> > > > > > > > > > > > > > > > > > > Parav Pandit <parav@nvidia.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > From: Jason Wang
> > > > > > > > > > > > > > > > > > > > > <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, October 25,
> > > > > > > > > > > > > > > > > > > > > 2023
> > > > > > > > > > > > > > > > > > > > > 6:59 AM
> > > > > > > > > > > > > > > > > > > > > > For passthrough PASID
> > > > > > > > > > > > > > > > > > > > > > assignment vq is not
> > > > > needed.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > > > > > > > > Because for passthrough, the
> > > > > > > > > > > > > > > > > > > > hypervisor is not involved in
> > > > > > > > > > > > > > > > > > > > dealing with VQ at
> > > > > > > > > > > > > > > > > > > all.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Ok, so if I understand correctly,
> > > > > > > > > > > > > > > > > > > you are saying your design can't
> > > > > > > > > > > > > > > > > > > work for the case of PASID
> > > > > assignment.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > No. PASID assignment will happen from
> > > > > > > > > > > > > > > > > > the guest for its own use and device
> > > > > > > > > > > > > > > > > migration will just work fine because
> > > > > > > > > > > > > > > > > device context will capture
> > > > > > > > > this.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > It's not about device context. We're
> > > > > > > > > > > > > > > > > discussing "passthrough",
> > > > > > > > > no?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Not sure, we are discussing same.
> > > > > > > > > > > > > > > > A member device is passthrough to the
> > > > > > > > > > > > > > > > guest, dealing with its own PASIDs and
> > > > > > > > > > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > > > > > > > > > So VQ context captured by the hypervisor,
> > > > > > > > > > > > > > > > will have some PASID attached to
> > > > > > > > > > > > > > > this VQ.
> > > > > > > > > > > > > > > > Device context will be updated.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > You want all virtio stuff to be
> > > > > > > > > > > > > > > > > "passthrough", but assigning a PASID to
> > > > > > > > > > > > > > > > > a specific virtqueue in the guest must
> > > > > > > > > > > > > > > > > be
> > > > > > > > > trapped.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > No. PASID assignment to a specific
> > > > > > > > > > > > > > > > virtqueue in the guest must go directly
> > > > > > > > > > > > > > > from guest to device.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This works like setting CR3, you can't
> > > > > > > > > > > > > > > simply let it go from guest to
> > > > > > > > > host.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Host IOMMU driver needs to know the PASID to
> > > > > > > > > > > > > > > program the IO page tables correctly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > This will be done by the IOMMU.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > When guest iommu may need to communicate
> > > > > > > > > > > > > > > > anything for this PASID, it will
> > > > > > > > > > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Let's say using PASID X for queue 0, this
> > > > > > > > > > > > > > > knowledge is beyond the IOMMU scope but
> > > > > > > > > > > > > > > belongs to virtio. Or please explain how it
> > > > > > > > > > > > > > > can work when it goes directly from guest to
> > > > > > > device.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It has one.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > For ok for theory sake it is there.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Virtio driver will assign the PASID directly
> > > > > > > > > > > > > > from guest driver to device using a
> > > > > > > > > > > > > create_vq(pasid=X) command.
> > > > > > > > > > > > > > Same process is somehow attached the PASID by
> > > > > > > > > > > > > > the guest
> > > OS.
> > > > > > > > > > > > > > The whole PASID range is known to the
> > > > > > > > > > > > > > hypervisor when the device is handed
> > > > > > > > > > > > > over to the guest VM.
> > > > > > > > > > > > >
> > > > > > > > > > > > > How can it know?
> > > > > > > > > > > > >
> > > > > > > > > > > > > > So PASID mapping is setup by the hypervisor
> > > > > > > > > > > > > > IOMMU at this
> > > > > point.
> > > > > > > > > > > > >
> > > > > > > > > > > > > You disallow the PASID to be virtualized here.
> > > > > > > > > > > > > What's more, such a PASID passthrough has
> > > > > > > > > > > > > security
> > > implications.
> > > > > > > > > > > > >
> > > > > > > > > > > > No. virtio spec is not disallowing. At least for
> > > > > > > > > > > > sure, this series is not the
> > > > > > > > > one.
> > > > > > > > > > > > My main point is, virtio device interface will not
> > > > > > > > > > > > be the source of hypercall to
> > > > > > > > > > > program IOMMU in the hypervisor.
> > > > > > > > > > > > It is something to be done by IOMMU side.
> > > > > > > > > > >
> > > > > > > > > > > So unless vPASID can be used by the hardware you
> > > > > > > > > > > need to trap the mapping from a PASID to a
> > > > > > > > > > > virtqueue. Then you need virtio specific
> > > > > > > > > knowledge.
> > > > > > > > > > >
> > > > > > > > > > vPASID by hardware is unlikely to be used by hw PCI EP
> > > > > > > > > > devices at least in any
> > > > > > > > > near term future.
> > > > > > > > > > This requires either vPASID to pPASID table in device or in
> IOMMU.
> > > > > > > > >
> > > > > > > > > So we are on the same page.
> > > > > > > > >
> > > > > > > > > Claiming a method that can only work for passthrough or
> > > > > > > > > emulation is not
> > > > > > > good.
> > > > > > > > > We all know virtualization is passthrough + emulation.
> > > > > > > > > Again, I agree but I won't generalize it here.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Again, we are talking about different things,
> > > > > > > > > > > > > I've tried to show you that there are cases that
> > > > > > > > > > > > > passthrough can't work but if you think the only
> > > > > > > > > > > > > way for migration is to use passthrough in every
> > > > > > > > > > > > > case, you will
> > > > > > > > > > > probably fail.
> > > > > > > > > > > > >
> > > > > > > > > > > > I didn't say only way for migration is passthrough.
> > > > > > > > > > > > Passthrough is clearly one way.
> > > > > > > > > > > > Other ways may be possible.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > There are works ongoing to make
> > > > > > > > > > > > > > > > > > > > > vPASID work for the guest like
> > > > > > > > > > > > > > > vSVA.
> > > > > > > > > > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > > > > > > > > > > > Passthrough does not run like SVA.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Great, you find another limitation
> > > > > > > > > > > > > > > > > > > of "passthrough" by
> > > > > > > > > yourself.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > No. it is not the limitation it is
> > > > > > > > > > > > > > > > > > just the way it does not need complex
> > > > > > > > > > > > > > > > > > SVA to
> > > > > > > > > > > > > > > > > split the device for unrelated usage.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > How can you limit the user in the guest
> > > > > > > > > > > > > > > > > to not use
> > > vSVA?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > He he, I am not limiting, again
> > > > > > > > > > > > > > > > misunderstanding or wrong
> > > > > > > > > attribution.
> > > > > > > > > > > > > > > > I explained that hypervisor for
> > > > > > > > > > > > > > > > passthrough does not need
> > > > > SVA.
> > > > > > > > > > > > > > > > Guest can do anything it wants from the
> > > > > > > > > > > > > > > > guest OS with the member
> > > > > > > > > > > > > device.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Ok, so the point stands, see above.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I don’t think so. The guest owns its PASID
> > > > > > > > > > > > > > space
> > > > > > > > > > > > >
> > > > > > > > > > > > > Again, vPASID to PASID can't be done hardware
> > > > > > > > > > > > > unless I miss some recent features of IOMMUs.
> > > > > > > > > > > > >
> > > > > > > > > > > > Cpu vendors have different way of doing vPASID to pPASID.
> > > > > > > > > > >
> > > > > > > > > > > At least for the current version of major IOMMU
> > > > > > > > > > > vendors, such translation (aka PASID remapping) is
> > > > > > > > > > > not implemented in the hardware so it needs to be trapped
> first.
> > > > > > > > > > >
> > > > > > > > > > > Right. So it is really far in the future, at least a few years away.
> > > > > > > > > >
> > > > > > > > > > > > It is still an early space for virtio.
> > > > > > > > > > > >
> > > > > > > > > > > > > > and directly communicates like any other device
> attribute.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Each passthrough device has PASID
> > > > > > > > > > > > > > > > > > > > from its own space fully managed
> > > > > > > > > > > > > > > > > > > > by the
> > > > > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > > > > > Some cpu required vPASID and SIOV
> > > > > > > > > > > > > > > > > > > > is not going this way
> > > > > > > > > > > > anymore.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Then how to migrate? Invent a full
> > > > > > > > > > > > > > > > > > > set of something else through
> > > > > > > > > > > > > > > > > > > another giant series like this to
> > > > > > > > > > > > > > > > > > > migrate to the SIOV
> > > > > > > > > > > thing?
> > > > > > > > > > > > > > > > > > > That's a mess for
> > > > > > > > > > > > > > > > > sure.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > SIOV will for sure reuse most or all
> > > > > > > > > > > > > > > > > > parts of this work, almost entirely
> > > > > > > > > > > > > as_is.
> > > > > > > > > > > > > > > > > > vPASID is cpu/platform specific things
> > > > > > > > > > > > > > > > > > not part of the SIOV
> > > > > > > > > devices.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > If at all it is done, it will
> > > > > > > > > > > > > > > > > > > > > > be done from the guest by the
> > > > > > > > > > > > > > > > > > > > > > driver using virtio
> > > > > > > > > > > > > > > > > > > > > interface.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Then you need to trap. Such
> > > > > > > > > > > > > > > > > > > > > things couldn't be passed
> > > > > > > > > > > > > > > > > > > > > through to guests
> > > > > > > > > > > > > > > > > > > directly.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Only PASID capability is trapped.
> > > > > > > > > > > > > > > > > > > > PASID allocation and usage is
> > > > > > > > > > > > > > > > > > > > directly from
> > > > > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > How can you achieve this? Assigning
> > > > > > > > > > > > > > > > > > > > a PASID to a device is completely
> > > > > > > > > > > > > > > > > > > device(virtio) specific. How can you
> > > > > > > > > > > > > > > > > > > use a general layer without the
> > > > > > > > > > > > > > > > > > > knowledge of virtio to trap
> > > > > that?
> > > > > > > > > > > > > > > > > > When one wants to map vPASID to pPASID
> > > > > > > > > > > > > > > > > > a platform needs to be
> > > > > > > > > > > > > > > involved.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I'm not talking about how to map vPASID
> > > > > > > > > > > > > > > > > to pPASID, it's out of the scope of
> > > > > > > > > > > > > > > > > virtio. I'm talking about assigning a
> > > > > > > > > > > > > > > > > vPASID to a specific virtqueue or other
> > > > > > > > > > > > > > > > > virtio function in the
> > > > > > > > > guest.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > That can be done in the guest. The key is
> > > > > > > > > > > > > > > > guest wont know that it is dealing
> > > > > > > > > > > > > > > with vPASID.
> > > > > > > > > > > > > > > > It will follow the same principle from
> > > > > > > > > > > > > > > > your paper of equivalency, where virtio
> > > > > > > > > > > > > > > software layer will assign PASID to VQ and
> > > > > > > > > > > > > > > communicate to
> > > > > > > device.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Anyway, all of this just digression from current series.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It's not, as you mention that only MSI-X is
> > > > > > > > > > > > > > > trapped, I give you another
> > > > > > > > > > > one.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > PASID access from the guest to be done fully
> > > > > > > > > > > > > > by the guest
> > > > > IOMMU.
> > > > > > > > > > > > > > Not by virtio devices.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > You need a virtio specific queue or
> > > > > > > > > > > > > > > > > capability to assign a PASID to a
> > > > > > > > > > > > > > > > > specific virtqueue, and that can't be
> > > > > > > > > > > > > > > > > > done without trapping and without virtio
> > > > > > > > > > > > > > > > > specific
> > > > > > > knowledge.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I disagree. PASID assignment to a virtqueue
> > > > > > > > > > > > > > > > in future from guest virtio driver to
> > > > > > > > > > > > > > > device is uniform method.
> > > > > > > > > > > > > > > > Whether its PF assigning PASID to VQ of
> > > > > > > > > > > > > > > > self, Or VF driver in the guest assigning PASID to VQ.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > All same.
> > > > > > > > > > > > > > > > Only IOMMU layer hypercalls will know how
> > > > > > > > > > > > > > > > to deal with PASID assignment at
> > > > > > > > > > > > > > > platform layer to setup the domain etc table.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > And this is way beyond our device migration
> discussion.
> > > > > > > > > > > > > > > > By any means, if you were implying that
> > > > > > > > > > > > > > > > somehow vq to PASID assignment
> > > > > > > > > > > > > > > _may_ need trap+emulation, hence whole
> > > > > > > > > > > > > > > device migration to depend on some
> > > > > > > > > > > > > > > > trap+emulation, then surely I do not agree with it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > See above.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yeah, I disagree to such implying.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > PASID equivalent in mlx5 world is
> > > > > > > > > > > > > > > > ODP_MR+PD isolating the guest process and
> > > > > > > > > > > > > > > all of that just works on efficiency and
> > > > > > > > > > > > > > > equivalence principle already for a decade
> > > > > > > > > > > > > > > now without any
> > > > > trap+emulation.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > When virtio passthrough device is in
> > > > > > > > > > > > > > > > > > guest, it has all its PASID
> > > > > > > > > > > > > accessible.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > All these is large deviation from
> > > > > > > > > > > > > > > > > > current discussion of this series, so
> > > > > > > > > > > > > > > > > > I will keep
> > > > > > > > > > > > > > > > > it short.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Regardless it is not relevant to
> > > > > > > > > > > > > > > > > > > > passthrough mode as PASID is yet
> > > > > > > > > > > > > > > > > > > > another
> > > > > > > > > > > > > > > > > > > resource.
> > > > > > > > > > > > > > > > > > > > And for some cpu if it is trapped,
> > > > > > > > > > > > > > > > > > > > it is generic layer, that does not
> > > > > > > > > > > > > > > > > > > > require virtio
> > > > > > > > > > > > > > > > > > > involvement.
> > > > > > > > > > > > > > > > > > > > So virtio interface asking to trap
> > > > > > > > > > > > > > > > > > > > something because generic facility
> > > > > > > > > > > > > > > > > > > > has done
> > > > > > > > > > > > > > > > > > > in not the approach.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > This misses the point of PASID. How
> > > > > > > > > > > > > > > > > > > to use PASID is totally device
> > > > > > > > > > > > > > > specific.
> > > > > > > > > > > > > > > > > > Sure, and how to virtualize
> > > > > > > > > > > > > > > > > > vPASID/pPASID is platform specific as
> > > > > > > > > > > > > > > > > > single PASID
> > > > > > > > > > > > > > > > > can be used by multiple devices and process.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > See above, I think we're talking about different
> things.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Capabilities of #2 is generic
> > > > > > > > > > > > > > > > > > > > > > across all pci devices, so it
> > > > > > > > > > > > > > > > > > > > > > will be handled by the
> > > > > > > > > > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > > > > > > > > > ATS/PRI cap is also generic
> > > > > > > > > > > > > > > > > > > > > > manner handled by the HV and
> > > > > > > > > > > > > > > > > > > > > > PCI
> > > > > > > > > > > > > > > device.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > No, ATS/PRI requires the
> > > > > > > > > > > > > > > > > > > > > cooperation from the
> > > > > > > vIOMMU.
> > > > > > > > > > > > > > > > > > > > > You can simply do ATS/PRI
> > > > > > > > > > > > > > > > > > > > > passthrough but with an emulated
> > > > > > > > > > > > > vIOMMU.
> > > > > > > > > > > > > > > > > > > > And that is not the reason for
> > > > > > > > > > > > > > > > > > > > virtio device to build
> > > > > > > > > > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > > > > > > > > > passthrough member devices.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > vIOMMU is emulated by hypervisor
> > > > > > > > > > > > > > > > > > > with a PRI queue,
> > > > > > > > > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Shouldn't it arrive at platform IOMMU first?
> > > > > > > > > > > > > > > > > The path should be PRI
> > > > > > > > > > > > > > > > > -> RC -> IOMMU -> host -> Hypervisor ->
> > > > > > > > > > > > > > > > > -> vIOMMU PRI
> > > > > > > > > > > > > > > > > -> -> guest
> > > > > > > > > > > IOMMU.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Above sequence seems right.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > And things will be more complicated when
> > > > > > > > > > > > > > > > > (v)PASID is
> > > > > used.
> > > > > > > > > > > > > > > > > So you can't simply let PRI go directly
> > > > > > > > > > > > > > > > > to the guest with the current
> > > > > > > > > > > > > architecture.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In current architecture of the pci VF, PRI
> > > > > > > > > > > > > > > > does not go directly to the
> > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > (and that is not reason to trap and emulate other
> things).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and
> > > > > > > > > > > > > > > we will probably trap other things in the
> > > > > > > > > > > > > > > future like PASID
> > > > > assignment.
> > > > > > > > > > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's not about the capability, it's about the
> > > > > > > > > > > > > whole process of PRI request handling. We've
> > > > > > > > > > > > > agreed that the PRI request needs to be trapped
> > > > > > > > > > > > > by the hypervisor and then delivered to the
> > > > > > > vIOMMU.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > Trap+emulation done in generic manner without
> > > > > > > > > > > > > > Trap+involving virtio or other
> > > > > > > > > > > > > device types.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > how can you pass through a hardware
> > > > > > > > > > > > > > > > > > > PRI request to a guest directly
> > > > > > > > > > > > > > > > > > > without trapping it
> > > > > > > > > > > > > > > then?
> > > > > > > > > > > > > > > > > > > What's more, PCIE allows the PRI to
> > > > > > > > > > > > > > > > > > > be done in a vendor
> > > > > > > > > > > > > > > > > > > (virtio) specific way, so you want to break this
> rule?
> > > > > > > > > > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > > > > > > > > > for virtio?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > > > > > > > > > Do you have a reference to the ECN
> > > > > > > > > > > > > > > > > > that enables vendor specific way of
> > > > > > > > > > > > > > > > > > PRI? I
> > > > > > > > > > > > > > > > > would like to read it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I mean it doesn't forbid us to build a
> > > > > > > > > > > > > > > > > virtio specific interface for I/O page
> > > > > > > > > > > > > > > > > fault report and
> > > recovery.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > So PRI of PCI does not allow. It is ODP
> > > > > > > > > > > > > > > > kind of technique you meant
> > > > > > > > > > > above.
> > > > > > > > > > > > > > > > Yes one can build.
> > > > > > > > > > > > > > > > Ok. unrelated to device migration, so I
> > > > > > > > > > > > > > > > will park this good discussion for
> > > > > > > > > > > > > later.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's fine.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This will be very good to eliminate
> > > > > > > > > > > > > > > > > > IOMMU PRI
> > > > > limitations.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Probably.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > PRI will directly go to the guest
> > > > > > > > > > > > > > > > > > driver, and guest would interact with
> > > > > > > > > > > > > > > > > > IOMMU
> > > > > > > > > > > > > > > > > to service the paging request through IOMMU
> APIs.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > > > > > > > > > But again these PCI-SIG extensions of
> > > > > > > > > > > > > > > > PASID are not related to device
> > > > > > > > > > > > > > > > migration, so I am deferring it.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > For PRI in vendor specific way needs a
> > > > > > > > > > > > > > > > > > separate discussion. It is not related
> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > live migration.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > PRI itself is not related. But the point
> > > > > > > > > > > > > > > > > is, you can't simply pass through ATS/PRI now.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Ah ok. the whole 4K PCI config space where
> > > > > > > > > > > > > > > > ATS/PRI capabilities are located
> > > > > > > > > > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > > > > > > > > > So?
> > > > > > > > > > > > > > > > > So do we start emulating virtio interfaces
> > > > > > > > > > > > > > > > too for
> > > > > passthrough?
> > > > > > > > > > > > > > > > No.
> > > > > > > > > > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > > > > > > > > > Sure why not?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Then let's not limit your proposal to be
> > > > > > > > > > > > > > > used by
> > > "passthrough"
> > > > > > > only?
> > > > > > > > > > > > > > One can possibly build some variant of the
> > > > > > > > > > > > > > existing virtio member device
> > > > > > > > > > > > > using same owner and member scheme.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's not about the member/owner, it's about e.g
> > > > > > > > > > > > > whether the hypervisor can trap and emulate.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've pointed out that what you invent here is
> > > > > > > > > > > > > actually a partial new transport, for example, a
> > > > > > > > > > > > > hypervisor can trap and use things like device
> > > > > > > > > > > > > context in PF to bypass the registers in VF.
> > > > > > > > > > > > > This is the idea of
> > > > > > > > > > > transport commands/q.
> > > > > > > > > > > > >
> > > > > > > > > > > > I will not mix transport commands which are mainly
> > > > > > > > > > > > useful for actual device
> > > > > > > > > > > operation for SIOV only for backward compatibility
> > > > > > > > > > > that too
> > > > > optionally.
> > > > > > > > > > > > One may still choose to have virtio common and
> > > > > > > > > > > > device config in MMIO
> > > > > > > > > > > > of course at lower scale.
> > > > > > > > > > > >
> > > > > > > > > > > > Anyway, mixing migration context with actual SIOV
> > > > > > > > > > > > specific thing is not correct
> > > > > > > > > > > as device context is read/write incremental values.
> > > > > > > > > > >
> > > > > > > > > > > SIOV is transport level stuff, the transport
> > > > > > > > > > > virtqueue is designed in a way that is general enough to cover
> it.
> > > > > > > > > > > Let's not shift
> > > > > > > concepts.
> > > > > > > > > > >
> > > > > > > > > > Such TVQ is only for backward compatible vPCI composition.
> > > > > > > > > > For ground up work such TVQ must not be done through
> > > > > > > > > > the owner
> > > > > > > device.
> > > > > > > > >
> > > > > > > > > That's the idea actually.
> > > > > > > > >
> > > > > > > > > > Each SIOV device to have its own channel to
> > > > > > > > > > communicate directly to the
> > > > > > > > > device.
> > > > > > > > > >
> > > > > > > > > > > One thing that you ignore is that, hypervisor can
> > > > > > > > > > > use what you invented as a transport for VF, no?
> > > > > > > > > > >
> > > > > > > > > > No. by design,
> > > > > > > > >
> > > > > > > > > > It works like hypervisor traps the virtio config and
> > > > > > > > > forwards it to admin virtqueue and starts the device via
> > > > > > > > > device
> > > context.
> > > > > > > > It needs more granular support than the management
> > > > > > > > framework of device
> > > > > > > context.
> > > > > > >
> > > > > > > It doesn't otherwise it is a design defect as you can't
> > > > > > > recover the device context in the destination.
> > > > > > >
> > > > > > > Let me give you an example:
> > > > > > >
> > > > > > > 1) in the case of live migration, dst receive migration byte
> > > > > > > flows and convert them into device context
> > > > > > > 2) in the case of transporting, hypervisor traps virtio
> > > > > > > config and convert them into the device context
> > > > > > >
> > > > > > > I don't see anything different in this case. Or can you give
> > > > > > > me an
> > > example?
> > > > > > In #1 dst received byte flows one or multiple times.
> > > > >
> > > > > How can this be different?
> > > > >
> > > > > Transport can also receive initial state incrementally.
> > > > >
> > > > Transport is just simple register RW interface without any caching
> > > > layer in-
> > > between.
> > > > More below.
> > > > > > And byte flows can be large.
> > > > >
> > > > > So when doing transport, it is not that large, that's it. If it
> > > > > can work with large byte flow, why can't it work for small?
> > > > > Write context can be used (abused) for a different purpose.
> > > > Read cannot because it is meant to be incremental.
> > >
> > > Well hypervisor can just cache what it reads since the last, what's wrong
> with it?
> > >
> > But hypervisor does not know what changed, so it has to do guesswork to
> find out what to query.
> >
> > > > One can invent a cheap command to read it.
> > >
> > > For sure, but it's not the context here.
> > >
> > It is.
> > > >
> > > >
> > > > >
> > > > > > So it does not always contain everything. It only contains the
> > > > > > new delta of the
> > > > > device context.
> > > > >
> > > > > Isn't it just how current PCI transport does?
> > > > >
> > > > No. PCI transport has explicit API between device and driver to
> > > > read or write
> > > at specific offset and value.
> > >
> > > The point is that they are functional equivalents.
> > >
> > I disagree.
> > There are two different functionalities.
> >
> > Functionality_1: explicit ask for read or write
> > Functionality_2: read what has changed
> 
> This needs to be justified. I won't repeat the questions again here.
> 
The use case is already explained in the theory of operation.
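
For illustration only, the two functionalities can be contrasted with a small toy model. The class, method, and field names below are hypothetical and are not taken from the spec or from this series; the model only assumes what is stated in this thread, namely that an explicit access always returns the requested value, while a device context read returns only what changed since the previous read.

```python
class MemberDeviceModel:
    """Toy model of a member device; every name here is hypothetical."""

    def __init__(self):
        # Device context modelled as named fields (stand-ins for TLVs).
        self.fields = {"vq_size": 256, "avail_idx": 0, "rss": "key-a"}
        # Fields modified since the last incremental read.
        self.dirty = set(self.fields)

    def write_field(self, name, value):
        """Update a field; mark it dirty only if the value really changed."""
        if self.fields.get(name) != value:
            self.fields[name] = value
            self.dirty.add(name)

    def read_field(self, name):
        """Functionality_1: explicit read, always returns the current value."""
        return self.fields[name]

    def read_context_incremental(self):
        """Functionality_2: return only what changed since the previous read."""
        delta = {name: self.fields[name] for name in sorted(self.dirty)}
        self.dirty.clear()
        return delta


dev = MemberDeviceModel()
first = dev.read_context_incremental()   # full context: nothing was read before
second = dev.read_context_incremental()  # empty: nothing changed in between
dev.write_field("avail_idx", 5)
third = dev.read_context_incremental()   # only the updated index
rss = dev.read_field("rss")              # explicit read works regardless of change
```

In this model the incremental read is keyed by value rather than by address: rewriting a field with the same value does not make it reappear in the next read.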

> >
> > Should one merge 1 and 2 and complicate the command?
> > I prefer not to.
> 
> Again there're functional duplications. E.g your command duplicates
> common_cfg for sure.
No, it does not.
Common cfg is accessed directly by the guest member driver.

> 
> >
> > Now having two different commands help for debugging to differentiate
> > between mgmt. commands and guest initiated commands. :)
> >
> > > >
> > > > > Guest configure the following one by one:
> > > > >
> > > > > 1) vq size
> > > > > 2) vq addresses
> > > > > 3) MSI-X
> > > > >
> > > > > etc?
> > > > >
> > > > I think you interpreted "incremental" differently than I described.
> > > > In the device context read, the incremental is:
> > > >
> > > > If the hypervisor driver has read the device context twice, the
> > > > second read
> > > won't return any new data if nothing changed.
> > >
> > > See above.
> > >
> > Yeah, two separate commands needed.
> >
> > > > For example, if RSS configuration didn’t change between two reads,
> > > > the
> > > second read wont return the TLV for RSS Context.
> > > >
> > > > While for transport the need is, when guest asked, one device must
> > > > read it
> > > regardless of the change.
> > > >
> > > > So notion of incremental is not by address, but by the value.
> > > >
> > > > > > For example, VQ configuration is exchanged once between src and
> dst.
> > > > > > But VQ avail and used index may be updated multiple times.
> > > > >
> > > > > If it can work with multiple times of updating, why can't it
> > > > > work if we just update it once?
> > > > Functionally it can work.
> > >
> > > I think you answer yourself.
> > >
> > Yes, I don’t like abuse of command.
> 
> How did you define abuse or can spec ever need to define that?
I don’t have any definition of abuse other than the dictionary one. :)

> 
> >
> > > > Performance wise, one does not want to update multiple times,
> > > > unless there
> > > is a change.
> > > >
> > > > Read as explained above is not meant to return same content again.
> > > >
> > > > >
> > > > > > So here hypervisor do not want to read any specific set of
> > > > > > fields and
> > > > > hypervisor is not parsing them either.
> > > > > > It is just a byte stream for it.
> > > > >
> > > > > Firstly, spec must define the device context format, so
> > > > > hypervisor can understand which byte is what otherwise you can't
> > > > > maintain migration compatibility.
> > > > Device context is defined already in the latest version.
> > > >
> > > > > Secondly, you can't mandate how the hypervisor is written.
> > > > >
> > > > > >
> > > > > > As opposed to that, in case of transport, the guest explicitly
> > > > > > asks to read or
> > > > > write specific bytes.
> > > > > > Therefore, it is not incremental.
> > > > >
> > > > > I'm totally lost. Which part of the transport is not incremental?
> > > > >
> > > > > >
> > > > > > Additionally, if hypervisor has put the trap on virtio config,
> > > > > > and because the memory device already has the interface for
> > > > > > virtio config,
> > > > > >
> > > > > > Hypervisor can directly write/read from the virtual config to
> > > > > > the member's
> > > > > config space, without going through the device context, right?
> > > > >
> > > > > If it can do it or it can choose to not. I don't see how it is
> > > > > related to the discussion here.
> > > > >
> > > > It is. I don’t see a point of hypervisor not using the native
> > > > interface provided
> > > by the member device.
> > >
> > > It really depends on the case, and I see how it duplicates with the
> > > functionality that is provided by both:
> > >
> > > 1) The existing PCI transport
> > >
> > > or
> > >
> > > 2) The transport virtqueue
> > >
> > I would like to conclude that we disagree in our approaches.
> > PCI transport is for member device to directly communicate from guest
> driver to the device.
> > This is uniform across PF, VFs, SIOV.
> 
> For "PCI transport" did you mean the one defined in spec? If yes, how can it
> work with SIOV with what you're saying here (a direct communication
> channel)?
> 
SIOV device may have same MMIO as VF.

> >
> > Admin commands are transport independent and their task is device
> migration.
> > One is not replacing the other.
> >
> > Transport virtqueue will never transport driver notifications, hence it does
> not qualify as "transport".
> 
> Another double standard.
I disagree. You coined the term transport vq, so stand behind it to transport everything.

> 
> MMIO will never transport device notification, hence it does not qualify as
> "transport"?
> 
How do interrupts work?
That seems like basic functionality missing from a transport.

> >
> > For the vdpa case, there is no need for extra admin commands as the
> mediation layer can directly use the interface available from the member
> device itself.
> >
> > You continue to want to overload admin commands for dual purpose, does
> not make sense to me.
> >
> > > >
> > > >  > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > it is not good idea to overload management commands
> > > > > > > > > > with actual run time
> > > > > > > > > guest commands.
> > > > > > > > > > The device context read writes are largely for incremental
> updates.
> > > > > > > > >
> > > > > > > > > It doesn't matter if it is incremental or not but
> > > > > > > > >
> > > > > > > > It does because you want different functionality only for
> > > > > > > > purpose of backward
> > > > > > > compatibility.
> > > > > > > > That also if the device does not offer them as portion of MMIO
> BAR.
> > > > > > >
> > > > > > > I don't see how it is related to the "incremental part".
> > > > > > >
> > > > > > > >
> > > > > > > > > 1) the function is there
> > > > > > > > > 2) hypervisor can use that function if they want and
> > > > > > > > > virtio
> > > > > > > > > (spec) can't forbid that
> > > > > > > > >
> > > > > > > > It is not about forbidding or supporting.
> > > > > > > > Its about what functionality to use for management plane
> > > > > > > > and guest
> > > > > plane.
> > > > > > > > Both have different needs.
> > > > > > >
> > > > > > > People can have different views, there's nothing we can
> > > > > > > prevent a hypervisor from using it as a transport as far as I can see.
> > > > > > For device context write command, it can be used (or probably
> > > > > > abused) to do
> > > > > write but I fail to see why to use it.
> > > > >
> > > > > The function is there, you can't prevent people from doing that.
> > > > >
> > > > One can always mess up itself. :)
> > > > It is not prevented. It is just not right way to use the interface.
> > > >
> > > > > > Because member device already has the interface to do config
> > > > > > read/write and
> > > > > it is accessible to the hypervisor.
> > > > >
> > > > > Well, it looks self-contradictory again. Are you saying another
> > > > > set of commands that is similar to device context is needed for
> > > > > non-PCI
> > > transport?
> > > > >
> > > > All this non-PCI transport discussion is just meaningless.
> > > > Let MMIO bring the concept of member device at that point
> > > > something make
> > > sense to discuss.
> > >
> > > It's not necessarily MMIO. For example the SIOV, which I don't think
> > > can use the existing PCI transport.
> > >
> > > > PCI SIOV is also the PCI device at the end.
> > >
> > > We don't want to end up with two sets of commands to save/load SRIOV
> > > and SIOV at least.
> > >
> > This proposal ensures that SRIOV and SIOV devices are treated equally.
> 
> How? Did you mean your proposal can work for SIOV? What's the transport
> then?
Yes. The majority of the device context should work for a SIOV device as is.
The member id would be different.
Some device context TLVs may be new, as SIOV may bring some simplifications; it may not have a giant register space like the current one.

> 
> > How brand new non-compatible SIOV device to transport this, is outside of
> the scope of this work.
> 
> You invented one that can be used for doing this. If you disagree, how can we
> know your proposal can work for SIOV without a transport then?

I don’t understand your comment.

All I am saying is that most pieces of the device context are reusable across VFs and SIOV devices.
When SIOV is defined, we can revisit what may need to be added.



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-21 16:25                                                                                                 ` Parav Pandit
@ 2023-11-22  4:13                                                                                                   ` Jason Wang
  2023-11-22  7:48                                                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-22  4:13 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 22, 2023 at 12:25 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 9:53 AM
> > On Thu, Nov 16, 2023 at 2:34 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Thursday, November 16, 2023 11:53 AM
> > > >
> > > > On Thu, Nov 16, 2023 at 05:28:19AM +0000, Parav Pandit wrote:
> > > > > You continue to want to overload admin commands for dual purpose,
> > > > > does
> > > > not make sense to me.
> > > >
> > > > dual -> as a transport and for migration? why can't they be used for
> > > > this? I was really hoping to cover these two cases when I proposed them.
> > > For following reasons.
> > >
> > > 1. migration needs incremental reads of only changed context between
> > > two reads
> >
> > This is wrong. We need to invent general facilities. I've pointed out sufficient
> > issues, and what's more delta doesn't work for the following cases:
> >
> I disagree to your above comments. Please read below.
> It works.
>
> > 1) migration but fail the another try for migration
>
> When hypervisor wants to retry the migration, it invokes the DISCARD command present in v4.

Well, as I explained for your first version, your design introduces
unnecessary complexity:

1) there's no requirement to live migrate the device state, that is,
one or two rounds at most are sufficient
2) there isn't much data that needs to be migrated

So it doesn't behave like RAM; having things like DISCARD might
complicate the implementation of both software and hardware.

> And starts reading the device context again.
>
> > 2) save vm state twice
> >
> This can be saved twice when needed.
>
> > It requires a lot of tricks in hypervisor to do that (e.g cache the last state?).
> Not really.
> Hypervisor can always discard what it read and re-read it again as fresh device context.

Why not leave the policy in the software?
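
For reference, the retry flow being debated can be sketched as a toy model. It is only an illustration: `discard()` stands in for the DISCARD command mentioned above, and its behavior here (the next read returns the full context again) is assumed from the description in this thread, not from the spec text. All class and function names are hypothetical.

```python
class ContextSource:
    """Toy source-side device with incremental context reads (hypothetical)."""

    def __init__(self, fields):
        self.fields = dict(fields)
        # Everything is unreported until the first read.
        self.dirty = set(self.fields)

    def read(self):
        """Return only the fields changed since the previous read."""
        delta = {name: self.fields[name] for name in sorted(self.dirty)}
        self.dirty.clear()
        return delta

    def discard(self):
        """Assumed DISCARD semantics: forget what was reported so far."""
        self.dirty = set(self.fields)


def save_context(src):
    """Read repeatedly until a read returns nothing new; merge the deltas."""
    saved = {}
    while True:
        delta = src.read()
        if not delta:
            return saved
        saved.update(delta)


src = ContextSource({"vq_size": 256, "avail_idx": 7})
attempt1 = save_context(src)         # first migration attempt, full context
without_discard = save_context(src)  # a second save sees no delta at all
src.discard()
attempt2 = save_context(src)         # after discard, the full context again
```

The sketch shows both sides of the argument: without a discard (or a cache in the hypervisor) a second save returns nothing, and with it the device has to track and reset what it has already reported.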

>
> >
> > >
> > > 2. migration writes covers large part of the configurations not just virtio
> > common config and device config.
> >
> > You invent a duplication of common_cfg structure no?
> >
> Nop.
> Common configuration is written using a MMIO, byte/word etc boundary by the guest directly in guest owned area.

We are discussing the function, no?

>
> > What's wrong if we just allow them to be R/W over adminq/cmmands?
> >
> As explained before,
> Each guest has its own dedicated non mediated interface as defined in virtio spec to not involve hypervisor.

So what's wrong with inventing a per-VF queue to do that? For example,
the transport virtqueue.

> This is uniform for PF and SR-IOV VFs.
> And it will be uniform for backward compatible SIOV tomorrow, once the performance numbers for SIOV are available.
> For non-backward compatible SIOV, there is better way to not have such a large config space anyway.
>
> > > Such as configuration occurred through the CVQ. All of these is not needed
> > when done from guest directly via member's own CVQ.
> >
> > That's the device type specific state which requires new commands for sure. I
> > don't see any connection. The SIOV device needs to be migrated as well.
> >
> And they will use all majority of the device context as presented here.
>
> > >
> > > For backward compatible SIOV transport, one may need them to transport
> > without above two properties.
> >
> > Why, just mediate between virtual PCI and adminq.
> >
> I don’t understand this.

I mean, in order to run a "legacy" guest that can only recognize
virtio-pci, a hypervisor can mediate between it and the transport.
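
A minimal sketch of that mediation, for illustration only. The class names, the `member_id` parameter, and the register offset are hypothetical; the channel below merely stands in for whatever transport (for example a transport virtqueue) carries the access and is not an interface defined by the spec.

```python
class RecordingTransport:
    """Stand-in for a transport channel to the device (hypothetical)."""

    def __init__(self):
        self.regs = {}
        self.log = []

    def write(self, member_id, offset, value):
        self.log.append(("write", member_id, offset, value))
        self.regs[(member_id, offset)] = value

    def read(self, member_id, offset):
        self.log.append(("read", member_id, offset))
        return self.regs.get((member_id, offset), 0)


class VirtualPciFrontend:
    """What the legacy guest sees: trapped virtio-pci config accesses."""

    def __init__(self, transport, member_id):
        self.transport = transport
        self.member_id = member_id

    def on_guest_write(self, offset, value):
        # The hypervisor traps the guest write and forwards it as a command.
        self.transport.write(self.member_id, offset, value)

    def on_guest_read(self, offset):
        # The hypervisor traps the guest read and fetches the value on demand.
        return self.transport.read(self.member_id, offset)


transport = RecordingTransport()
vdev = VirtualPciFrontend(transport, member_id=3)
vdev.on_guest_write(0x18, 256)       # arbitrary offset, chosen for the example
value = vdev.on_guest_read(0x18)
```

Every guest access becomes one explicit command, which is the "explicit ask for read or write" style of interface rather than an incremental one.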

>
> > >
> > > 3. None of this transport is needed for PFs, VFs and non-backward
> > compatible SIOVs.
> > > Each device to have its own transport that is not intercepted by the
> > hypervisor and follow the equivalency principle uniformly for all 3 device types.
> >
> > You can have per VF transport q, what's wrong with that?
> As explained in the doc and in multiple emails, it is inefficient. CVQ is just enough.

Let me repeat: the transport q is not intended to replace CVQ. Where
did you see or get such a wrong conclusion in the transport q
series?

One of its goals is to provide a transport for cases where registers don't scale.

Thanks




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-21 16:27                                                                                                               ` Parav Pandit
@ 2023-11-22  4:16                                                                                                                 ` Jason Wang
  2023-11-22  4:39                                                                                                                   ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-22  4:16 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 22, 2023 at 12:27 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 10:15 AM
> >
> > On Fri, Nov 17, 2023 at 11:09 PM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > >
> > > On Fri, Nov 17, 2023 at 02:51:04PM +0000, Parav Pandit wrote:
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 7:24 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 12:46:21PM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 6:00 PM
> > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Friday, November 17, 2023 5:13 PM
> > > > > > > > >
> > > > > > > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > > > > > > > >
> > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit
> > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit
> > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Additionally, if hypervisor has put the trap
> > > > > > > > > > > > > > > > on virtio config, and because the memory
> > > > > > > > > > > > > > > > device already has the interface for virtio
> > > > > > > > > > > > > > > > config,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hypervisor can directly write/read from the
> > > > > > > > > > > > > > > > virtual config to the member's
> > > > > > > > > > > > > > > config space, without going through the device context,
> > right?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If it can do it or it can choose to not. I
> > > > > > > > > > > > > > > don't see how it is related to the discussion here.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > It is. I don’t see a point of hypervisor not
> > > > > > > > > > > > > > using the native interface provided
> > > > > > > > > > > > > by the member device.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So for example, it seems reasonable to a member
> > > > > > > > > > > > > supporting both existing pci register interface
> > > > > > > > > > > > > for compatibility and the future DMA based one for
> > > > > > > > > > > > > scale. In such a case, it seems possible that DMA
> > > > > > > > > > > > > will expose more features than pci. And then a
> > > > > > > > > > > > > hypervisor might decide to use
> > > > > > > > > > > that in preference to pci registers.
> > > > > > > > > > > >
> > > > > > > > > > > > We don’t find it right to involve owner device for
> > > > > > > > > > > > mediating at current scale
> > > > > > > > > > >
> > > > > > > > > > > In this model, device will be its own owner. Should not be a
> > problem.
> > > > > > > > > > >
> > > > > > > > > > I didn’t understand above comment.
> > > > > > > > >
> > > > > > > > > We'd add a new group type "self". You can then send admin
> > > > > > > > > commands through VF itself not through PF.
> > > > > > > > >
> > > > > > > > How? The device is owned by the guest. FLR and device reset
> > > > > > > > cannot send the
> > > > > > > admin command reliably.
> > > > > > >
> > > > > > > It's of the "it hurts when I do this - don't do this then" category.
> > > > > > >
> > > > > > > it is don’t do mediation category, yes due all this weirdness
> > > > > > that has been
> > > > > asked.
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > > and to not break TDISP efforts in upcoming time by such
> > design.
> > > > > > > > > > >
> > > > > > > > > > > Look you either stop mentioning TDISP as motivation or
> > > > > > > > > > > actually try to address it. Safe migration with TDISP is really
> > hard.
> > > > > > > > > > But that is not an excuse to say that TDISP migration is
> > > > > > > > > > not present, hence
> > > > > > > > > involve the owner device for config space access.
> > > > > > > > > > This is another hurdle added that further blocks us away from
> > TDISP.
> > > > > > > > > > Hence, we don’t want to take the route of involving
> > > > > > > > > > owner device for any
> > > > > > > > > config access.
> > > > > > > > >
> > > > > > > > > This "blocks" is all just wild hunches. hypervisor
> > > > > > > > > controls some aspects of TDISP devices for sure - maybe we
> > > > > > > > > actually should use pci config space as that is generally hypervisor
> > controlled.
> > > > > > > > Even bad to do hypercalls.
> > > > > > > > I showed you last time the role of the PCI config space
> > > > > > > > snippet from the
> > > > > spec.
> > > > > > >
> > > > > > > Yes I remember. This is just an example though. My point is
> > > > > > > maybe it is solvable maybe it is not.
> > > > > > >
> > > > > > > > Do you see we are repeating the discussion again?
> > > > > > >
> > > > > > > One of the reasons is that people bring up irrelevances. TDISP
> > > > > > > is important but has to be addressed or deferred not vaguely referred
> > to.
> > > > > >
> > > > > > So lets continue to follow the current TDISP direction of not
> > > > > > involving
> > > > > hypervisor for virtio common and device config.
> > > >
> > > > If you disagree to it, please speak now, so that we don’t debate on this
> > again in next 3 days.
> > > > Because this is the fundamental design considerations it relied on.
> > > > There is no point going forward if you want to disagree to it.
> > > > Other variants are fine, but other variants cannot be the only choice.
> > > >
> > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > For example, your current patches are clearly broken for TDISP:
> > > > > > > > > > > owner can control queue state at any time making
> > > > > > > > > > > device modify memory in any way it wants.
> > > > > > > > > > >
> > > > > > > > > > When TDISP migration is needed, the admin device can be
> > > > > > > > > > another TVM
> > > > > > > > > outside the HV scope.
> > > > > > > > > > Or an alternative would have device context encrypted
> > > > > > > > > > not visible to HV at
> > > > > > > all.
> > > > > > > > >
> > > > > > > > > Maybe. Fact remains your patches do conflict with TDISP
> > > > > > > > > and you seem to be fine with it because you have a hunch you can
> > fix it.
> > > > > > > > > But we can't do development based on your hunches.
> > > > > > > > >
> > > > > > > > We have different view.
> > > > > > > > My patches do not conflict with TDISP because TDISP has
> > > > > > > > clear definition of
> > > > > > > not involving hypervisor for transport.
> > > > > > > > And that part is still preserved.
> > > > > > > > Delegating the migration to another TDISP or encrypting is
> > > > > > > > yet to be
> > > > > defined.
> > > > > > > > And current patches will align to both the approaches in future.
> > > > > > > >
> > > > > > > > So you need to re-evaluate your judgment.
> > > > > > >
> > > > > > > If you like they do not "conflict".  But if used with TDISP
> > > > > > > they just make it insecure and thus completely worthless.  If
> > > > > > > hypervisor can change ring state to make device poke at random
> > > > > > > guest memory then it is game over and all the effort spent was
> > security theater.
> > > > > > Not really, I proposed two options.
> > > > > > 1. delegate the task of LM to the TVM. (proposed by two cpu vendors).
> > > > > > In this case all the infra we build here, just works fine.
> > > > >
> > > > > I think modification will be needed: currently commands are sent
> > > > > through the PF, and that is under hypervisor control.
> > > > > You should not assign PF to TVM.
> >
> > That's the point. And that's why it keeps people confused to believe the
> > current PF/adminq can work in the TDISP.
> >
> There is no confusion.

No, when LingShan pointed out the conflict, you just told us it would be
addressed in the future.

And after Michael pointed it out again, you agreed that the adminq cannot
be part of the PF in this context.

And you miss the fact that the admin virtqueue today can't be used to
manage the owner.

> The admin queue interface ensures first step that TDISP interface is dedicated to guest as today.
> There is no bifurcation added on the VF that needs extra mediation.
>
> > > > Yes, an admin virtio function will be there which will do the admin
> > commands listed.
> > >
> > > So it can't be PF, so at least we need a new group type.
> > > I am inclined to then say, operate it through VF itself.
> >
> > So it exactly matches the idea of transport virtqueue (a per VF/SF one).
> >
> There is no need for transport virtqueue for VF as VF device has same uniform principle as PF.

I don't understand this. I have explained that you have invented a
functional duplicate of the transport virtqueue.

> If you want transport vq, please have it on the PF too.

Nothing prevents this; actually, the transport virtqueue started from this.

> And that also is not needed because there is already CVQ.

I don't see why you keep mentioning CVQ. I don't see anyone that says
transport virtqueue is going to replace CVQ.

>
> > But it still requires a PCI part to bootstrap.
> >
> > >
> > >
> > > > >
> > > > > > It also does not require any hypervisor mediation for control plane.
> > > > > >
> > > > > > 2. Encrypt the owner device workload to be not seen by
> > > > > > hypervisor
> > > > > >
> > > > > > Both methods does not affect the current direction.
> > > > > >
> > > > > > But if we force trap+emulation, it is 100% broken for TDISP.
> > > > > > And I would not promote that.
> > > > > >
> > > > > > > But you know this, don't you? This is why you mentioned encrypting
> > device.
> > > > > > > Maybe that works. It just does not work *as is*.
> > > > > > It works as_is. But current infrastructure does not block the future
> > work.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > Such encryption is not possible, with the trap+emulation
> > > > > > > > > > method, where HV
> > > > > > > > > will have to decrypt the data coming over MMIO writes.
> > > > > > > > >
> > > > > > > > > I don't know what trap+emulation has to do with it. Do you
> > > > > > > > > refer to the shadow vq thing?
> > > > > > > >
> > > > > > > > The method proposed here does not hinder any TDISP direction.
> > > > > > >
> > > > > > > direction? No, why would it. we can always add more commands
> > > > > > > that are safe for TDISP. commands you propose here are unsafe for
> > TDISP.
> > > > > > >
> > > > > > > > Without my proposal, do you have a method that does not
> > > > > > > > involve hypervisor
> > > > > > > intervention for virtio common and device config space, cvq
> > > > > > > and shadow
> > > > > vq?
> > > > > > > > If so, I would like to hear that as well because that will align with
> > TDISP.
> > > > > > >
> > > > > > > I really did not give it much thought.  I suspect for TDISP it
> > > > > > > just might be cleaner to have guest agent migrate device.
> > > > > > > Certainly removes all
> > > > > the messy questions.
> > > > > > > That, to me, implies there needs to be a way to send migration
> > > > > > > commands through VF itself. Does this "involve hypervisor
> > > > > > > intervention"? No one should care I think.
> > > > > > Too far of the future to envision. May be yes. When such
> > > > > > platform is built, for sure whoever migrates need migrate its device side
> > too.
> > > > > > Some knowledge of migration driver is needed.
> > > > >
> > > > > So TDISP migration is so far in the future you do not need to bother
> > about it.
> > > > > Fine. Then don't bring it up pls.
> > > > >
> > > > As long as we are aligned to the requirement that a virtio member device is
> > mapped to the guest VM without mediating the virtio interface, I am good.
> > > > Again, other variants are fine, but above listed mapped variant is the
> > minimum variant needed.
> > >
> > > I think it's worth supporting this. I wouldn't call this minimum there
> > > are other approaches.  And I am not so sure it's worth trying to
> > > support this in all kind of systems such as IOMMU without dirty bit
> > > support. If some old systems will need mediation, this is kind of like
> > > legacy interface. Not a big deal.
> > >
> >
> > +1
>
> There are users with the recent cpus that may not have the IOMMU dirty page tracking support.
> So I don’t fully agree.

There are setups that don't have SR-IOV or even PCI.

Let's have a unified standard.

Thanks

>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-21 16:30                                                                                                     ` Parav Pandit
@ 2023-11-22  4:18                                                                                                       ` Jason Wang
  2023-11-22  4:26                                                                                                         ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-22  4:18 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 22, 2023 at 12:30 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 10:55 AM
> >
> > On Fri, Nov 17, 2023 at 8:03 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 5:13 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > >
> > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit wrote:
> > > > > > > > > > >
> > > > > > > > > > > Additionally, if hypervisor has put the trap on virtio
> > > > > > > > > > > config, and because the memory device already has the
> > > > > > > > > > > interface for virtio config,
> > > > > > > > > > >
> > > > > > > > > > > Hypervisor can directly write/read from the virtual
> > > > > > > > > > > config to the member's
> > > > > > > > > > config space, without going through the device context, right?
> > > > > > > > > >
> > > > > > > > > > If it can do it or it can choose to not. I don't see how
> > > > > > > > > > it is related to the discussion here.
> > > > > > > > > >
> > > > > > > > > It is. I don’t see a point of hypervisor not using the
> > > > > > > > > native interface provided
> > > > > > > > by the member device.
> > > > > > > >
> > > > > > > > So for example, it seems reasonable to a member supporting
> > > > > > > > both existing pci register interface for compatibility and
> > > > > > > > the future DMA based one for scale. In such a case, it seems
> > > > > > > > possible that DMA will expose more features than pci. And
> > > > > > > > then a hypervisor might decide to use
> > > > > > that in preference to pci registers.
> > > > > > >
> > > > > > > We don’t find it right to involve owner device for mediating
> > > > > > > at current scale
> > > > > >
> > > > > > In this model, device will be its own owner. Should not be a problem.
> > > > > >
> > > > > I didn’t understand above comment.
> > > >
> > > > We'd add a new group type "self". You can then send admin commands
> > > > through VF itself not through PF.
> > > >
> > > How? The device is owned by the guest. FLR and device reset cannot send
> > the admin command reliably.
> > >
> > > >
> > > > > > > and to not break TDISP efforts in upcoming time by such design.
> > > > > >
> > > > > > Look you either stop mentioning TDISP as motivation or actually
> > > > > > try to address it. Safe migration with TDISP is really hard.
> > > > > But that is not an excuse to say that TDISP migration is not
> > > > > present, hence
> > > > involve the owner device for config space access.
> > > > > This is another hurdle added that further blocks us away from TDISP.
> > > > > Hence, we don’t want to take the route of involving owner device
> > > > > for any
> > > > config access.
> > > >
> > > > This "blocks" is all just wild hunches. hypervisor controls some
> > > > aspects of TDISP devices for sure - maybe we actually should use pci
> > > > config space as that is generally hypervisor controlled.
> > > Even bad to do hypercalls.
> > > I showed you last time the role of the PCI config space snippet from the
> > spec.
> > > Do you see we are repeating the discussion again?
> > >
> > > >
> > > > > > For example, your current patches are clearly broken for TDISP:
> > > > > > owner can control queue state at any time making device modify
> > > > > > memory in any way it wants.
> > > > > >
> > > > > When TDISP migration is needed, the admin device can be another
> > > > > TVM
> > > > outside the HV scope.
> > > > > Or an alternative would have device context encrypted not visible to HV
> > at all.
> > > >
> > > > Maybe. Fact remains your patches do conflict with TDISP and you seem
> > > > to be fine with it because you have a hunch you can fix it. But we
> > > > can't do development based on your hunches.
> > > >
> > > We have different view.
> > > My patches do not conflict with TDISP because TDISP has clear definition of
> > not involving hypervisor for transport.
> > > And that part is still preserved.
> > > Delegating the migration to another TDISP or encrypting is yet to be defined.
> > > And current patches will align to both the approaches in future.
> > >
> > > So you need to re-evaluate your judgment.
> > >
> > > >
> > > > > Such encryption is not possible, with the trap+emulation method,
> > > > > where HV
> > > > will have to decrypt the data coming over MMIO writes.
> > > >
> > > > I don't know what trap+emulation has to do with it. Do you refer to
> > > > the shadow vq thing?
> > >
> > > The method proposed here does not hinder any TDISP direction.
> > >
> > > Without my proposal, do you have a method that does not involve
> > hypervisor intervention for virtio common and device config space, cvq and
> > shadow vq?
> > > If so, I would like to hear that as well because that will align with TDISP.
> >
> > So this is what you said:
> >
> > 1) TDISP would not do mediation
> > 2) registers don't scale
> >
> > This is exactly what transport virtqueue did. Isn't it?
> >
> No.
> CVQ is doing both of them currently uniformly across PF, VF.
> Future SIOV will be able to do this also.

Please explain how CVQ is related. Or how transport virtqueue blocks
CVQ in any sense. Or any transport virtqueue commands do that.

Then let's continue the discussion here.

>
> > >
> > > > I am guessing modern platforms with TDISP support are likely to also
> > > > support dirty bit in the IOMMU.
> > > >
> > > It will be some day.
> >
> > Dirty bit is far more realistic than TDISP in the short term.
> >
> > >
> > > >
> > > > > > > And for future scale, having new SIOV interface makes more
> > > > > > > sense which has
> > > > > > its own direct interface to device.
> > > > > > >
> > > > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > > > >
> > > > > > > [1]
> > > > > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > > > > >
> > > > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > > > discussion will take place on the mailing list please.
> > > > >
> > > > > We cannot keep discussing the register interface every week.
> > > > > I remember we have discussed this many times already in following
> > series.
> > > > >
> > > > > 1. legacy series
> >
> > How can this be supported in TDISP then?

Please answer this question.

> >
> > > > > 2. tvq v4 series
> > > > > 3. dynamic vq creation series
> > > > > 4. again during suspend series under tvq head
> > > > > 5. right now
> > > > > 6. May be more that I forgot.
> > > > >
> > > > > I captured all the direction and options in the doc. One can refer
> > > > > when those
> > > > questions arise there.
> > > > > If we don’t work cohesively same reasoning repetition does not help.
> > > >
> > > > It's still the same too, doc or no doc. You want to build a device
> > > > without registers fine but don't force it down everyone's throat.
> > > I don’t see any compelling reason for inventing new method really.
> >
> > New requests/platforms come for sure, and virtio supports various transports.
> >
> > For example, there's a request to support PCI endpoint devices.
> What is PCI endpoint devices? Does it have a new device type?

https://docs.kernel.org/PCI/endpoint/index.html

>
> >
> > > Nor continuing in register mode.
> >
> > Most virtio devices are implemented in software.
> We see it differently in field and in virtio charter since 2021.

I think we're talking about different things.

>
> > And we have pure MMIO
> > based transport now which is implemented in registers only.
> >
> > > Virtio already has VQ.
> > > If CVQ is so problematic, one should put everything on registers and not run
> > on double standards.
> >
> > I don't think there's anyone who says CVQ is problematic.
> >
> Ok, then let's stop this endless debate.
> Everyone is using CVQ, let's continue to use it.

The CVQ and the transport virtqueue are orthogonal. Again, nobody other
than you seems to think the transport virtqueue is a replacement for
CVQ.

>
> > >
> > > I captured all the reasoning and thoughts. I don’t have much to say in
> > support of infinite register scale.
> > >
> > > People who wants to push SIOV does not show single performance reason
> > on why SIOV to be done.
> > > I have upstreamed SIOVs in Linux as SFs without PASID, and in all our scale
> > tests, before the device chokes, the system chokes.
> > >
> > > So when someone pushes the SIOV series, I will be the first one interested in
> > reading the performance numbers to proceed with patches.
> > >
> > > > And now with 8MBytes
> > > > of on-device memory that's needed for migration and that's
> > > > apparently fine I am even less interested in saving 256 bytes for config
> > space.
> > >
> > > Again, not the right comparison.
> > > When and how to use 256 matters.
> >
> > Do you know how much the config has grown in the past years since 1.0?
> >
> Very less and no point in deviating the design now anyway for device migration or otherwise.
>
> > Virtio should be implemented easily from:
> >
> And it is already there.
>
> > 1) software device to hardware device
> > 2) embedded to server
> >
> > You can't say e.g migration is needed in all of the environments.
> Which line in the patch said this?

You said you don't want to let the register grow. No?

Why does an embedded virtio device need to implement an admin virtqueue
just for a new function like suspend or vq reset?

> I specifically asked to not build transport vq because efficiency is needed on the PFs too.

Transport virtqueue can be done in PF.

Thanks


>
> >
> > Thanks
> >
> > > I haven’t come across any device that prefers infinite register scale.




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-22  4:18                                                                                                       ` Jason Wang
@ 2023-11-22  4:26                                                                                                         ` Parav Pandit
  2023-11-24  3:07                                                                                                           ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-22  4:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 9:49 AM
> 
> On Wed, Nov 22, 2023 at 12:30 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 10:55 AM
> > >
> > > On Fri, Nov 17, 2023 at 8:03 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 5:13 PM
> > > > >
> > > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > > >
> > > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit
> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Additionally, if hypervisor has put the trap on
> > > > > > > > > > > > virtio config, and because the memory device
> > > > > > > > > > > > already has the interface for virtio config,
> > > > > > > > > > > >
> > > > > > > > > > > > Hypervisor can directly write/read from the
> > > > > > > > > > > > virtual config to the member's
> > > > > > > > > > > config space, without going through the device context,
> right?
> > > > > > > > > > >
> > > > > > > > > > > If it can do it or it can choose to not. I don't see
> > > > > > > > > > > how it is related to the discussion here.
> > > > > > > > > > >
> > > > > > > > > > It is. I don’t see a point of hypervisor not using the
> > > > > > > > > > native interface provided
> > > > > > > > > by the member device.
> > > > > > > > >
> > > > > > > > > So for example, it seems reasonable to a member
> > > > > > > > > supporting both existing pci register interface for
> > > > > > > > > compatibility and the future DMA based one for scale. In
> > > > > > > > > such a case, it seems possible that DMA will expose more
> > > > > > > > > features than pci. And then a hypervisor might decide to
> > > > > > > > > use
> > > > > > > that in preference to pci registers.
> > > > > > > >
> > > > > > > > We don’t find it right to involve owner device for
> > > > > > > > mediating at current scale
> > > > > > >
> > > > > > > In this model, device will be its own owner. Should not be a
> problem.
> > > > > > >
> > > > > > I didn’t understand above comment.
> > > > >
> > > > > We'd add a new group type "self". You can then send admin
> > > > > commands through VF itself not through PF.
> > > > >
> > > > How? The device is owned by the guest. FLR and device reset cannot
> > > > send
> > > the admin command reliably.
> > > >
> > > > >
> > > > > > > > and to not break TDISP efforts in upcoming time by such design.
> > > > > > >
> > > > > > > Look you either stop mentioning TDISP as motivation or
> > > > > > > actually try to address it. Safe migration with TDISP is really hard.
> > > > > > But that is not an excuse to say that TDISP migration is not
> > > > > > present, hence
> > > > > involve the owner device for config space access.
> > > > > > This is another hurdle added that further blocks us away from
> TDISP.
> > > > > > Hence, we don’t want to take the route of involving owner
> > > > > > device for any
> > > > > config access.
> > > > >
> > > > > This "blocks" is all just wild hunches. hypervisor controls some
> > > > > aspects of TDISP devices for sure - maybe we actually should use
> > > > > pci config space as that is generally hypervisor controlled.
> > > > Even bad to do hypercalls.
> > > > I showed you last time the role of the PCI config space snippet
> > > > from the
> > > spec.
> > > > Do you see we are repeating the discussion again?
> > > >
> > > > >
> > > > > > > For example, your current patches are clearly broken for TDISP:
> > > > > > > owner can control queue state at any time making device
> > > > > > > modify memory in any way it wants.
> > > > > > >
> > > > > > When TDISP migration is needed, the admin device can be
> > > > > > another TVM
> > > > > outside the HV scope.
> > > > > > Or an alternative would have device context encrypted not
> > > > > > visible to HV
> > > at all.
> > > > >
> > > > > Maybe. Fact remains your patches do conflict with TDISP and you
> > > > > seem to be fine with it because you have a hunch you can fix it.
> > > > > But we can't do development based on your hunches.
> > > > >
> > > > We have different view.
> > > > My patches do not conflict with TDISP because TDISP has clear
> > > > definition of
> > > not involving hypervisor for transport.
> > > > And that part is still preserved.
> > > > Delegating the migration to another TDISP or encrypting is yet to be
> defined.
> > > > And current patches will align to both the approaches in future.
> > > >
> > > > So you need to re-evaluate your judgment.
> > > >
> > > > >
> > > > > > Such encryption is not possible, with the trap+emulation
> > > > > > method, where HV
> > > > > will have to decrypt the data coming over MMIO writes.
> > > > >
> > > > > I don't how what trap+emulation has to do with it. Do you refer
> > > > > to the shadow vq thing?
> > > >
> > > > The method proposed here does not hinder any TDISP direction.
> > > >
> > > > Without my proposal, do you have a method that does not involve
> > > hypervisor intervention for virtio common and device config space,
> > > cvq and shadow vq?
> > > > If so, I would like to hear that as well because that will align with TDISP.
> > >
> > > So this is what you said:
> > >
> > > 1) TDISP would not do mediation
> > > 2) registers doesn't scale
> > >
> > > This is exactly what transport virtqueue did. Isn't it?
> > >
> > No.
> > CVQ is doing both of them currently uniformly across PF, VF.
> > Future SIOV will be able to this also.
> 
> Please explain how CVQ is related. Or how transport virtqueue blocks CVQ in
> any sense. Or any transport virtqueue commands do that.
>
Any new configuration of a PF or VF device is done over the CVQ directly by the guest, as listed in the doc.
The same is usable for SIOV too.
 
> Then let's continue the discussion here.
> 
> >
> > > >
> > > > > I am guessing modern platforms with TDISP support are likely to
> > > > > also support dirty bit in the IOMMU.
> > > > >
> > > > It will be some day.
> > >
> > > Dirty bit is far more realistic than TDISP in the short term.
> > >
> > > >
> > > > >
> > > > > > > > And for future scale, having new SIOV interface makes more
> > > > > > > > sense which has
> > > > > > > its own direct interface to device.
> > > > > > > >
> > > > > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZiVj8x1s73Ed6rOsmn6LfXc/edit?usp=sharing
> > > > > > >
> > > > > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > > > > discussion will take place on the mailing list please.
> > > > > >
> > > > > > We cannot keep discussing the register interface every week.
> > > > > > I remember we have discussed this many times already in
> > > > > > following
> > > series.
> > > > > >
> > > > > > 1. legacy series
> > >
> > > How can this be supported in TDISP then?
> 
> Please answer this question.
>
There is no requirement to support TDISP with legacy, because the TDISP architecture requires attestation and other things that were not there in legacy VMs anyway.
Hence it is not applicable.

 
> > >
> > > > > > 2. tvq v4 series
> > > > > > 3. dynamic vq creation series
> > > > > > > 4. again during suspend series under tvq head
> > > > > > > 5. right now
> > > > > > > 6. May be more that I forgot.
> > > > > >
> > > > > > I captured all the direction and options in the doc. One can
> > > > > > refer when those
> > > > > questions arise there.
> > > > > > If we don’t work cohesively same reasoning repetition does not
> help.
> > > > >
> > > > > It's still the same too, doc or no doc. You want to build a
> > > > > device without registers fine but don't force it down everyone's
> throat.
> > > > I don’t see any compelling reason for inventing new method really.
> > >
> > > New requests/platforms come for sure, and virtio supports various
> transports.
> > >
> > > For example, there's a request to support PCI endpoint devices.
> > What is PCI endpoint devices? Does it have a new device type?
> 
> https://docs.kernel.org/PCI/endpoint/index.html
> 
> >
> > >
> > > > Nor continuing in register mode.
> > >
> > > Most virtio devices are implemented in software.
> > We see it differently in field and in virtio charter since 2021.
> 
> I think we're talking about different things.
So let's make any arbitrary comment, as virtio is done in sw.

> 
> >
> > > And we have pure MMIO
> > > based transport now which is implemented in registers only.
> > >
> > > > Virtio already has VQ.
> > > > If CVQ is so problematic, one should put everything on registers
> > > > and not run
> > > on double standards.
> > >
> > > I don't think there's anyone who says CVQ is problematic.
> > >
> > Ok. than lets stop this endless debate.
> > Everyone is using CVQ, lets continue to use.
> 
> The CVQ and transport virtqueue is orthogonal. Again, there seems nobody
> think transport virtqueue is a replacement of CVQ other than you.
Performance numbers showing the need for a transport virtqueue are not published yet.
So there is no need for a transport virtqueue for VF and SIOV.

> 
> >
> > > >
> > > > I captured all the reasoning and thoughts. I don’t have much to
> > > > say in
> > > support of infinite register scale.
> > > >
> > > > People who wants to push SIOV does not show single performance
> > > > reason
> > > on why SIOV to be done.
> > > > I have upstreamed SIOVs in Linux as SFs without PASID, and in all
> > > > our scale
> > > tests, before the device chocks, the system chocks.
> > > >
> > > > So when someone pushes the SIOV series, I will be the first one
> > > > interested in
> > > reading the performance numbers to proceed with patches.
> > > >
> > > > > And now with 8MBytes
> > > > > of on-device memory that's needed for migration and that's
> > > > > apparently fine I am even less interested in saving 256 bytes
> > > > > for config
> > > space.
> > > >
> > > > Again, not the right comparison.
> > > > When and how to use 256 matters.
> > >
> > > Do you know how much the config has grown in the past years since 1.0?
> > >
> > Very less and no point in deviating the design now anyway for device
> migration or otherwise.
> >
> > > Virtio should be implemented easily from:
> > >
> > And it is already there.
> >
> > > 1) software device to hardware device
> > > 2) embedded to server
> > >
> > > You can't say e.g migration is needed in all of the environments.
> > Which line in the patch said this?
> 
> You said you don't want to let the register grow. No?
> 
Right.
Hence all config work should occur on the CVQ, by the driver owning the device, without mediation.


> Why do an embedded virtio device need to implement admin virtqueue just
> for a new function like suspend or vq reset?
> 
VQ reset and VQ enable are blocking operations by nature once device init is done.
Hence, they don’t belong in init-time config registers.

> > I specifically asked to not build transport vq because efficiency is needed
> on the PFs too.
> 
> Transport virtqueue can be done in PF.
> 
Please explain why the PF needs a transport VQ. I explained it all in the doc.



* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-22  4:16                                                                                                                 ` Jason Wang
@ 2023-11-22  4:39                                                                                                                   ` Parav Pandit
  2023-11-24  3:08                                                                                                                     ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-22  4:39 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 9:46 AM
> 
> On Wed, Nov 22, 2023 at 12:27 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 10:15 AM
> > >
> > > On Fri, Nov 17, 2023 at 11:09 PM Michael S. Tsirkin <mst@redhat.com>
> > > wrote:
> > > >
> > > > On Fri, Nov 17, 2023 at 02:51:04PM +0000, Parav Pandit wrote:
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 7:24 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 12:46:21PM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 6:00 PM
> > > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 5:13 PM
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit
> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav
> > > > > > > > > > > > Pandit
> > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000,
> > > > > > > > > > > > > > Parav Pandit
> > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Additionally, if hypervisor has put the
> > > > > > > > > > > > > > > > > trap on virtio config, and because the
> > > > > > > > > > > > > > > > > memory device already has the interface
> > > > > > > > > > > > > > > > > for virtio config,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hypervisor can directly write/read from
> > > > > > > > > > > > > > > > > the virtual config to the member's
> > > > > > > > > > > > > > > > config space, without going through the
> > > > > > > > > > > > > > > > device context,
> > > right?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If it can do it or it can choose to not. I
> > > > > > > > > > > > > > > > don't see how it is related to the discussion here.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It is. I don’t see a point of hypervisor not
> > > > > > > > > > > > > > > using the native interface provided
> > > > > > > > > > > > > > by the member device.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So for example, it seems reasonable to a
> > > > > > > > > > > > > > member supporting both existing pci register
> > > > > > > > > > > > > > interface for compatibility and the future DMA
> > > > > > > > > > > > > > based one for scale. In such a case, it seems
> > > > > > > > > > > > > > possible that DMA will expose more features
> > > > > > > > > > > > > > than pci. And then a hypervisor might decide
> > > > > > > > > > > > > > to use
> > > > > > > > > > > > that in preference to pci registers.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We don’t find it right to involve owner device
> > > > > > > > > > > > > for mediating at current scale
> > > > > > > > > > > >
> > > > > > > > > > > > In this model, device will be its own owner.
> > > > > > > > > > > > Should not be a
> > > problem.
> > > > > > > > > > > >
> > > > > > > > > > > I didn’t understand above comment.
> > > > > > > > > >
> > > > > > > > > > We'd add a new group type "self". You can then send
> > > > > > > > > > admin commands through VF itself not through PF.
> > > > > > > > > >
> > > > > > > > > How? The device is owned by the guest. FLR and device
> > > > > > > > > reset cannot send the
> > > > > > > > admin command reliably.
> > > > > > > >
> > > > > > > > It's of the "it hurts when I do this - don't do this then" category.
> > > > > > > >
> > > > > > > > it is don’t do mediation category, yes due all this
> > > > > > > weirdness that has been
> > > > > > asked.
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > > and to not break TDISP efforts in upcoming time
> > > > > > > > > > > > > by such
> > > design.
> > > > > > > > > > > >
> > > > > > > > > > > > Look you either stop mentioning TDISP as
> > > > > > > > > > > > motivation or actually try to address it. Safe
> > > > > > > > > > > > migration with TDISP is really
> > > hard.
> > > > > > > > > > > But that is not an excuse to say that TDISP
> > > > > > > > > > > migration is not present, hence
> > > > > > > > > > involve the owner device for config space access.
> > > > > > > > > > > This is another hurdle added that further blocks us
> > > > > > > > > > > away from
> > > TDISP.
> > > > > > > > > > > Hence, we don’t want to take the route of involving
> > > > > > > > > > > owner device for any
> > > > > > > > > > config access.
> > > > > > > > > >
> > > > > > > > > > This "blocks" is all just wild hunches. hypervisor
> > > > > > > > > > controls some aspects of TDISP devices for sure -
> > > > > > > > > > maybe we actually should use pci config space as that
> > > > > > > > > > is generally hypervisor
> > > controlled.
> > > > > > > > > Even bad to do hypercalls.
> > > > > > > > > I showed you last time the role of the PCI config space
> > > > > > > > > snippet from the
> > > > > > spec.
> > > > > > > >
> > > > > > > > Yes I remember. This is just an example though. My point
> > > > > > > > is maybe it is solvable maybe it is not.
> > > > > > > >
> > > > > > > > > Do you see we are repeating the discussion again?
> > > > > > > >
> > > > > > > > One of the reasons is that people bring up irrelevances.
> > > > > > > > TDISP is important but has to be addressed or deferred not
> > > > > > > > vaguely referred
> > > to.
> > > > > > >
> > > > > > > So lets continue to follow the current TDISP direction of
> > > > > > > not involving
> > > > > > hypervisor for virtio common and device config.
> > > > >
> > > > > If you disagree to it, please speak now, so that we don’t debate
> > > > > on this
> > > again in next 3 days.
> > > > > Because this is the fundamental design considerations it relied on.
> > > > > There is no point going forward if you want to disagree to it.
> > > > > Other variants are fine, but other variants cannot be the only choice.
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > For example, your current patches are clearly broken for
> TDISP:
> > > > > > > > > > > > owner can control queue state at any time making
> > > > > > > > > > > > device modify memory in any way it wants.
> > > > > > > > > > > >
> > > > > > > > > > > When TDISP migration is needed, the admin device can
> > > > > > > > > > > be another TVM
> > > > > > > > > > outside the HV scope.
> > > > > > > > > > > Or an alternative would have device context
> > > > > > > > > > > encrypted not visible to HV at
> > > > > > > > all.
> > > > > > > > > >
> > > > > > > > > > Maybe. Fact remains your patches do conflict with
> > > > > > > > > > TDISP and you seem to be fine with it because you have
> > > > > > > > > > a hunch you can
> > > fix it.
> > > > > > > > > > But we can't do development based on your hunches.
> > > > > > > > > >
> > > > > > > > > We have different view.
> > > > > > > > > My patches do not conflict with TDISP because TDISP has
> > > > > > > > > clear definition of
> > > > > > > > not involving hypervisor for transport.
> > > > > > > > > And that part is still preserved.
> > > > > > > > > Delegating the migration to another TDISP or encrypting
> > > > > > > > > is yet to be
> > > > > > defined.
> > > > > > > > > And current patches will align to both the approaches in
> future.
> > > > > > > > >
> > > > > > > > > So you need to re-evaluate your judgment.
> > > > > > > >
> > > > > > > > If you like they do not "conflict".  But if used with
> > > > > > > > TDISP they just make it insecure and thus completely
> > > > > > > > worthless.  If hypervisor can change ring state to make
> > > > > > > > device poke at random guest memory then it is game over
> > > > > > > > and all the effort spent was
> > > security theater.
> > > > > > > Not really, I proposed two options.
> > > > > > > 1. delegate the task of LM to the TVM. (proposed by two cpu
> vendors).
> > > > > > > In this case all the infra we build here, just works fine.
> > > > > >
> > > > > > I think modification will be needed: currently commands are
> > > > > > sent through the PF, and that is under hypervisor control.
> > > > > > You should not assign PF to TVM.
> > >
> > > That's the point. And that's why it keeps people confused to believe
> > > the current PF/adminq can work in the TDISP.
> > >
> > There is no confusion.
> 
> No, when LingShan points out the conflict, you just told us it will be
> addressed in the future.
>
The current proposal does not punch a hole in the TDISP TVM interface.
 
> And after Michael pointed it out again, you agree than adminq can not be
> part of PF in this context.
>
As explained, when TDISP for device migration evolves, there will be another trusted entity that does it.

 
> And you miss the fact that admin virtqueue today can't be used to manage
> the owner.
>
There is no such need for device migration.
 
> > The admin queue interface ensures first step that TDISP interface is
> dedicated to guest as today.
> > There is no bifurcation added on the VF that needs extra mediation.
> >
> > > > > Yes, an admin virtio function will be there which will do the
> > > > > admin
> > > commands listed.
> > > >
> > > > So it can't be PF, so at least we need a new group type.
> > > > I am inclined to then say, operate it through VF itself.
> > >
> > > So it exactly matches the idea of transport virtqueue (a per VF/SF one).
> > >
> > There is no need for transport virtqueue for VF as VF device has same
> uniform principle as PF.
> 
> I don't understand here. I have explained that you have invented a function
> duplication of transport virtqueue.
> 
No, we haven’t.
It is explained in the other thread that the functionality is different.

> > If you want transport vq, please have it on the PF too.
> 
> Nothing prevents this, actually transport virtqueue start from this.
> 
> > And that also is not needed because there is already CVQ.
> 
> I don't see why you keep mentioning CVQ. I don't see anyone that says
> transport virtqueue is going to replace CVQ.
>
I explained everything in the doc. This is the Nth time this is being repeated...
 
> >
> > > But it still requires a PCI part to bootstrap.
> > >
> > > >
> > > >
> > > > > >
> > > > > > > It also does not require any hypervisor mediation for control
> plane.
> > > > > > >
> > > > > > > 2. Encrypt the owner device workload to be not seen by
> > > > > > > hypervisor
> > > > > > >
> > > > > > > Both methods does not affect the current direction.
> > > > > > >
> > > > > > > But if we force trap+emulation, it is 100% broken for TDISP.
> > > > > > > And I would not promote that.
> > > > > > >
> > > > > > > > But you know this, don't you? This is why you mentioned
> > > > > > > > encrypting
> > > device.
> > > > > > > > Maybe that works. It just does not work *as is*.
> > > > > > > It works as_is. But current infrastructure does not block
> > > > > > > the future
> > > work.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Such encryption is not possible, with the
> > > > > > > > > > > trap+emulation method, where HV
> > > > > > > > > > will have to decrypt the data coming over MMIO writes.
> > > > > > > > > >
> > > > > > > > > > I don't how what trap+emulation has to do with it. Do
> > > > > > > > > > you refer to the shadow vq thing?
> > > > > > > > >
> > > > > > > > > The method proposed here does not hinder any TDISP
> direction.
> > > > > > > >
> > > > > > > > direction? No, why would it. we can always add more
> > > > > > > > commands that are safe for TDISP. commands you propose
> > > > > > > > here are unsafe for
> > > TDISP.
> > > > > > > >
> > > > > > > > > Without my proposal, do you have a method that does not
> > > > > > > > > involve hypervisor
> > > > > > > > intervention for virtio common and device config space,
> > > > > > > > cvq and shadow
> > > > > > vq?
> > > > > > > > > If so, I would like to hear that as well because that
> > > > > > > > > will align with
> > > TDISP.
> > > > > > > >
> > > > > > > > I really did not give it much thought.  I suspect for
> > > > > > > > TDISP it just might be cleaner to have guest agent migrate
> device.
> > > > > > > > Certainly removes all
> > > > > > the messy questions.
> > > > > > > > That, to me impliest there needs to be a way to send
> > > > > > > > migration commands through VF itself. Does this "involve
> > > > > > > > hypervisor intervention"? No one should care I think.
> > > > > > > Too far of the future to envision. May be yes. When such
> > > > > > > platform is built, for sure whoever migrates need migrate
> > > > > > > its device side
> > > too.
> > > > > > > Some knowledge of migration driver is needed.
> > > > > >
> > > > > > So TDISP migration is so far in the future you do not need to
> > > > > > bother
> > > about it.
> > > > > > Fine. Then don't bring it up pls.
> > > > > >
> > > > > As long as we are aligned to the requirement that a virtio
> > > > > member device is
> > > mapped to the guest VM without mediating the virtio interface, I am
> good.
> > > > > Again, other variants are fine, but above listed mapped variant
> > > > > is the
> > > minimum variant needed.
> > > >
> > > > I think it's worth supporting this. I wouldn't call this minimum
> > > > there are other approaches.  And I am not so sure it's worth
> > > > trying to support this in all kind of systems such as IOMMU
> > > > without dirty bit support. If some old systems will need
> > > > mediation, this is kind of like legacy interface. Not a big deal.
> > > >
> > >
> > > +1
> >
> > There are users with the recent cpus that may not have the IOMMU dirty
> page tracking support.
> > So I don’t fully agree.
> 
> There are setups that don't have SR-IOV or even PCI.
> 
Which are those? Please expand on the transports that need it.

> Let's have a unified standard.


* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-21 16:32                                                                                             ` Parav Pandit
@ 2023-11-22  5:27                                                                                               ` Jason Wang
  2023-11-22  6:05                                                                                                 ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-22  5:27 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 22, 2023 at 12:32 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 12:54 PM
> >
> > On Thu, Nov 16, 2023 at 1:28 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Thursday, November 16, 2023 9:50 AM
> > > >
> > > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, November 13, 2023 9:03 AM
> > > > > >
> > > > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Tuesday, November 7, 2023 9:35 AM
> > > > > > > >
> > > > > > > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > > > > > > >
> > > > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >

[...]

> > > >
> > > I disagree.
> > > There are two different functionalities.
> > >
> > > Functionality_1: explicit ask for read or write
> > > Functionality_2: read what has changed
> >
> > This needs to be justified. I won't repeat the questions again here.
> >
> As explained the use case in theory of operation already.
>
> > >
> > > Should one merge 1 and 2 and complicate the command?
> > > I prefer not to.
> >
> > Again there're functional duplications. E.g your command duplicates
> > common_cfg for sure.
> Nop. it is not.
> Common cfg is accessed directly by guest member driver.

It can be accessed directly, if we have adminq per VF.

>
> >
> > >
> > > Now having two different commands help for debugging to differentiate
> > > between mgmt. commands and guest initiated commands. :)
> > >
> > > > >
> > > > > > Guest configure the following one by one:
> > > > > >
> > > > > > 1) vq size
> > > > > > 2) vq addresses
> > > > > > 3) MSI-X
> > > > > >
> > > > > > etc?
> > > > > >
> > > > > I think you interpreted "incremental" differently than I described.
> > > > > In the device context read, the incremental is:
> > > > >
> > > > > If the hypervisor driver has read the device context twice, the
> > > > > second read
> > > > won't return any new data if nothing changed.
> > > >
> > > > See above.
> > > >
> > > Yeah, two separate commands needed.
> > >
> > > > > For example, if RSS configuration didn’t change between two reads,
> > > > > the
> > > > second read wont return the TLV for RSS Context.
> > > > >
> > > > > While for transport the need is, when guest asked, one device must
> > > > > read it
> > > > regardless of the change.
> > > > >
> > > > > So notion of incremental is not by address, but by the value.
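
A minimal sketch of the by-value notion of "incremental" quoted above,
contrasted with an explicit read that always returns the current value.
This is not taken from the patches: the class, method and field names
are hypothetical, and the real device context is a TLV stream whose
layout is defined by the proposal itself.

```python
from typing import Dict, List, Tuple


class DeviceContextModel:
    """Toy model (hypothetical) of by-value incremental device context reads."""

    def __init__(self) -> None:
        self._current: Dict[str, bytes] = {}        # live field values
        self._last_reported: Dict[str, bytes] = {}  # values as of the previous context read

    def set_field(self, name: str, value: bytes) -> None:
        """Guest or device activity changes a field."""
        self._current[name] = value

    def context_read(self) -> List[Tuple[str, bytes]]:
        """Management-plane read: return only fields whose value differs from
        what the previous context read reported, then record the new state."""
        changed = [(name, value)
                   for name, value in sorted(self._current.items())
                   if self._last_reported.get(name) != value]
        self._last_reported = dict(self._current)
        return changed

    def explicit_read(self, name: str) -> bytes:
        """Guest-initiated read: returns the current value regardless of change."""
        return self._current[name]


dev = DeviceContextModel()
dev.set_field("vq_config", b"\x10")
dev.set_field("rss_context", b"\xaa")
first = dev.context_read()     # both fields are new, both are returned
dev.set_field("vq_avail_idx", b"\x01")
second = dev.context_read()    # only the avail index changed
third = dev.context_read()     # nothing changed, nothing is returned
```

In this toy model the RSS entry is absent from the second and third
context reads because its value did not change, while explicit_read()
still returns it on request.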
> > > > >
> > > > > > > For example, VQ configuration is exchanged once between src and
> > dst.
> > > > > > > But VQ avail and used index may be updated multiple times.
> > > > > >
> > > > > > If it can work with multiple times of updating, why can't it
> > > > > > work if we just update it once?
> > > > > Functionally it can work.
> > > >
> > > > I think you answer yourself.
> > > >
> > > Yes, I don’t like abuse of command.
> >
> > How did you define abuse or can spec ever need to define that?
> I don’t have any different definition than dictionary definition for abuse. :)

"Abuse" is pretty subjective. Spec have driver normative, please
explain how and if you can use that.

>
> >
> > >
> > > > > Performance wise, one does not want to update multiple times,
> > > > > unless there
> > > > is a change.
> > > > >
> > > > > Read as explained above is not meant to return same content again.
> > > > >
> > > > > >
> > > > > > > So here hypervisor do not want to read any specific set of
> > > > > > > fields and
> > > > > > hypervisor is not parsing them either.
> > > > > > > It is just a byte stream for it.
> > > > > >
> > > > > > Firstly, spec must define the device context format, so
> > > > > > hypervisor can understand which byte is what otherwise you can't
> > > > > > maintain migration compatibility.
> > > > > Device context is defined already in the latest version.
> > > > >
> > > > > > Secondly, you can't mandate how the hypervisor is written.
> > > > > >
> > > > > > >
> > > > > > > As opposed to that, in case of transport, the guest explicitly
> > > > > > > asks to read or
> > > > > > write specific bytes.
> > > > > > > Therefore, it is not incremental.
> > > > > >
> > > > > > I'm totally lost. Which part of the transport is not incremental?
> > > > > >
> > > > > > >
> > > > > > > Additionally, if hypervisor has put the trap on virtio config,
> > > > > > > and because the memory device already has the interface for
> > > > > > > virtio config,
> > > > > > >
> > > > > > > Hypervisor can directly write/read from the virtual config to
> > > > > > > the member's
> > > > > > config space, without going through the device context, right?
> > > > > >
> > > > > > If it can do it or it can choose to not. I don't see how it is
> > > > > > related to the discussion here.
> > > > > >
> > > > > It is. I don’t see a point of hypervisor not using the native
> > > > > interface provided
> > > > by the member device.
> > > >
> > > > It really depends on the case, and I see how it duplicates with the
> > > > functionality that is provided by both:
> > > >
> > > > 1) The existing PCI transport
> > > >
> > > > or
> > > >
> > > > 2) The transport virtqueue
> > > >
> > > I would like to conclude that we disagree in our approaches.
> > > PCI transport is for member device to directly communicate from guest
> > driver to the device.
> > > This is uniform across PF, VFs, SIOV.
> >
> > For "PCi transport" did you mean the one defined in spec? If yes, how can it
> > work with SIOV with what you're saying here (a direct communication
> > channel)?
> >
> SIOV device may have same MMIO as VF.

We circle back: SIOV is for scalability. If you claim registers don't
scale, SIOV via MMIO doesn't scale either.

>
> > >
> > > Admin commands are transport independent and their task is device
> > migration.
> > > One is not replacing the other.
> > >
> > > Transport virtqueue will never transport driver notifications, hence it does
> > not qualify at "transport".
> >
> > Another double standard.
> I disagree. You coined the term transport vq, so stand behind it to transport everything.

Nope, it works like other transports. It doesn't aim to replace any
existing transport.

>
> >
> > MMIO will never transport device notification, hence it does not qualify as
> > "transport"?
> >
> How does interrupts work?

It depends on the platform, no?

> Seems like missing basic functionality in transport.

That is not necessarily the responsibility of a transport. A virtio
transport can't fly without a platform.

>
> > >
> > > For the vdpa case, there is no need for extra admin commands as the
> > mediation layer can directly use the interface available from the member
> > device itself.
> > >
> > > You continue to want to overload admin commands for dual purpose, does
> > not make sense to me.
> > >
> > > > >
> > > > >  > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > it is not good idea to overload management commands
> > > > > > > > > > > with actual run time
> > > > > > > > > > guest commands.
> > > > > > > > > > > The device context read writes are largely for incremental
> > updates.
> > > > > > > > > >
> > > > > > > > > > It doesn't matter if it is incremental or not but
> > > > > > > > > >
> > > > > > > > > It does because you want different functionality only for
> > > > > > > > > purpose of backward
> > > > > > > > compatibility.
> > > > > > > > > That also if the device does not offer them as portion of MMIO
> > BAR.
> > > > > > > >
> > > > > > > > I don't see how it is related to the "incremental part".
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 1) the function is there
> > > > > > > > > > 2) hypervisor can use that function if they want and
> > > > > > > > > > virtio
> > > > > > > > > > (spec) can't forbid that
> > > > > > > > > >
> > > > > > > > > It is not about forbidding or supporting.
> > > > > > > > > Its about what functionality to use for management plane
> > > > > > > > > and guest
> > > > > > plane.
> > > > > > > > > Both have different needs.
> > > > > > > >
> > > > > > > > People can have different views, there's nothing we can
> > > > > > > > prevent a hypervisor from using it as a transport as far as I can see.
> > > > > > > For device context write command, it can be used (or probably
> > > > > > > abused) to do
> > > > > > write but I fail to see why to use it.
> > > > > >
> > > > > > The function is there, you can't prevent people from doing that.
> > > > > >
> > > > > One can always mess up itself. :)
> > > > > It is not prevented. It is just not right way to use the interface.
> > > > >
> > > > > > > Because member device already has the interface to do config
> > > > > > > read/write and
> > > > > > it is accessible to the hypervisor.
> > > > > >
> > > > > > Well, it looks self-contradictory again. Are you saying another
> > > > > > set of commands that is similar to device context is needed for
> > > > > > non-PCI
> > > > transport?
> > > > > >
> > > > > All these non pci transport discussion is just meaning less.
> > > > > Let MMIO bring the concept of member device at that point
> > > > > something make
> > > > sense to discuss.
> > > >
> > > > It's not necessarily MMIO. For example the SIOV, which I don't think
> > > > can use the existing PCI transport.
> > > >
> > > > > PCI SIOV is also the PCI device at the end.
> > > >
> > > > We don't want to end up with two sets of commands to save/load SRIOV
> > > > and SIOV at least.
> > > >
> > > This proposal ensures that SRIOV and SIOV devices are treated equally.
> >
> > How? Did you mean your proposal can work for SIOV? What's the transport
> > then?
> Yes. All majority of the device contexts should work for SIOV device as_is.
> Member id would be different.
> Some device context TLVs may be new as SIOV may have some simplifications as it may not have the giant register space like current one.

You only explained the migration part, not the transport part.

>
> >
> > > How brand new non-compatible SIOV device to transport this, is outside of
> > the scope of this work.
> >
> > You invented one that can be used for doing this. If you disagree, how can we
> > know your proposal can work for SIOV without a transport then?
>
> I don’t understand your comment.
>
> All I am saying is, most pieces of device contexts are reusable across VFs and SIOVs.
> When SIOV is defined, we can relook at what may need to be added.

I've explained this several times. The transport virtqueue is not
designed solely for SIOV. SIOV could be one of the use cases.

Thanks

>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-22  5:27                                                                                               ` Jason Wang
@ 2023-11-22  6:05                                                                                                 ` Parav Pandit
  2023-11-24  3:40                                                                                                   ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-22  6:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas


> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 10:57 AM
> 
> On Wed, Nov 22, 2023 at 12:32 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 12:54 PM
> > >
> > > On Thu, Nov 16, 2023 at 1:28 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Thursday, November 16, 2023 9:50 AM
> > > > >
> > > > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> > > wrote:
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Monday, November 13, 2023 9:03 AM
> > > > > > >
> > > > > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Tuesday, November 7, 2023 9:35 AM
> > > > > > > > >
> > > > > > > > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> 
> [...]
> 
> > > > >
> > > > I disagree.
> > > > There are two different functionalities.
> > > >
> > > > Functionality_1: explicit ask for read or write
> > > > Functionality_2: read what has changed
> > >
> > > This needs to be justified. I won't repeat the questions again here.
> > >
> > As explained the use case in theory of operation already.
> >
> > > >
> > > > Should one merge 1 and 2 and complicate the command?
> > > > I prefer not to.
> > >
> > > Again there're functional duplications. E.g your command duplicates
> > > common_cfg for sure.
> > Nop. it is not.
> > Common cfg is accessed directly by guest member driver.
> 
> It can be accessed directly, if we have adminq per VF.
Sure. Instead, I proposed the CVQ, as there is no need to burn another queue.

> 
> >
> > >
> > > >
> > > > Now having two different commands help for debugging to
> > > > differentiate between mgmt. commands and guest initiated commands.
> > > > :)
> > > >
> > > > > >
> > > > > > > Guest configure the following one by one:
> > > > > > >
> > > > > > > 1) vq size
> > > > > > > 2) vq addresses
> > > > > > > 3) MSI-X
> > > > > > >
> > > > > > > etc?
> > > > > > >
> > > > > > I think you interpreted "incremental" differently than I described.
> > > > > > In the device context read, the incremental is:
> > > > > >
> > > > > > If the hypervisor driver has read the device context twice,
> > > > > > the second read
> > > > > won't return any new data if nothing changed.
> > > > >
> > > > > See above.
> > > > >
> > > > Yeah, two separate commands needed.
> > > >
> > > > > > For example, if RSS configuration didn’t change between two
> > > > > > reads, the
> > > > > second read wont return the TLV for RSS Context.
> > > > > >
> > > > > > While for transport the need is, when guest asked, one device
> > > > > > must read it
> > > > > regardless of the change.
> > > > > >
> > > > > > So notion of incremental is not by address, but by the value.
> > > > > >
> > > > > > > > For example, VQ configuration is exchanged once between
> > > > > > > > src and
> > > dst.
> > > > > > > > But VQ avail and used index may be updated multiple times.
> > > > > > >
> > > > > > > If it can work with multiple times of updating, why can't it
> > > > > > > work if we just update it once?
> > > > > > Functionally it can work.
> > > > >
> > > > > I think you answer yourself.
> > > > >
> > > > Yes, I don’t like abuse of command.
> > >
> > > How did you define abuse or can spec ever need to define that?
> > I don’t have any different definition than dictionary definition for
> > abuse. :)
> 
> "Abuse" is pretty subjective. Spec have driver normative, please explain how
> and if you can use that.
> 
:)

So be it.

The spec does not need to define the dictionary.

For explicit reads and writes, simple read/write commands seem right to me.
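
The distinction debated here, an explicit read that always returns the requested value versus a device context read that returns only what changed since the previous read, can be illustrated with a toy model. This is a minimal sketch and not the proposed command format: the class `ToyDeviceContext`, its method names, and the field names `rss_context` and `vq0_avail_idx` are all hypothetical.

```python
class ToyDeviceContext:
    """Hypothetical model of a member device's context as named fields."""

    def __init__(self):
        self._current = {}   # latest value of every field
        self._reported = {}  # values already returned by an incremental read

    def write_field(self, name, value):
        # Device-side update, e.g. the guest driver changed some state.
        self._current[name] = value

    def read_field(self, name):
        # Explicit read: returns the value whether or not it has changed.
        return self._current[name]

    def read_incremental(self):
        # Incremental read: returns only fields whose value differs from
        # what the previous incremental read reported.
        changed = {
            name: value
            for name, value in self._current.items()
            if name not in self._reported or self._reported[name] != value
        }
        self._reported.update(changed)
        return changed


ctx = ToyDeviceContext()
ctx.write_field("rss_context", b"\x01")
ctx.write_field("vq0_avail_idx", 5)
first = ctx.read_incremental()    # both fields are new
ctx.write_field("vq0_avail_idx", 9)
second = ctx.read_incremental()   # only the index changed
third = ctx.read_incremental()    # nothing changed
```

In this model the unchanged `rss_context` field is absent from the second incremental read, while `read_field("rss_context")` still returns it on request. That is the "incremental by value, not by address" behavior described in the quoted text.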

> >
> > >
> > > >
> > > > > > Performance wise, one does not want to update multiple times,
> > > > > > unless there
> > > > > is a change.
> > > > > >
> > > > > > Read as explained above is not meant to return same content again.
> > > > > >
> > > > > > >
> > > > > > > > So here hypervisor do not want to read any specific set of
> > > > > > > > fields and
> > > > > > > hypervisor is not parsing them either.
> > > > > > > > It is just a byte stream for it.
> > > > > > >
> > > > > > > Firstly, spec must define the device context format, so
> > > > > > > hypervisor can understand which byte is what otherwise you
> > > > > > > can't maintain migration compatibility.
> > > > > > Device context is defined already in the latest version.
> > > > > >
> > > > > > > Secondly, you can't mandate how the hypervisor is written.
> > > > > > >
> > > > > > > >
> > > > > > > > As opposed to that, in case of transport, the guest
> > > > > > > > explicitly asks to read or
> > > > > > > write specific bytes.
> > > > > > > > Therefore, it is not incremental.
> > > > > > >
> > > > > > > I'm totally lost. Which part of the transport is not incremental?
> > > > > > >
> > > > > > > >
> > > > > > > > Additionally, if hypervisor has put the trap on virtio
> > > > > > > > config, and because the memory device already has the
> > > > > > > > interface for virtio config,
> > > > > > > >
> > > > > > > > Hypervisor can directly write/read from the virtual config
> > > > > > > > to the member's
> > > > > > > config space, without going through the device context, right?
> > > > > > >
> > > > > > > If it can do it or it can choose to not. I don't see how it
> > > > > > > is related to the discussion here.
> > > > > > >
> > > > > > It is. I don’t see a point of hypervisor not using the native
> > > > > > interface provided
> > > > > by the member device.
> > > > >
> > > > > It really depends on the case, and I see how it duplicates with
> > > > > the functionality that is provided by both:
> > > > >
> > > > > 1) The existing PCI transport
> > > > >
> > > > > or
> > > > >
> > > > > 2) The transport virtqueue
> > > > >
> > > > I would like to conclude that we disagree in our approaches.
> > > > PCI transport is for member device to directly communicate from
> > > > guest
> > > driver to the device.
> > > > This is uniform across PF, VFs, SIOV.
> > >
> > > For "PCi transport" did you mean the one defined in spec? If yes,
> > > how can it work with SIOV with what you're saying here (a direct
> > > communication channel)?
> > >
> > SIOV device may have same MMIO as VF.
> 
> We circle back, SIOV is for scalability. If you claim registers don't scale, SVIO via
> MMIO doesn't scale either.
Hence, any new and slow things should stay off MMIO.
And for currently defined things, Lingshan will show the performance numbers explaining why they should be transported via a virtqueue.

> 
> >
> > > >
> > > > Admin commands are transport independent and their task is device
> > > migration.
> > > > One is not replacing the other.
> > > >
> > > > Transport virtqueue will never transport driver notifications,
> > > > hence it does
> > > not qualify at "transport".
> > >
> > > Another double standard.
> > I disagree. You coined the term transport vq, so stand behind it to transport
> everything.
> 
> Nope, it works like other transport. It doesn't aim to replace any existing
> transport.
> 
> >
> > >
> > > MMIO will never transport device notification, hence it does not
> > > qualify as "transport"?
> > >
> > How does interrupts work?
> 
> It depends on the platform, no?
I don’t know. How does an MMIO virtio hardware device deliver an interrupt to an x86 CPU?

> 
> > Seems like missing basic functionality in transport.
> 
> Not necessarily the charge of a transport. Virtio transport can't fly without a
> platform.
What is the equivalent of MSI-X in MMIO?
> 
> >
> > > >
> > > > For the vdpa case, there is no need for extra admin commands as
> > > > the
> > > mediation layer can directly use the interface available from the
> > > member device itself.
> > > >
> > > > You continue to want to overload admin commands for dual purpose,
> > > > does
> > > not make sense to me.
> > > >
> > > > > >
> > > > > >  > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > it is not good idea to overload management
> > > > > > > > > > > > commands with actual run time
> > > > > > > > > > > guest commands.
> > > > > > > > > > > > The device context read writes are largely for
> > > > > > > > > > > > incremental
> > > updates.
> > > > > > > > > > >
> > > > > > > > > > > It doesn't matter if it is incremental or not but
> > > > > > > > > > >
> > > > > > > > > > It does because you want different functionality only
> > > > > > > > > > for purpose of backward
> > > > > > > > > compatibility.
> > > > > > > > > > That also if the device does not offer them as portion
> > > > > > > > > > of MMIO
> > > BAR.
> > > > > > > > >
> > > > > > > > > I don't see how it is related to the "incremental part".
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 1) the function is there
> > > > > > > > > > > 2) hypervisor can use that function if they want and
> > > > > > > > > > > virtio
> > > > > > > > > > > (spec) can't forbid that
> > > > > > > > > > >
> > > > > > > > > > It is not about forbidding or supporting.
> > > > > > > > > > Its about what functionality to use for management
> > > > > > > > > > plane and guest
> > > > > > > plane.
> > > > > > > > > > Both have different needs.
> > > > > > > > >
> > > > > > > > > People can have different views, there's nothing we can
> > > > > > > > > prevent a hypervisor from using it as a transport as far as I can
> see.
> > > > > > > > For device context write command, it can be used (or
> > > > > > > > probably
> > > > > > > > abused) to do
> > > > > > > write but I fail to see why to use it.
> > > > > > >
> > > > > > > The function is there, you can't prevent people from doing that.
> > > > > > >
> > > > > > One can always mess up itself. :) It is not prevented. It is
> > > > > > just not right way to use the interface.
> > > > > >
> > > > > > > > Because member device already has the interface to do
> > > > > > > > config read/write and
> > > > > > > it is accessible to the hypervisor.
> > > > > > >
> > > > > > > Well, it looks self-contradictory again. Are you saying
> > > > > > > another set of commands that is similar to device context is
> > > > > > > needed for non-PCI
> > > > > transport?
> > > > > > >
> > > > > > All these non pci transport discussion is just meaning less.
> > > > > > Let MMIO bring the concept of member device at that point
> > > > > > something make
> > > > > sense to discuss.
> > > > >
> > > > > It's not necessarily MMIO. For example the SIOV, which I don't
> > > > > think can use the existing PCI transport.
> > > > >
> > > > > > PCI SIOV is also the PCI device at the end.
> > > > >
> > > > > We don't want to end up with two sets of commands to save/load
> > > > > SRIOV and SIOV at least.
> > > > >
> > > > This proposal ensures that SRIOV and SIOV devices are treated equally.
> > >
> > > How? Did you mean your proposal can work for SIOV? What's the
> > > transport then?
> > Yes. All majority of the device contexts should work for SIOV device as_is.
> > Member id would be different.
> > Some device context TLVs may be new as SIOV may have some
> simplifications as it may not have the giant register space like current one.
> 
> You only explain the migration part but not the transport part.
> 
Because SIOV is still under construction in the community. There is no point in defining a SIOV transport for a half-cooked spec.
It is only logical to finish device migration for the well-defined member device first.

> >
> > >
> > > > How brand new non-compatible SIOV device to transport this, is
> > > > outside of
> > > the scope of this work.
> > >
> > > You invented one that can be used for doing this. If you disagree,
> > > how can we know your proposal can work for SIOV without a transport
> then?
> >
> > I don’t understand your comment.
> >
> > All I am saying is, most pieces of device contexts are reusable across VFs and
> SIOVs.
> > When SIOV is defined, we can relook at what may need to be added.
> 
> I've explained several times. Transport virtqueue is not solely designed for
> SIOV. SIOV could be one of the use cases.

Seems like a blunder...


* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-22  4:13                                                                                                   ` Jason Wang
@ 2023-11-22  7:48                                                                                                     ` Michael S. Tsirkin
  2023-11-24  3:56                                                                                                       ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-22  7:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 22, 2023 at 12:13:34PM +0800, Jason Wang wrote:
> > > What's wrong if we just allow them to be R/W over adminq/cmmands?
> > >
> > As explained before,
> > Each guest has its own dedicated non mediated interface as defined in virtio spec to not involve hypervisor.
> 
> So what's wrong with inventing per VF queue to do that? For example
> transport virtqueue.

Nothing is wrong with this.

But what is problematic is just re-using config space for migration because
it means we can not just say "don't access device after it is stopped"
because yes you need to access it to save/restore state.
And a new interface over admin cmds just for this side-steps the
issue nicely.

-- 
MST
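
The argument above can be sketched as a toy model: once a device is stopped, its driver-facing config interface can simply be declared off-limits, because saving and restoring state goes through a separate admin path. This is an illustrative sketch only. The class `ToyMember`, its methods, and the `queue_size` field are hypothetical and do not correspond to any interface defined in the spec or in this patch series.

```python
class StoppedDeviceError(Exception):
    """Raised when the driver-facing interface is used after stop."""


class ToyMember:
    """Hypothetical member device with two separate access paths."""

    def __init__(self):
        self.stopped = False
        self._state = {"queue_size": 256}

    # Driver-facing path (stands in for config space access).
    def config_read(self, name):
        if self.stopped:
            # The simple rule: don't access the device after it is stopped.
            raise StoppedDeviceError("config access after stop")
        return self._state[name]

    # Owner-facing path (stands in for admin commands).
    def admin_stop(self):
        self.stopped = True

    def admin_context_read(self):
        # State can still be saved after stop without touching config space.
        return dict(self._state)

    def admin_context_write(self, saved):
        # Restore previously saved state on the destination.
        self._state = dict(saved)


src = ToyMember()
before_stop = src.config_read("queue_size")
src.admin_stop()
saved = src.admin_context_read()

dst = ToyMember()
dst.admin_context_write(saved)
```

If migration instead reused `config_read` to save state, the stopped check would need exceptions, which is the complication the separate admin interface avoids in this model.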



* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-22  4:26                                                                                                         ` Parav Pandit
@ 2023-11-24  3:07                                                                                                           ` Jason Wang
  2023-11-24 11:38                                                                                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-24  3:07 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 22, 2023 at 12:26 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 9:49 AM
> >
> > On Wed, Nov 22, 2023 at 12:30 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 10:55 AM
> > > >
> > > > On Fri, Nov 17, 2023 at 8:03 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 5:13 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000, Parav Pandit
> > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Additionally, if hypervisor has put the trap on
> > > > > > > > > > > > > virtio config, and because the memory device
> > > > > > > > > > > > > already has the interface for virtio config,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hypervisor can directly write/read from the
> > > > > > > > > > > > > virtual config to the member's
> > > > > > > > > > > > config space, without going through the device context,
> > right?
> > > > > > > > > > > >
> > > > > > > > > > > > If it can do it or it can choose to not. I don't see
> > > > > > > > > > > > how it is related to the discussion here.
> > > > > > > > > > > >
> > > > > > > > > > > It is. I don’t see a point of hypervisor not using the
> > > > > > > > > > > native interface provided
> > > > > > > > > > by the member device.
> > > > > > > > > >
> > > > > > > > > > So for example, it seems reasonable to a member
> > > > > > > > > > supporting both existing pci register interface for
> > > > > > > > > > compatibility and the future DMA based one for scale. In
> > > > > > > > > > such a case, it seems possible that DMA will expose more
> > > > > > > > > > features than pci. And then a hypervisor might decide to
> > > > > > > > > > use
> > > > > > > > that in preference to pci registers.
> > > > > > > > >
> > > > > > > > > We don’t find it right to involve owner device for
> > > > > > > > > mediating at current scale
> > > > > > > >
> > > > > > > > In this model, device will be its own owner. Should not be a
> > problem.
> > > > > > > >
> > > > > > > I didn’t understand above comment.
> > > > > >
> > > > > > We'd add a new group type "self". You can then send admin
> > > > > > commands through VF itself not through PF.
> > > > > >
> > > > > How? The device is owned by the guest. FLR and device reset cannot
> > > > > send
> > > > the admin command reliably.
> > > > >
> > > > > >
> > > > > > > > > and to not break TDISP efforts in upcoming time by such design.
> > > > > > > >
> > > > > > > > Look you either stop mentioning TDISP as motivation or
> > > > > > > > actually try to address it. Safe migration with TDISP is really hard.
> > > > > > > But that is not an excuse to say that TDISP migration is not
> > > > > > > present, hence
> > > > > > involve the owner device for config space access.
> > > > > > > This is another hurdle added that further blocks us away from
> > TDISP.
> > > > > > > Hence, we don’t want to take the route of involving owner
> > > > > > > device for any
> > > > > > config access.
> > > > > >
> > > > > > This "blocks" is all just wild hunches. hypervisor controls some
> > > > > > aspects of TDISP devices for sure - maybe we actually should use
> > > > > > pci config space as that is generally hypervisor controlled.
> > > > > Even bad to do hypercalls.
> > > > > I showed you last time the role of the PCI config space snippet
> > > > > from the
> > > > spec.
> > > > > Do you see we are repeating the discussion again?
> > > > >
> > > > > >
> > > > > > > > For example, your current patches are clearly broken for TDISP:
> > > > > > > > owner can control queue state at any time making device
> > > > > > > > modify memory in any way it wants.
> > > > > > > >
> > > > > > > When TDISP migration is needed, the admin device can be
> > > > > > > another TVM
> > > > > > outside the HV scope.
> > > > > > > Or an alternative would have device context encrypted not
> > > > > > > visible to HV
> > > > at all.
> > > > > >
> > > > > > Maybe. Fact remains your patches do conflict with TDISP and you
> > > > > > seem to be fine with it because you have a hunch you can fix it.
> > > > > > But we can't do development based on your hunches.
> > > > > >
> > > > > We have different view.
> > > > > My patches do not conflict with TDISP because TDISP has clear
> > > > > definition of
> > > > not involving hypervisor for transport.
> > > > > And that part is still preserved.
> > > > > Delegating the migration to another TDISP or encrypting is yet to be
> > defined.
> > > > > And current patches will align to both the approaches in future.
> > > > >
> > > > > So you need to re-evaluate your judgment.
> > > > >
> > > > > >
> > > > > > > Such encryption is not possible, with the trap+emulation
> > > > > > > method, where HV
> > > > > > will have to decrypt the data coming over MMIO writes.
> > > > > >
> > > > > > I don't how what trap+emulation has to do with it. Do you refer
> > > > > > to the shadow vq thing?
> > > > >
> > > > > The method proposed here does not hinder any TDISP direction.
> > > > >
> > > > > Without my proposal, do you have a method that does not involve
> > > > hypervisor intervention for virtio common and device config space,
> > > > cvq and shadow vq?
> > > > > If so, I would like to hear that as well because that will align with TDISP.
> > > >
> > > > So this is what you said:
> > > >
> > > > 1) TDISP would not do mediation
> > > > 2) registers doesn't scale
> > > >
> > > > This is exactly what transport virtqueue did. Isn't it?
> > > >
> > > No.
> > > CVQ is doing both of them currently uniformly across PF, VF.
> > > Future SIOV will be able to this also.
> >
> > Please explain how CVQ is related. Or how transport virtqueue blocks CVQ in
> > any sense. Or any transport virtqueue commands do that.
> >
> Any new configuration of PF, VF device is done over CVQ as listed in the doc by the guest directly.
> Same is usable for SIOV too.
>
> > Then let's continue the discussion here.
> >
> > >
> > > > >
> > > > > > I am guessing modern platforms with TDISP support are likely to
> > > > > > also support dirty bit in the IOMMU.
> > > > > >
> > > > > It will be some day.
> > > >
> > > > Dirty bit is far more realistic than TDISP in the short term.
> > > >
> > > > >
> > > > > >
> > > > > > > > > And for future scale, having new SIOV interface makes more
> > > > > > > > > sense which has
> > > > > > > > its own direct interface to device.
> > > > > > > > >
> > > > > > > > > I finally captured all past discussions in form of a FAQ at [1].
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > > https://docs.google.com/document/d/1Iyn-l3Nm0yls3pZaul4lZi
> > > > > > > > > Vj8x
> > > > > > > > > 1s73
> > > > > > > > > Ed6r
> > > > > > > > > Osmn6LfXc/edit?usp=sharing
> > > > > > > >
> > > > > > > > Yea skimmed that, "Cons: None". Are you 100% sure? Anyway,
> > > > > > > > discussion will take place on the mailing list please.
> > > > > > >
> > > > > > > We cannot keep discussing the register interface every week.
> > > > > > > I remember we have discussed this many times already in
> > > > > > > following
> > > > series.
> > > > > > >
> > > > > > > 1. legacy series
> > > >
> > > > How can this be supported in TDISP then?
> >
> > Please answer this question.
> >
> There is no requirement to support TDISP with legacy because TDISP archi require attestation and other things which were not there in legacy VMs anyway.

How is TDISP/attestation related to a specific device driver? How can
a legacy DPDK driver break things like attestation here?

> Hence it is not applicable.

Wow. You invented the legacy tunnel just several months ago, and it has
already become a second-class citizen in this proposal?

>
>
> > > >
> > > > > > > 2. tvq v4 series
> > > > > > > 3. dynamic vq creation series
> > > > > > > 4. again during suspend series under tvq head 5. right now 6.
> > > > > > > May be more that I forgot.
> > > > > > >
> > > > > > > I captured all the direction and options in the doc. One can
> > > > > > > refer when those
> > > > > > questions arise there.
> > > > > > > If we don’t work cohesively same reasoning repetition does not
> > help.
> > > > > >
> > > > > > It's still the same too, doc or no doc. You want to build a
> > > > > > device without registers fine but don't force it down everyone's
> > throat.
> > > > > I don’t see any compelling reason for inventing new method really.
> > > >
> > > > New requests/platforms come for sure, and virtio supports various
> > transports.
> > > >
> > > > For example, there's a request to support PCI endpoint devices.
> > > What is PCI endpoint devices? Does it have a new device type?
> >
> > https://docs.kernel.org/PCI/endpoint/index.html
> >
> > >
> > > >
> > > > > Nor continuing in register mode.
> > > >
> > > > Most virtio devices are implemented in software.
> > > We see it differently in field and in virtio charter since 2021.
> >
> > I think we're talking about different things.
> So lets make any arbitrary comment as virtio is done in sw.
>
> >
> > >
> > > > And we have pure MMIO
> > > > based transport now which is implemented in registers only.
> > > >
> > > > > Virtio already has VQ.
> > > > > If CVQ is so problematic, one should put everything on registers
> > > > > and not run
> > > > on double standards.
> > > >
> > > > I don't think there's anyone who says CVQ is problematic.
> > > >
> > > Ok. than lets stop this endless debate.
> > > Everyone is using CVQ, lets continue to use.
> >
> > The CVQ and transport vitqueue is orthogonal. Again, there seems nobody
> > think transport virtqueue is a replacement of CVQ other than you.
> The performance numbers for the need of transport virtqueue are not published yet.

Have you published any numbers with admin virtqueue so far?

> So there is no need of transport virtqueue for VF and SIOV.

Let's not apply a double standard, please.

>
> >
> > >
> > > > >
> > > > > I captured all the reasoning and thoughts. I don’t have much to
> > > > > say in
> > > > support of infinite register scale.
> > > > >
> > > > > People who wants to push SIOV does not show single performance
> > > > > reason
> > > > on why SIOV to be done.
> > > > > I have upstreamed SIOVs in Linux as SFs without PASID, and in all
> > > > > our scale
> > > > tests, before the device chocks, the system chocks.
> > > > >
> > > > > So when someone pushes the SIOV series, I will be the first one
> > > > > interested in
> > > > reading the performance numbers to proceed with patches.
> > > > >
> > > > > > And now with 8MBytes
> > > > > > of on-device memory that's needed for migration and that's
> > > > > > apparently fine I am even less interested in saving 256 bytes
> > > > > > for config
> > > > space.
> > > > >
> > > > > Again, not the right comparison.
> > > > > When and how to use 256 matters.
> > > >
> > > > Do you know how much the config has grown in the past years since 1.0?
> > > >
> > > Very less and no point in deviating the design now anyway for device
> > migration or otherwise.
> > >
> > > > Virtio should be implemented easily from:
> > > >
> > > And it is already there.
> > >
> > > > 1) software device to hardware device
> > > > 2) embedded to server
> > > >
> > > > You can't say e.g migration is needed in all of the environments.
> > > Which line in the patch said this?
> >
> > You said you don't want to let the register grow. No?
> >
> Right.
> Hence all config work to occur on CVQ by the driver owning the device without mediation.

CVQ is device specific, it can't be used for things like vq reset.

>
>
> > Why does an embedded virtio device need to implement an admin virtqueue just
> > for a new function like suspend or vq reset?
> >
> VQ reset and VQ enable are blocking operations by nature once device init is done.

Let's not have any assumption of transport.

> Hence, they don’t belong to init time config registers.

I don't get how this is related.

>
> > > I specifically asked to not build transport vq because efficiency is needed
> > on the PFs too.
> >
> > Transport virtqueue can be done in PF.
> >
> Please explain why PF needs a transport VQ. I explained all in the doc.

You invent a transport via adminq, then you want me to explain why?

What's more, the rationale has been in the transport virtqueue series.
It fits for the case where the register can't work well. SIOV is only
one of the possible use cases.

Thanks

>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-22  4:39                                                                                                                   ` Parav Pandit
@ 2023-11-24  3:08                                                                                                                     ` Jason Wang
  0 siblings, 0 replies; 341+ messages in thread
From: Jason Wang @ 2023-11-24  3:08 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 22, 2023 at 12:39 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 9:46 AM
> >
> > On Wed, Nov 22, 2023 at 12:27 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 10:15 AM
> > > >
> > > > On Fri, Nov 17, 2023 at 11:09 PM Michael S. Tsirkin <mst@redhat.com>
> > > > wrote:
> > > > >
> > > > > On Fri, Nov 17, 2023 at 02:51:04PM +0000, Parav Pandit wrote:
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 7:24 PM
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 12:46:21PM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Friday, November 17, 2023 6:00 PM
> > > > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > > > >
> > > > > > > > > On Fri, Nov 17, 2023 at 12:02:49PM +0000, Parav Pandit wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > Sent: Friday, November 17, 2023 5:13 PM
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Nov 17, 2023 at 11:20:14AM +0000, Parav Pandit
> > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > > Sent: Friday, November 17, 2023 4:41 PM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Nov 17, 2023 at 10:20:45AM +0000, Parav
> > > > > > > > > > > > > Pandit
> > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:38 PM
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Nov 15, 2023 at 05:39:43PM +0000,
> > > > > > > > > > > > > > > Parav Pandit
> > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Additionally, if hypervisor has put the
> > > > > > > > > > > > > > > > > > trap on virtio config, and because the
> > > > > > > > > > > > > > > > > > memory device already has the interface
> > > > > > > > > > > > > > > > > > for virtio config,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hypervisor can directly write/read from
> > > > > > > > > > > > > > > > > > the virtual config to the member's
> > > > > > > > > > > > > > > > > config space, without going through the
> > > > > > > > > > > > > > > > > device context,
> > > > right?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If it can do it or it can choose to not. I
> > > > > > > > > > > > > > > > > don't see how it is related to the discussion here.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It is. I don’t see a point of hypervisor not
> > > > > > > > > > > > > > > > using the native interface provided
> > > > > > > > > > > > > > > by the member device.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So for example, it seems reasonable to a
> > > > > > > > > > > > > > > member supporting both existing pci register
> > > > > > > > > > > > > > > interface for compatibility and the future DMA
> > > > > > > > > > > > > > > based one for scale. In such a case, it seems
> > > > > > > > > > > > > > > possible that DMA will expose more features
> > > > > > > > > > > > > > > than pci. And then a hypervisor might decide
> > > > > > > > > > > > > > > to use
> > > > > > > > > > > > > that in preference to pci registers.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We don’t find it right to involve owner device
> > > > > > > > > > > > > > for mediating at current scale
> > > > > > > > > > > > >
> > > > > > > > > > > > > In this model, device will be its own owner.
> > > > > > > > > > > > > Should not be a
> > > > problem.
> > > > > > > > > > > > >
> > > > > > > > > > > > I didn’t understand above comment.
> > > > > > > > > > >
> > > > > > > > > > > We'd add a new group type "self". You can then send
> > > > > > > > > > > admin commands through VF itself not through PF.
> > > > > > > > > > >
> > > > > > > > > > How? The device is owned by the guest. FLR and device
> > > > > > > > > > reset cannot send the
> > > > > > > > > admin command reliably.
> > > > > > > > >
> > > > > > > > > It's of the "it hurts when I do this - don't do this then" category.
> > > > > > > > >
> > > > > > > > > It is in the "don’t do mediation" category, yes, due to all this
> > > > > > > > > weirdness that has been asked for.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > > and to not break TDISP efforts in upcoming time
> > > > > > > > > > > > > > by such
> > > > design.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Look you either stop mentioning TDISP as
> > > > > > > > > > > > > motivation or actually try to address it. Safe
> > > > > > > > > > > > > migration with TDISP is really
> > > > hard.
> > > > > > > > > > > > But that is not an excuse to say that TDISP
> > > > > > > > > > > > migration is not present, hence
> > > > > > > > > > > involve the owner device for config space access.
> > > > > > > > > > > > This is another hurdle added that further blocks us
> > > > > > > > > > > > away from
> > > > TDISP.
> > > > > > > > > > > > Hence, we don’t want to take the route of involving
> > > > > > > > > > > > owner device for any
> > > > > > > > > > > config access.
> > > > > > > > > > >
> > > > > > > > > > > This "blocks" is all just wild hunches. hypervisor
> > > > > > > > > > > controls some aspects of TDISP devices for sure -
> > > > > > > > > > > maybe we actually should use pci config space as that
> > > > > > > > > > > is generally hypervisor
> > > > controlled.
> > > > > > > > > > Even bad to do hypercalls.
> > > > > > > > > > I showed you last time the role of the PCI config space
> > > > > > > > > > snippet from the
> > > > > > > spec.
> > > > > > > > >
> > > > > > > > > Yes I remember. This is just an example though. My point
> > > > > > > > > is maybe it is solvable maybe it is not.
> > > > > > > > >
> > > > > > > > > > Do you see we are repeating the discussion again?
> > > > > > > > >
> > > > > > > > > One of the reasons is that people bring up irrelevances.
> > > > > > > > > TDISP is important but has to be addressed or deferred not
> > > > > > > > > vaguely referred
> > > > to.
> > > > > > > >
> > > > > > > > So lets continue to follow the current TDISP direction of
> > > > > > > > not involving
> > > > > > > hypervisor for virtio common and device config.
> > > > > >
> > > > > > If you disagree to it, please speak now, so that we don’t debate
> > > > > > on this
> > > > again in next 3 days.
> > > > > > Because this is the fundamental design considerations it relied on.
> > > > > > There is no point going forward if you want to disagree to it.
> > > > > > Other variants are fine, but other variants cannot be the only choice.
> > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > For example, your current patches are clearly broken for
> > TDISP:
> > > > > > > > > > > > > owner can control queue state at any time making
> > > > > > > > > > > > > device modify memory in any way it wants.
> > > > > > > > > > > > >
> > > > > > > > > > > > When TDISP migration is needed, the admin device can
> > > > > > > > > > > > be another TVM
> > > > > > > > > > > outside the HV scope.
> > > > > > > > > > > > Or an alternative would have device context
> > > > > > > > > > > > encrypted not visible to HV at
> > > > > > > > > all.
> > > > > > > > > > >
> > > > > > > > > > > Maybe. Fact remains your patches do conflict with
> > > > > > > > > > > TDISP and you seem to be fine with it because you have
> > > > > > > > > > > a hunch you can
> > > > fix it.
> > > > > > > > > > > But we can't do development based on your hunches.
> > > > > > > > > > >
> > > > > > > > > > We have different view.
> > > > > > > > > > My patches do not conflict with TDISP because TDISP has
> > > > > > > > > > clear definition of
> > > > > > > > > not involving hypervisor for transport.
> > > > > > > > > > And that part is still preserved.
> > > > > > > > > > Delegating the migration to another TDISP or encrypting
> > > > > > > > > > is yet to be
> > > > > > > defined.
> > > > > > > > > > And current patches will align to both the approaches in
> > future.
> > > > > > > > > >
> > > > > > > > > > So you need to re-evaluate your judgment.
> > > > > > > > >
> > > > > > > > > If you like they do not "conflict".  But if used with
> > > > > > > > > TDISP they just make it insecure and thus completely
> > > > > > > > > worthless.  If hypervisor can change ring state to make
> > > > > > > > > device poke at random guest memory then it is game over
> > > > > > > > > and all the effort spent was
> > > > security theater.
> > > > > > > > Not really, I proposed two options.
> > > > > > > > 1. delegate the task of LM to the TVM. (proposed by two cpu
> > vendors).
> > > > > > > > In this case all the infra we build here, just works fine.
> > > > > > >
> > > > > > > I think modification will be needed: currently commands are
> > > > > > > sent through the PF, and that is under hypervisor control.
> > > > > > > You should not assign PF to TVM.
> > > >
> > > > That's the point. And that's why it keeps people confused to believe
> > > > the current PF/adminq can work in the TDISP.
> > > >
> > > There is no confusion.
> >
> > No, when LingShan points out the conflict, you just told us it will be
> > addressed in the future.
> >
> Current proposal does not punch the hole in the TDISP TVM interface.

So you meant adminq in PF works in TDISP?

>
> > And after Michael pointed it out again, you agreed that the adminq cannot be
> > part of the PF in this context.
> >
> As explained, when TDISP for device migration evolves, there will be another trusted entity which will do it.

Platforms have evolved to support dirty page tracking. Let's use that.

>
>
> > And you miss the fact that admin virtqueue today can't be used to manage
> > the owner.
> >
> There is no such need for device migration.

You just agreed that in order to migrate to TDISP, admin can't sit in
PF, so where is it?

>
> > > The admin queue interface ensures first step that TDISP interface is
> > dedicated to guest as today.
> > > There is no bifurcation added on the VF that needs extra mediation.
> > >
> > > > > > Yes, an admin virtio function will be there which will do the
> > > > > > admin
> > > > commands listed.
> > > > >
> > > > > So it can't be PF, so at least we need a new group type.
> > > > > I am inclined to then say, operate it through VF itself.
> > > >
> > > > So it exactly matches the idea of transport virtqueue (a per VF/SF one).
> > > >
> > > There is no need for transport virtqueue for VF as VF device has same
> > uniform principle as PF.
> >
> > I don't understand this. I have explained that you have invented a functional
> > duplicate of the transport virtqueue.
> >
> No. we haven’t.
> It is explained in other thread that functionality is different.
>
> > > If you want transport vq, please have it on the PF too.
> >
> > Nothing prevents this; actually, the transport virtqueue started from this.
> >
> > > And that also is not needed because there is already CVQ.
> >
> > I don't see why you keep mentioning CVQ. I don't see anyone that says
> > transport virtqueue is going to replace CVQ.
> >
> I explained everything in the doc. This is the Nth time being repeated...

The doc is wrong since it assumes that people worry that transport
virtqueue will replace cvq.

>
> > >
> > > > But it still requires a PCI part to bootstrap.
> > > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > > It also does not require any hypervisor mediation for control
> > plane.
> > > > > > > >
> > > > > > > > 2. Encrypt the owner device workload to be not seen by
> > > > > > > > hypervisor
> > > > > > > >
> > > > > > > > > Neither method affects the current direction.
> > > > > > > >
> > > > > > > > But if we force trap+emulation, it is 100% broken for TDISP.
> > > > > > > > And I would not promote that.
> > > > > > > >
> > > > > > > > > But you know this, don't you? This is why you mentioned
> > > > > > > > > encrypting
> > > > device.
> > > > > > > > > Maybe that works. It just does not work *as is*.
> > > > > > > > It works as_is. But current infrastructure does not block
> > > > > > > > the future
> > > > work.
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Such encryption is not possible, with the
> > > > > > > > > > > > trap+emulation method, where HV
> > > > > > > > > > > will have to decrypt the data coming over MMIO writes.
> > > > > > > > > > >
> > > > > > > > > > > > I don't know what trap+emulation has to do with it. Do
> > > > > > > > > > > > you refer to the shadow vq thing?
> > > > > > > > > >
> > > > > > > > > > The method proposed here does not hinder any TDISP
> > direction.
> > > > > > > > >
> > > > > > > > > direction? No, why would it. we can always add more
> > > > > > > > > commands that are safe for TDISP. commands you propose
> > > > > > > > > here are unsafe for
> > > > TDISP.
> > > > > > > > >
> > > > > > > > > > Without my proposal, do you have a method that does not
> > > > > > > > > > involve hypervisor
> > > > > > > > > intervention for virtio common and device config space,
> > > > > > > > > cvq and shadow
> > > > > > > vq?
> > > > > > > > > > If so, I would like to hear that as well because that
> > > > > > > > > > will align with
> > > > TDISP.
> > > > > > > > >
> > > > > > > > > I really did not give it much thought.  I suspect for
> > > > > > > > > TDISP it just might be cleaner to have guest agent migrate
> > device.
> > > > > > > > > Certainly removes all
> > > > > > > the messy questions.
> > > > > > > > > > That, to me, implies there needs to be a way to send
> > > > > > > > > migration commands through VF itself. Does this "involve
> > > > > > > > > hypervisor intervention"? No one should care I think.
> > > > > > > > Too far of the future to envision. May be yes. When such
> > > > > > > > platform is built, for sure whoever migrates need migrate
> > > > > > > > its device side
> > > > too.
> > > > > > > > Some knowledge of migration driver is needed.
> > > > > > >
> > > > > > > So TDISP migration is so far in the future you do not need to
> > > > > > > bother
> > > > about it.
> > > > > > > Fine. Then don't bring it up pls.
> > > > > > >
> > > > > > As long as we are aligned to the requirement that a virtio
> > > > > > member device is
> > > > mapped to the guest VM without mediating the virtio interface, I am
> > good.
> > > > > > Again, other variants are fine, but above listed mapped variant
> > > > > > is the
> > > > minimum variant needed.
> > > > >
> > > > > I think it's worth supporting this. I wouldn't call this minimum
> > > > > there are other approaches.  And I am not so sure it's worth
> > > > > trying to support this in all kind of systems such as IOMMU
> > > > > without dirty bit support. If some old systems will need
> > > > > mediation, this is kind of like legacy interface. Not a big deal.
> > > > >
> > > >
> > > > +1
> > >
> > > There are users with the recent cpus that may not have the IOMMU dirty
> > page tracking support.
> > > So I don’t fully agree.
> >
> > There are setups that don't have SR-IOV or even PCI.
> >
> Which are those? Please expand those transports that need it.

For transport/platform without SR-IOV, you want to expand the
transport no matter whether or not it is possible.
For dirty page tracking, you want to do it in virtio no matter how
long it has been implemented by various platforms.

If you are correct, I can say, please wait for the platform support
for dirty page tracking.

Thanks





>
> > Let's have a unified standard.



^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-22  6:05                                                                                                 ` Parav Pandit
@ 2023-11-24  3:40                                                                                                   ` Jason Wang
  0 siblings, 0 replies; 341+ messages in thread
From: Jason Wang @ 2023-11-24  3:40 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin,
	virtio-comment@lists.oasis-open.org, cohuck@redhat.com,
	sburla@marvell.com, Shahaf Shuler, Maor Gottlieb, Yishai Hadas

On Wed, Nov 22, 2023 at 2:05 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 10:57 AM
> >
> > On Wed, Nov 22, 2023 at 12:32 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 12:54 PM
> > > >
> > > > On Thu, Nov 16, 2023 at 1:28 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Thursday, November 16, 2023 9:50 AM
> > > > > >
> > > > > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> > > > wrote:
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Monday, November 13, 2023 9:03 AM
> > > > > > > >
> > > > > > > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Tuesday, November 7, 2023 9:35 AM
> > > > > > > > > >
> > > > > > > > > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit
> > > > > > > > > > <parav@nvidia.com>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> >
> > [...]
> >
> > > > > >
> > > > > I disagree.
> > > > > There are two different functionalities.
> > > > >
> > > > > Functionality_1: explicit ask for read or write
> > > > > Functionality_2: read what has changed
> > > >
> > > > This needs to be justified. I won't repeat the questions again here.
> > > >
> > > As explained the use case in theory of operation already.
> > >
> > > > >
> > > > > Should one merge 1 and 2 and complicate the command?
> > > > > I prefer not to.
> > > >
> > > > Again there're functional duplications. E.g your command duplicates
> > > > common_cfg for sure.
> > > Nope, it does not.
> > > Common cfg is accessed directly by guest member driver.
> >
> > It can be accessed directly, if we have adminq per VF.
> Sure, instead I proposed the cvq as there is no need to burn another queue.

Please, explain how CVQ is related here.

>
> >
> > >
> > > >
> > > > >
> > > > > Now having two different commands help for debugging to
> > > > > differentiate between mgmt. commands and guest initiated commands.
> > > > > :)
> > > > >
> > > > > > >
> > > > > > > > Guest configure the following one by one:
> > > > > > > >
> > > > > > > > 1) vq size
> > > > > > > > 2) vq addresses
> > > > > > > > 3) MSI-X
> > > > > > > >
> > > > > > > > etc?
> > > > > > > >
> > > > > > > I think you interpreted "incremental" differently than I described.
> > > > > > > In the device context read, the incremental is:
> > > > > > >
> > > > > > > If the hypervisor driver has read the device context twice,
> > > > > > > the second read
> > > > > > won't return any new data if nothing changed.
> > > > > >
> > > > > > See above.
> > > > > >
> > > > > Yeah, two separate commands needed.
> > > > >
> > > > > > > > For example, if RSS configuration didn’t change between two
> > > > > > > > reads, the second read won't return the TLV for RSS Context.
> > > > > > >
> > > > > > > While for transport the need is, when guest asked, one device
> > > > > > > must read it
> > > > > > regardless of the change.
> > > > > > >
> > > > > > > So notion of incremental is not by address, but by the value.
> > > > > > >
> > > > > > > > > For example, VQ configuration is exchanged once between
> > > > > > > > > src and
> > > > dst.
> > > > > > > > > But VQ avail and used index may be updated multiple times.
> > > > > > > >
> > > > > > > > If it can work with multiple times of updating, why can't it
> > > > > > > > work if we just update it once?
> > > > > > > Functionally it can work.
> > > > > >
> > > > > > I think you answer yourself.
> > > > > >
> > > > > Yes, I don’t like abuse of command.
> > > >
> > > > How did you define abuse or can spec ever need to define that?
> > > I don’t have any different definition than dictionary definition for
> > > abuse. :)
> >
> > "Abuse" is pretty subjective. The spec has driver normative statements; please
> > explain whether and how you can use those.
> >
> :)
>
> So be it.
>
> Spec does not need to define the dictionary.
>
> For explicit read and write simple read/write commands are right to me.
>
> > >
> > > >
> > > > >
> > > > > > > Performance wise, one does not want to update multiple times,
> > > > > > > unless there
> > > > > > is a change.
> > > > > > >
> > > > > > > Read as explained above is not meant to return same content again.
> > > > > > >
> > > > > > > >
> > > > > > > > > So here hypervisor do not want to read any specific set of
> > > > > > > > > fields and
> > > > > > > > hypervisor is not parsing them either.
> > > > > > > > > It is just a byte stream for it.
> > > > > > > >
> > > > > > > > Firstly, spec must define the device context format, so
> > > > > > > > hypervisor can understand which byte is what otherwise you
> > > > > > > > can't maintain migration compatibility.
> > > > > > > Device context is defined already in the latest version.
> > > > > > >
> > > > > > > > Secondly, you can't mandate how the hypervisor is written.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > As opposed to that, in case of transport, the guest
> > > > > > > > > explicitly asks to read or
> > > > > > > > write specific bytes.
> > > > > > > > > Therefore, it is not incremental.
> > > > > > > >
> > > > > > > > I'm totally lost. Which part of the transport is not incremental?
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Additionally, if hypervisor has put the trap on virtio
> > > > > > > > > config, and because the memory device already has the
> > > > > > > > > interface for virtio config,
> > > > > > > > >
> > > > > > > > > Hypervisor can directly write/read from the virtual config
> > > > > > > > > to the member's
> > > > > > > > config space, without going through the device context, right?
> > > > > > > >
> > > > > > > > If it can do it or it can choose to not. I don't see how it
> > > > > > > > is related to the discussion here.
> > > > > > > >
> > > > > > > It is. I don’t see a point of hypervisor not using the native
> > > > > > > interface provided
> > > > > > by the member device.
> > > > > >
> > > > > > It really depends on the case, and I see how it duplicates with
> > > > > > the functionality that is provided by both:
> > > > > >
> > > > > > 1) The existing PCI transport
> > > > > >
> > > > > > or
> > > > > >
> > > > > > 2) The transport virtqueue
> > > > > >
> > > > > I would like to conclude that we disagree in our approaches.
> > > > > PCI transport is for member device to directly communicate from
> > > > > guest
> > > > driver to the device.
> > > > > This is uniform across PF, VFs, SIOV.
> > > >
> > > > > For "PCI transport" did you mean the one defined in spec? If yes,
> > > > how can it work with SIOV with what you're saying here (a direct
> > > > communication channel)?
> > > >
> > > SIOV device may have same MMIO as VF.
> >
> > We circle back, SIOV is for scalability. If you claim registers don't scale, SIOV via
> > MMIO doesn't scale either.
> Hence, any new and slow things to stay off the MMIO.
> And for currently defined things, Lingshan will show the performance numbers of why they should be transported via a virtqueue.
>
> >
> > >
> > > > >
> > > > > Admin commands are transport independent and their task is device
> > > > migration.
> > > > > One is not replacing the other.
> > > > >
> > > > > Transport virtqueue will never transport driver notifications,
> > > > > hence it does
> > > > not qualify at "transport".
> > > >
> > > > Another double standard.
> > > I disagree. You coined the term transport vq, so stand behind it to transport
> > everything.
> >
> > Nope, it works like other transport. It doesn't aim to replace any existing
> > transport.
> >
> > >
> > > >
> > > > MMIO will never transport device notification, hence it does not
> > > > qualify as "transport"?
> > > >
> > > How does interrupts work?
> >
> > It depends on the platform, no?
> I don’t know, how does a MMIO virtio hw device deliver an interrupt for x86 cpu?

It uses platform or architecture specific mechanisms.

>
> >
> > > Seems like missing basic functionality in transport.
> >
> > Not necessarily the charge of a transport. Virtio transport can't fly without a
> > platform.
> What is the equivalent of msix in mmio?

MSI is not unique to PCI, it can be used for MMIO as well if the
platform or architecture supports that.

> >
> > >
> > > > >
> > > > > For the vdpa case, there is no need for extra admin commands as
> > > > > the
> > > > mediation layer can directly use the interface available from the
> > > > member device itself.
> > > > >
> > > > > You continue to want to overload admin commands for dual purpose,
> > > > > does
> > > > not make sense to me.
> > > > >
> > > > > > >
> > > > > > >  > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > it is not good idea to overload management
> > > > > > > > > > > > > commands with actual run time
> > > > > > > > > > > > guest commands.
> > > > > > > > > > > > > The device context read writes are largely for
> > > > > > > > > > > > > incremental
> > > > updates.
> > > > > > > > > > > >
> > > > > > > > > > > > It doesn't matter if it is incremental or not but
> > > > > > > > > > > >
> > > > > > > > > > > It does because you want different functionality only
> > > > > > > > > > > for purpose of backward
> > > > > > > > > > compatibility.
> > > > > > > > > > > That also if the device does not offer them as portion
> > > > > > > > > > > of MMIO
> > > > BAR.
> > > > > > > > > >
> > > > > > > > > > I don't see how it is related to the "incremental part".
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 1) the function is there
> > > > > > > > > > > > 2) hypervisor can use that function if they want and
> > > > > > > > > > > > virtio
> > > > > > > > > > > > (spec) can't forbid that
> > > > > > > > > > > >
> > > > > > > > > > > It is not about forbidding or supporting.
> > > > > > > > > > > Its about what functionality to use for management
> > > > > > > > > > > plane and guest
> > > > > > > > plane.
> > > > > > > > > > > Both have different needs.
> > > > > > > > > >
> > > > > > > > > > People can have different views, there's nothing we can
> > > > > > > > > > prevent a hypervisor from using it as a transport as far as I can
> > see.
> > > > > > > > > For device context write command, it can be used (or
> > > > > > > > > probably
> > > > > > > > > abused) to do
> > > > > > > > write but I fail to see why to use it.
> > > > > > > >
> > > > > > > > The function is there, you can't prevent people from doing that.
> > > > > > > >
> > > > > > > One can always mess up itself. :) It is not prevented. It is
> > > > > > > just not right way to use the interface.
> > > > > > >
> > > > > > > > > Because member device already has the interface to do
> > > > > > > > > config read/write and
> > > > > > > > it is accessible to the hypervisor.
> > > > > > > >
> > > > > > > > Well, it looks self-contradictory again. Are you saying
> > > > > > > > another set of commands that is similar to device context is
> > > > > > > > needed for non-PCI
> > > > > > transport?
> > > > > > > >
> > > > > > > > All this non-PCI transport discussion is just meaningless.
> > > > > > > > Let MMIO bring in the concept of a member device; at that point
> > > > > > > > there will be something that makes sense to discuss.
> > > > > >
> > > > > > It's not necessarily MMIO. For example the SIOV, which I don't
> > > > > > think can use the existing PCI transport.
> > > > > >
> > > > > > > PCI SIOV is also the PCI device at the end.
> > > > > >
> > > > > > We don't want to end up with two sets of commands to save/load
> > > > > > SRIOV and SIOV at least.
> > > > > >
> > > > > This proposal ensures that SRIOV and SIOV devices are treated equally.
> > > >
> > > > How? Did you mean your proposal can work for SIOV? What's the
> > > > transport then?
> > > Yes. All majority of the device contexts should work for SIOV device as_is.
> > > Member id would be different.
> > > Some device context TLVs may be new as SIOV may have some
> > simplifications as it may not have the giant register space like current one.
> >
> > You only explain the migration part but not the transport part.
> >
> Because SIOV is still under construction in community.

Transport virtqueue is not just designed for SIOV.

> There is no point of defining SIOV transport for some half-cooked spec.

Then let's not claim your proposal is suitable for TDISP. The live migration
of TDISP is even less ready than half-cooked.

Thanks


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-22  7:48                                                                                                     ` Michael S. Tsirkin
@ 2023-11-24  3:56                                                                                                       ` Jason Wang
  2023-11-24  5:40                                                                                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-24  3:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Wed, Nov 22, 2023 at 3:48 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Nov 22, 2023 at 12:13:34PM +0800, Jason Wang wrote:
> > > > What's wrong if we just allow them to be R/W over adminq/cmmands?
> > > >
> > > As explained before,
> > > Each guest has its own dedicated non mediated interface as defined in virtio spec to not involve hypervisor.
> >
> > So what's wrong with inventing per VF queue to do that? For example
> > transport virtqueue.
>
> Nothing is wrong with this.
>
> But what is problematic is just re-using config space for migration because

It's not reuse; it works exactly like this proposal:

1) VF config space is assigned to guest
2) using PF queue to migrate

The only difference is the command:

1) in this proposal, it is the virtio_dev_ctx_pci_vq_cfg structure
2) in transport virtqueue, it is a set of commands to access
one or several fields of the common cfg
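
The contrast drawn above can be sketched in C. This is only an
illustration under stated assumptions: the struct layouts, field names
and helper functions below are hypothetical and are not taken from the
patch series or from any transport virtqueue draft. The fields loosely
mirror the queue_* fields of the PCI common configuration. The sketch
shows that restoring one whole structure and issuing per-field writes
can leave the same virtqueue configuration behind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot of one virtqueue's configuration. */
struct vq_cfg_snapshot {
    uint16_t index;
    uint16_t size;
    uint16_t msix_vector;
    uint16_t enable;
    uint64_t desc;
    uint64_t driver;
    uint64_t device;
};

static_assert(sizeof(struct vq_cfg_snapshot) == 32,
              "layout assumed to have no implicit padding");

/* Hypothetical per-field access command (second style). */
struct field_write_cmd {
    uint16_t offset;   /* byte offset of the field */
    uint16_t len;      /* field length in bytes, at most 8 */
    uint8_t  data[8];  /* field value in host byte order (sketch only) */
};

/* First style: restore the whole structure in one step. */
static void apply_snapshot(struct vq_cfg_snapshot *cfg,
                           const struct vq_cfg_snapshot *src)
{
    *cfg = *src;
}

/* Build a field write command for one field of the snapshot. */
static struct field_write_cmd make_field_write(size_t offset,
                                               const void *val, size_t len)
{
    struct field_write_cmd cmd;

    memset(&cmd, 0, sizeof(cmd));
    cmd.offset = (uint16_t)offset;
    cmd.len = (uint16_t)len;
    if (len <= sizeof(cmd.data))
        memcpy(cmd.data, val, len);
    return cmd;
}

/* Second style: apply one field write; returns 0, or -1 if out of range. */
static int apply_field_write(struct vq_cfg_snapshot *cfg,
                             const struct field_write_cmd *cmd)
{
    if (cmd->len > sizeof(cmd->data) ||
        (size_t)cmd->offset + cmd->len > sizeof(*cfg))
        return -1;
    memcpy((uint8_t *)cfg + cmd->offset, cmd->data, cmd->len);
    return 0;
}
```

In this sketch the two styles differ only in granularity: one command
carrying the structure versus several commands carrying one field each.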

Thanks

> it means we can not just say "don't access device after it is stopped"
> because yes you need to access it to save/restore state.
> And a new interface over admin cmds just for this side-steps the
> issue nicely.
>
> --
> MST
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-24  3:56                                                                                                       ` Jason Wang
@ 2023-11-24  5:40                                                                                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-24  5:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 24, 2023 at 11:56:16AM +0800, Jason Wang wrote:
> On Wed, Nov 22, 2023 at 3:48 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Nov 22, 2023 at 12:13:34PM +0800, Jason Wang wrote:
> > > > > What's wrong if we just allow them to be R/W over adminq/cmmands?
> > > > >
> > > > As explained before,
> > > > Each guest has its own dedicated non mediated interface as defined in virtio spec to not involve hypervisor.
> > >
> > > So what's wrong with inventing per VF queue to do that? For example
> > > transport virtqueue.
> >
> > Nothing is wrong with this.
> >
> > But what is problematic is just re-using config space for migration because
> 
> It's not a reusing, it works exactly like this proposal:
> 
> 1) VF config space is assigned to guest
> 2) using PF queue to migrate
> 
> The only difference is the command:
> 
> In this proposal, it is
> 
> 1) virtio_dev_ctx_pci_vq_cfg structure
> 2) in transport virtqueue, it introduce a set of commands to access
> one or several fields on the common cfg
> 
> Thanks

The problem with 2) is that no one seems to be building it right now,
so I'm not sure we can, with a straight face, require people to
use infrastructure which does not exist.
And the need this patchset is trying to address is real. So I think
we should judge this proposal on its own merits, not on how well
it compares with a theoretical transport virtqueue.

> > it means we can not just say "don't access device after it is stopped"
> > because yes you need to access it to save/restore state.
> > And a new interface over admin cmds just for this side-steps the
> > issue nicely.
> >
> > --
> > MST
> >




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-24  3:07                                                                                                           ` Jason Wang
@ 2023-11-24 11:38                                                                                                             ` Michael S. Tsirkin
  2023-11-24 11:51                                                                                                               ` Jason Wang
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-24 11:38 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 24, 2023 at 11:07:53AM +0800, Jason Wang wrote:
> How TDISP/attestation is related to a specific device driver? How can
> a legacy DPDK driver break things like attestation here?
> 
> > Hence it is not applicable.
> 
> Wow. The legacy tunnel was just invented by you for just several
> months and soon became a second-class citizen in the proposal here?

Legacy is not going to be a 1st class citizen and that was a
conscious decision the TC made. In particular we know straight
away that there is no way to make them work safely
while preserving the assumptions confidential computing
guests make (which I guess is what you mean here).
The whole point of isolating the legacy mess in the special
commands was so we don't try to support them going forward;
please don't try to add new features for legacy interfaces.

-- 
MST




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-24 11:38                                                                                                             ` Michael S. Tsirkin
@ 2023-11-24 11:51                                                                                                               ` Jason Wang
  2023-11-24 12:10                                                                                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Jason Wang @ 2023-11-24 11:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 24, 2023 at 7:38 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Nov 24, 2023 at 11:07:53AM +0800, Jason Wang wrote:
> > How TDISP/attestation is related to a specific device driver? How can
> > a legacy DPDK driver break things like attestation here?
> >
> > > Hence it is not applicable.
> >
> > Wow. The legacy tunnel was just invented by you for just several
> > months and soon became a second-class citizen in the proposal here?
>
> Legacy is not going to be a 1st class citizen and that was a
> concious decision the TC made.
> In particular we know straight
> away that there is no way to make them safely work
> while preserving assumptions confidential computing
> guests make (which I guess is what you mean here).

Probably, but I think we need to support its migration without TD.

> The whole point of isolating legacy mess in the special
> commands was so we don't try to support them going forward,
> don't try to add new features for legacy interfaces please.

Thanks


>
> --
> MST
>




^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-24 11:51                                                                                                               ` Jason Wang
@ 2023-11-24 12:10                                                                                                                 ` Michael S. Tsirkin
  2023-11-24 12:13                                                                                                                   ` Parav Pandit
  0 siblings, 1 reply; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-24 12:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 24, 2023 at 07:51:06PM +0800, Jason Wang wrote:
> On Fri, Nov 24, 2023 at 7:38 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Nov 24, 2023 at 11:07:53AM +0800, Jason Wang wrote:
> > > How TDISP/attestation is related to a specific device driver? How can
> > > a legacy DPDK driver break things like attestation here?
> > >
> > > > Hence it is not applicable.
> > >
> > > Wow. The legacy tunnel was just invented by you for just several
> > > months and soon became a second-class citizen in the proposal here?
> >
> > Legacy is not going to be a 1st class citizen and that was a
> > concious decision the TC made.
> > In particular we know straight
> > away that there is no way to make them safely work
> > while preserving assumptions confidential computing
> > guests make (which I guess is what you mean here).
> 
> Probably, but I think we need to support its migration without TD.

Ah, migration of legacy guests is an interesting point.
Bringing up TD was just confusing.

I expect this to be rather painless to add, precisely
because this proposal does not rely on either the modern
or the legacy pci layout. But the details need to be thought
through, I agree. I don't see why this is not applicable -
people are definitely used to being able to migrate these
guests.


> > The whole point of isolating legacy mess in the special
> > commands was so we don't try to support them going forward,
> > don't try to add new features for legacy interfaces please.
> 
> Thanks
> 
> 
> >
> > --
> > MST
> >




^ permalink raw reply	[flat|nested] 341+ messages in thread

* RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-24 12:10                                                                                                                 ` Michael S. Tsirkin
@ 2023-11-24 12:13                                                                                                                   ` Parav Pandit
  2023-11-24 12:19                                                                                                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 341+ messages in thread
From: Parav Pandit @ 2023-11-24 12:13 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 24, 2023 5:40 PM
> 
> On Fri, Nov 24, 2023 at 07:51:06PM +0800, Jason Wang wrote:
> > On Fri, Nov 24, 2023 at 7:38 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> > >
> > > On Fri, Nov 24, 2023 at 11:07:53AM +0800, Jason Wang wrote:
> > > > How TDISP/attestation is related to a specific device driver? How
> > > > can a legacy DPDK driver break things like attestation here?
> > > >
> > > > > Hence it is not applicable.
> > > >
> > > > Wow. The legacy tunnel was just invented by you for just several
> > > > months and soon became a second-class citizen in the proposal here?
> > >
> > > Legacy is not going to be a 1st class citizen and that was a
> > > concious decision the TC made.
> > > In particular we know straight
> > > away that there is no way to make them safely work while preserving
> > > assumptions confidential computing guests make (which I guess is
> > > what you mean here).
> >
> > Probably, but I think we need to support its migration without TD.
> 
> Ah, migration of legacy guests is an interesting point.
> Bringing up TD was just confusing.
> 
> I expect this to be rather painless to add precisely because this proposal does
> not reply on either modern or legacy pci layout. But details need to be
> thought through, I agree. I don't see why this is not applicable - definitely
> people are used to be able to migrate these guests.
>
The current layout in the device context is for the modern interface.
The legacy one will be just another enum and struct, to be added subsequently.
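
A minimal sketch of what "another enum and struct" could look like,
assuming the device context is a stream of type/length/value records
(TLVs). The enum values, header layout and helper names here are
hypothetical and are not taken from the patch series. The point is only
that a reader walking the stream can tell modern and legacy records
apart by type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record types; a legacy layout is one more value. */
enum dev_ctx_type {
    DEV_CTX_PCI_VQ_CFG_MODERN = 1,
    DEV_CTX_PCI_VQ_CFG_LEGACY = 2,
};

/* Hypothetical TLV header that precedes every record payload. */
struct dev_ctx_tlv_hdr {
    uint16_t type;
    uint16_t reserved;
    uint32_t len;      /* payload length in bytes */
};

/*
 * Append one record at offset 'used' in a buffer of 'cap' bytes.
 * Returns the new used size, or 0 if the record does not fit.
 */
static size_t tlv_append(uint8_t *buf, size_t cap, size_t used,
                         uint16_t type, const void *payload, uint32_t len)
{
    struct dev_ctx_tlv_hdr hdr = { .type = type, .reserved = 0, .len = len };

    if (used > cap || cap - used < sizeof(hdr) ||
        cap - used - sizeof(hdr) < len)
        return 0;
    memcpy(buf + used, &hdr, sizeof(hdr));
    if (len)
        memcpy(buf + used + sizeof(hdr), payload, len);
    return used + sizeof(hdr) + len;
}

/*
 * Count the records of a given type, skipping all other types.
 * Returns -1 if the stream is truncated.
 */
static int tlv_count(const uint8_t *buf, size_t used, uint16_t type)
{
    size_t off = 0;
    int n = 0;

    while (off < used) {
        struct dev_ctx_tlv_hdr hdr;

        if (used - off < sizeof(hdr))
            return -1;
        memcpy(&hdr, buf + off, sizeof(hdr));
        off += sizeof(hdr);
        if (used - off < hdr.len)
            return -1;
        if (hdr.type == type)
            n++;
        off += hdr.len;
    }
    return n;
}
```

Because a reader skips record types it is not looking for, a stream
could carry the modern record, the legacy record, or both. Which of
these a transitional device should send is not settled by this sketch.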
 
> 
> > > The whole point of isolating legacy mess in the special commands was
> > > so we don't try to support them going forward, don't try to add new
> > > features for legacy interfaces please.
> >
> > Thanks
> >
> >
> > >
> > > --
> > > MST
> > >


^ permalink raw reply	[flat|nested] 341+ messages in thread

* Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  2023-11-24 12:13                                                                                                                   ` Parav Pandit
@ 2023-11-24 12:19                                                                                                                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 341+ messages in thread
From: Michael S. Tsirkin @ 2023-11-24 12:19 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, virtio-comment@lists.oasis-open.org,
	cohuck@redhat.com, sburla@marvell.com, Shahaf Shuler,
	Maor Gottlieb, Yishai Hadas

On Fri, Nov 24, 2023 at 12:13:07PM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, November 24, 2023 5:40 PM
> > 
> > On Fri, Nov 24, 2023 at 07:51:06PM +0800, Jason Wang wrote:
> > > On Fri, Nov 24, 2023 at 7:38 PM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > >
> > > > On Fri, Nov 24, 2023 at 11:07:53AM +0800, Jason Wang wrote:
> > > > > How TDISP/attestation is related to a specific device driver? How
> > > > > can a legacy DPDK driver break things like attestation here?
> > > > >
> > > > > > Hence it is not applicable.
> > > > >
> > > > > Wow. The legacy tunnel was just invented by you for just several
> > > > > months and soon became a second-class citizen in the proposal here?
> > > >
> > > > Legacy is not going to be a 1st class citizen and that was a
> > > > concious decision the TC made.
> > > > In particular we know straight
> > > > away that there is no way to make them safely work while preserving
> > > > assumptions confidential computing guests make (which I guess is
> > > > what you mean here).
> > >
> > > Probably, but I think we need to support its migration without TD.
> > 
> > Ah, migration of legacy guests is an interesting point.
> > Bringing up TD was just confusing.
> > 
> > I expect this to be rather painless to add precisely because this proposal does
> > not reply on either modern or legacy pci layout. But details need to be
> > thought through, I agree. I don't see why this is not applicable - definitely
> > people are used to be able to migrate these guests.
> >
> The current layout in device context is for the modern.
> The legacy one will be just another enum and struct that will be added subsequently.

Worth adding. The difficulty is when to send which state for
transitional devices. E.g. do they just always send both?
Don't shoot from the hip; please think about this and
address it in the next version.


> > 
> > > > The whole point of isolating legacy mess in the special commands was
> > > > so we don't try to support them going forward, don't try to add new
> > > > features for legacy interfaces please.
> > >
> > > Thanks
> > >
> > >
> > > >
> > > > --
> > > > MST
> > > >
> 




^ permalink raw reply	[flat|nested] 341+ messages in thread

end of thread, other threads:[~2023-11-24 12:20 UTC | newest]

Thread overview: 341+ messages
2023-10-08 11:25 [virtio-comment] [PATCH v1 0/8] Introduce device migration support commands Parav Pandit
2023-10-08 11:25 ` [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration Parav Pandit
2023-10-09  8:49   ` Jason Wang
2023-10-09 10:06     ` Parav Pandit
2023-10-10  5:51       ` Jason Wang
2023-10-10  7:19         ` Parav Pandit
2023-10-10 12:41           ` Michael S. Tsirkin
2023-10-10 13:08             ` Parav Pandit
2023-10-10 14:00               ` Michael S. Tsirkin
2023-10-10 14:09                 ` Parav Pandit
2023-10-10 14:55                   ` Michael S. Tsirkin
2023-10-11  3:14           ` Jason Wang
2023-10-11  6:02             ` Michael S. Tsirkin
2023-10-11 10:47             ` Parav Pandit
2023-10-11 20:14               ` Michael S. Tsirkin
2023-10-12 10:21                 ` Parav Pandit
2023-10-13  1:15               ` Jason Wang
2023-10-13  6:36                 ` Parav Pandit
2023-10-17  1:41                   ` Jason Wang
2023-10-18  8:16                     ` Parav Pandit
2023-10-18 10:19                       ` Michael S. Tsirkin
2023-10-18 10:33                         ` Parav Pandit
2023-10-19  2:41                       ` Jason Wang
2023-10-13 11:41                 ` Michael S. Tsirkin
2023-10-09 12:02     ` Parav Pandit
2023-10-09 16:19       ` Michael S. Tsirkin
2023-10-09 17:21         ` Parav Pandit
2023-10-10  8:57           ` Zhu, Lingshan
2023-10-10  9:40             ` Parav Pandit
2023-10-11 10:25               ` Zhu, Lingshan
2023-10-11 11:43                 ` Parav Pandit
2023-10-12 10:21                   ` Zhu, Lingshan
2023-10-12 10:58                     ` Parav Pandit
2023-10-12 11:17                       ` Michael S. Tsirkin
2023-10-12 11:47                         ` Parav Pandit
2023-10-12 13:05                           ` Michael S. Tsirkin
2023-10-13  1:16                       ` Jason Wang
2023-10-13  6:36                         ` Parav Pandit
2023-10-17  1:53                           ` Jason Wang
2023-10-17  2:02                             ` Jason Wang
2023-10-17  3:19                               ` Parav Pandit
2023-10-17  3:26                             ` Parav Pandit
2023-10-18  0:52                               ` Jason Wang
2023-10-18  4:30                                 ` Parav Pandit
2023-10-18  6:14                                   ` Michael S. Tsirkin
2023-10-18  6:26                                     ` Parav Pandit
2023-10-19  2:41                                   ` Jason Wang
2023-10-19  4:29                                     ` Parav Pandit
2023-10-19  4:44                                       ` Jason Wang
2023-10-19  5:31                                         ` Parav Pandit
2023-10-19  6:35                                           ` Michael S. Tsirkin
2023-10-19  7:30                                             ` Parav Pandit
2023-10-19  8:31                                               ` Michael S. Tsirkin
2023-10-19  8:58                                                 ` Parav Pandit
2023-10-19  9:11                                                   ` Michael S. Tsirkin
2023-10-19  9:20                                                     ` Parav Pandit
2023-10-19  9:26                                                       ` Michael S. Tsirkin
2023-10-19  9:33                                                         ` Michael S. Tsirkin
2023-10-19  9:41                                                           ` Parav Pandit
2023-10-19  9:53                                                             ` Michael S. Tsirkin
2023-10-19  9:54                                                               ` Michael S. Tsirkin
2023-10-19 10:00                                                               ` Parav Pandit
2023-10-19 10:01                                                                 ` Parav Pandit
2023-10-19  9:39                                                         ` Parav Pandit
2023-10-19  9:49                                                           ` Michael S. Tsirkin
2023-10-19  9:57                                                             ` Parav Pandit
2023-10-23  3:45                                           ` Jason Wang
2023-10-23  4:42                                             ` Parav Pandit
2023-10-24  4:46                                               ` Jason Wang
2023-10-24  4:49                                                 ` Parav Pandit
2023-10-25  1:28                                                   ` Jason Wang
2023-10-25  7:02                                                     ` Parav Pandit
2023-10-26  0:46                                                       ` Jason Wang
2023-10-26  3:45                                                         ` Parav Pandit
2023-10-30  4:06                                                           ` Jason Wang
2023-10-30  4:46                                                             ` Parav Pandit
2023-10-31  1:34                                                               ` Jason Wang
2023-10-31  5:30                                                                 ` Parav Pandit
2023-11-01  0:33                                                                   ` Jason Wang
2023-11-01  3:31                                                                     ` Parav Pandit
2023-11-02  4:25                                                                       ` Jason Wang
2023-11-02  6:10                                                                         ` Parav Pandit
2023-11-06  6:34                                                                           ` Jason Wang
2023-11-06  7:05                                                                             ` Parav Pandit
2023-11-07  4:05                                                                               ` Jason Wang
2023-11-07  7:22                                                                                 ` Michael S. Tsirkin
2023-11-07  7:57                                                                                   ` Zhu, Lingshan
2023-11-07  8:05                                                                                     ` Michael S. Tsirkin
2023-11-08  4:28                                                                                   ` Jason Wang
2023-11-09  6:25                                                                                 ` Parav Pandit
2023-11-13  3:32                                                                                   ` Jason Wang
2023-11-15 17:39                                                                                     ` Parav Pandit
2023-11-16  4:20                                                                                       ` Jason Wang
2023-11-16  5:28                                                                                         ` Parav Pandit
2023-11-16  6:23                                                                                           ` Michael S. Tsirkin
2023-11-16  6:34                                                                                             ` Parav Pandit
2023-11-16  6:38                                                                                               ` Michael S. Tsirkin
2023-11-16  6:43                                                                                                 ` Parav Pandit
2023-11-16  6:56                                                                                                   ` Michael S. Tsirkin
2023-11-16  7:02                                                                                                     ` Parav Pandit
2023-11-16  7:14                                                                                                       ` Michael S. Tsirkin
2023-11-16  9:45                                                                                                         ` Parav Pandit
2023-11-21  4:22                                                                                               ` Jason Wang
2023-11-21 16:25                                                                                                 ` Parav Pandit
2023-11-22  4:13                                                                                                   ` Jason Wang
2023-11-22  7:48                                                                                                     ` Michael S. Tsirkin
2023-11-24  3:56                                                                                                       ` Jason Wang
2023-11-24  5:40                                                                                                         ` Michael S. Tsirkin
2023-11-21  7:24                                                                                           ` Jason Wang
2023-11-21 16:32                                                                                             ` Parav Pandit
2023-11-22  5:27                                                                                               ` Jason Wang
2023-11-22  6:05                                                                                                 ` Parav Pandit
2023-11-24  3:40                                                                                                   ` Jason Wang
2023-11-17 10:08                                                                                       ` Michael S. Tsirkin
2023-11-17 10:20                                                                                         ` Parav Pandit
2023-11-17 11:11                                                                                           ` Michael S. Tsirkin
2023-11-17 11:20                                                                                             ` Parav Pandit
2023-11-17 11:43                                                                                               ` Michael S. Tsirkin
2023-11-17 12:02                                                                                                 ` Parav Pandit
2023-11-17 12:30                                                                                                   ` Michael S. Tsirkin
2023-11-17 12:46                                                                                                     ` Parav Pandit
2023-11-17 13:54                                                                                                       ` Michael S. Tsirkin
2023-11-17 14:51                                                                                                         ` Parav Pandit
2023-11-17 15:09                                                                                                           ` Michael S. Tsirkin
2023-11-21  4:44                                                                                                             ` Jason Wang
2023-11-21 16:27                                                                                                               ` Parav Pandit
2023-11-22  4:16                                                                                                                 ` Jason Wang
2023-11-22  4:39                                                                                                                   ` Parav Pandit
2023-11-24  3:08                                                                                                                     ` Jason Wang
2023-11-21  5:25                                                                                                   ` Jason Wang
2023-11-21 16:30                                                                                                     ` Parav Pandit
2023-11-22  4:18                                                                                                       ` Jason Wang
2023-11-22  4:26                                                                                                         ` Parav Pandit
2023-11-24  3:07                                                                                                           ` Jason Wang
2023-11-24 11:38                                                                                                             ` Michael S. Tsirkin
2023-11-24 11:51                                                                                                               ` Jason Wang
2023-11-24 12:10                                                                                                                 ` Michael S. Tsirkin
2023-11-24 12:13                                                                                                                   ` Parav Pandit
2023-11-24 12:19                                                                                                                     ` Michael S. Tsirkin
2023-10-13 11:26                         ` Michael S. Tsirkin
2023-10-13 11:41                           ` Parav Pandit
2023-10-13 11:52                             ` Michael S. Tsirkin
2023-10-13 11:57                               ` Parav Pandit
2023-10-17  1:42                           ` Jason Wang
2023-10-13  9:06                       ` Zhu, Lingshan
2023-10-13 11:28                         ` Michael S. Tsirkin
2023-10-13 11:42                           ` Parav Pandit
2023-10-16  8:41                           ` Zhu, Lingshan
2023-10-16  9:00                             ` Michael S. Tsirkin
2023-10-16  9:44                               ` Zhu, Lingshan
2023-10-13 11:28                         ` Parav Pandit
2023-10-13 11:49                           ` Michael S. Tsirkin
2023-10-13 12:00                             ` Parav Pandit
2023-10-16  8:46                             ` Zhu, Lingshan
2023-10-16  9:44                           ` Zhu, Lingshan
2023-10-18  5:00                             ` Parav Pandit
2023-10-18  6:32                               ` Zhu, Lingshan
2023-10-18  6:34                                 ` Parav Pandit
2023-10-18  6:39                                 ` Zhu, Lingshan
2023-10-18  6:42                                   ` Parav Pandit
2023-10-11 19:51             ` Michael S. Tsirkin
2023-10-12 10:23               ` Zhu, Lingshan
2023-10-08 11:25 ` [virtio-comment] [PATCH v1 2/8] admin: Redefine reserved2 as command specific output Parav Pandit
2023-10-08 11:25 ` [virtio-comment] [PATCH v1 3/8] device-context: Define the device context fields for device migration Parav Pandit
2023-10-08 11:41   ` [virtio-comment] " Michael S. Tsirkin
2023-10-09  4:15     ` Parav Pandit
2023-10-09 15:54       ` Michael S. Tsirkin
2023-10-09 17:22         ` Parav Pandit
2023-10-09 10:34     ` Zhu, Lingshan
2023-10-09 14:30       ` Parav Pandit
2023-10-10  8:52         ` Zhu, Lingshan
2023-10-10  9:58           ` Parav Pandit
2023-10-11 10:07             ` Zhu, Lingshan
2023-10-11 10:54               ` Parav Pandit
2023-10-11 19:54                 ` Michael S. Tsirkin
2023-10-12 10:00                 ` Zhu, Lingshan
2023-10-12 10:06                   ` Michael S. Tsirkin
2023-10-12 10:13                     ` Parav Pandit
2023-10-12 10:52                     ` Zhu, Lingshan
2023-10-12 10:09                   ` Parav Pandit
2023-10-12 10:45                     ` Michael S. Tsirkin
2023-10-12 11:23                       ` Parav Pandit
2023-10-12 11:10                     ` Zhu, Lingshan
2023-10-12 11:37                       ` Parav Pandit
2023-10-12 13:03                         ` Michael S. Tsirkin
2023-10-12 13:13                           ` Parav Pandit
2023-10-13  1:18                         ` Jason Wang
2023-10-13  6:40                           ` Parav Pandit
2023-10-17  2:10                             ` Jason Wang
2023-10-17  3:45                               ` Parav Pandit
2023-10-18  0:52                                 ` Jason Wang
2023-10-18  5:28                                   ` Parav Pandit
2023-10-19  2:41                                     ` Jason Wang
2023-10-18  6:13                                   ` Michael S. Tsirkin
2023-10-13  9:44                         ` Zhu, Lingshan
2023-10-13 11:54                           ` Parav Pandit
2023-10-16  9:47                             ` Zhu, Lingshan
2023-10-18  5:02                               ` Parav Pandit
2023-10-18  6:20                                 ` Michael S. Tsirkin
2023-10-18  6:28                                   ` Parav Pandit
2023-10-18  6:35                                 ` Zhu, Lingshan
2023-10-18  6:41                                   ` Parav Pandit
2023-10-18  6:52                                     ` Zhu, Lingshan
2023-10-18  7:20                                       ` Parav Pandit
2023-10-18  8:42                                         ` Zhu, Lingshan
2023-10-18  8:53                                           ` Michael S. Tsirkin
2023-10-18  9:48                                           ` Parav Pandit
2023-10-18  9:56                                             ` Michael S. Tsirkin
2023-10-18 10:22                                               ` Parav Pandit
2023-10-18 10:47                                                 ` Michael S. Tsirkin
2023-10-18 10:57                                                   ` Parav Pandit
2023-10-19  8:18                                                   ` Zhu, Lingshan
2023-10-19  8:37                                                     ` Michael S. Tsirkin
2023-10-19  8:49                                                       ` Zhu, Lingshan
2023-10-19  8:55                                                         ` Michael S. Tsirkin
2023-10-23  3:44                                                 ` Jason Wang
2023-10-23  4:42                                                   ` Parav Pandit
2023-10-24  4:56                                                     ` Jason Wang
2023-10-24 10:01                                                       ` Parav Pandit
2023-10-25  1:28                                                         ` Jason Wang
2023-10-25  7:15                                                           ` Parav Pandit
2023-10-25  8:24                                                             ` Michael S. Tsirkin
2023-10-25  9:50                                                               ` Parav Pandit
2023-10-25 10:19                                                                 ` Michael S. Tsirkin
2023-10-25 10:22                                                                   ` Parav Pandit
2023-10-25 10:28                                                                     ` Michael S. Tsirkin
2023-10-26  3:32                                                                       ` Parav Pandit
2023-10-26  0:46                                                             ` Jason Wang
2023-10-26  3:50                                                               ` Parav Pandit
2023-10-30  4:04                                                                 ` Jason Wang
2023-10-30  4:27                                                                   ` Parav Pandit
2023-10-31  1:36                                                                     ` Jason Wang
2023-10-31  5:17                                                                       ` Parav Pandit
2023-11-01  0:33                                                                         ` Jason Wang
2023-11-01  3:07                                                                           ` Parav Pandit
2023-11-02  4:24                                                                             ` Jason Wang
2023-11-02  6:10                                                                               ` Parav Pandit
2023-11-02 14:01                                                                                 ` Michael S. Tsirkin
2023-11-06  6:35                                                                                 ` Jason Wang
2023-11-09  6:24                                                                                   ` Parav Pandit
2023-10-19  8:15                                             ` Zhu, Lingshan
2023-10-19  9:01                                               ` Parav Pandit
2023-10-19  9:09                                                 ` Zhu, Lingshan
2023-10-19  9:13                                                   ` Parav Pandit
2023-10-19  9:14                                                     ` Michael S. Tsirkin
2023-10-19  9:18                                                       ` Zhu, Lingshan
2023-10-19 10:33                                                         ` Parav Pandit
2023-10-19 11:19                                                           ` Michael S. Tsirkin
2023-10-19 12:02                                                             ` Parav Pandit
2023-10-20  9:31                                                           ` Zhu, Lingshan
2023-10-20  9:41                                                             ` Michael S. Tsirkin
2023-10-20 11:11                                                               ` Zhu, Lingshan
2023-10-20 12:47                                                                 ` Parav Pandit
2023-10-23  9:48                                                                   ` Zhu, Lingshan
2023-10-23 10:01                                                                     ` Parav Pandit
2023-10-23 10:14                                                                       ` Zhu, Lingshan
2023-10-23 10:26                                                                         ` Parav Pandit
2023-10-24 10:10                                                                           ` Zhu, Lingshan
2023-10-24 10:11                                                                             ` Parav Pandit
2023-10-21 15:34                                                                 ` Michael S. Tsirkin
2023-10-23 10:03                                                                   ` Zhu, Lingshan
2023-10-23 11:32                                                                     ` Michael S. Tsirkin
2023-10-24 10:27                                                                       ` Zhu, Lingshan
2023-10-25  8:33                                                                         ` Michael S. Tsirkin
2023-10-26  0:56                                                                           ` Jason Wang
2023-10-26  3:58                                                                             ` Parav Pandit
2023-10-30  3:59                                                                               ` Jason Wang
2023-10-30  4:49                                                                                 ` Parav Pandit
2023-10-26  6:22                                                                             ` Michael S. Tsirkin
2023-10-30  4:02                                                                               ` Jason Wang
2023-11-01  0:33                                                                               ` Jason Wang
2023-10-26  6:38                                                                           ` Zhu, Lingshan
2023-10-23  3:53                                                               ` Jason Wang
2023-10-23 11:33                                                                 ` Michael S. Tsirkin
2023-10-20 12:54                                                             ` Parav Pandit
2023-10-23 10:09                                                               ` Zhu, Lingshan
2023-10-23 10:14                                                                 ` Parav Pandit
2023-10-24 10:30                                                                   ` Zhu, Lingshan
2023-10-24 10:37                                                                     ` Parav Pandit
2023-10-26  6:44                                                                       ` Zhu, Lingshan
2023-10-26  7:04                                                                         ` Parav Pandit
2023-10-30  3:44                                                                           ` Zhu, Lingshan
2023-10-30  4:17                                                                             ` Parav Pandit
2023-10-30 10:02                                                                               ` Zhu, Lingshan
2023-10-30 10:23                                                                                 ` Parav Pandit
2023-10-30 11:34                                                                                   ` Michael S. Tsirkin
2023-10-30 12:02                                                                                     ` Parav Pandit
2023-10-31  9:35                                                                                     ` Zhu, Lingshan
2023-10-31  9:42                                                                                   ` Zhu, Lingshan
2023-10-31 10:14                                                                                     ` Michael S. Tsirkin
2023-11-01  0:42                                                                                       ` Jason Wang
2023-11-01  1:57                                                                                         ` Zhu, Lingshan
2023-11-01  1:57                                                                                       ` Zhu, Lingshan
2023-11-01  2:54                                                                                       ` Parav Pandit
2023-11-01  5:31                                                                                         ` Michael S. Tsirkin
2023-11-01  5:42                                                                                           ` Parav Pandit
2023-11-01  6:37                                                                                             ` Michael S. Tsirkin
2023-11-01  6:39                                                                                               ` Zhu, Lingshan
2023-11-01  6:50                                                                                                 ` Parav Pandit
2023-11-01  6:56                                                                                                   ` Zhu, Lingshan
2023-11-01  7:03                                                                                                     ` Parav Pandit
2023-11-01  7:46                                                                                                       ` Zhu, Lingshan
2023-11-01  7:54                                                                                                         ` Parav Pandit
2023-11-01  8:55                                                                                                           ` Zhu, Lingshan
2023-11-01  9:07                                                                                                             ` Michael S. Tsirkin
2023-11-01  9:42                                                                                                               ` Zhu, Lingshan
2023-11-01 10:23                                                                                                                 ` Michael S. Tsirkin
2023-11-01  8:36                                                                                                   ` Michael S. Tsirkin
2023-11-01 10:24                                                                                                     ` Parav Pandit
2023-11-01  6:47                                                                                               ` Parav Pandit
2023-11-01  8:28                                                                                                 ` Michael S. Tsirkin
2023-11-01  8:49                                                                                                   ` Parav Pandit
2023-11-01  9:06                                                                                                     ` Michael S. Tsirkin
2023-11-01 10:01                                                                                                       ` Parav Pandit
2023-10-30 11:27                                                                           ` Michael S. Tsirkin
2023-10-30 11:48                                                                             ` Parav Pandit
2023-10-31  9:45                                                                             ` Zhu, Lingshan
2023-10-19  9:16                                                     ` Zhu, Lingshan
2023-10-19  9:13                                                 ` Michael S. Tsirkin
2023-10-13 13:49                           ` Michael S. Tsirkin
2023-10-16  9:50                             ` Zhu, Lingshan
2023-11-02 14:21   ` Michael S. Tsirkin
2023-11-02 14:40     ` [virtio-comment] " Parav Pandit
2023-11-02 14:53       ` [virtio-comment] " Michael S. Tsirkin
2023-11-02 15:06         ` [virtio-comment] " Parav Pandit
2023-11-02 17:05           ` [virtio-comment] " Michael S. Tsirkin
2023-10-08 11:25 ` [virtio-comment] [PATCH v1 4/8] admin: Add device migration admin commands Parav Pandit
2023-10-18  6:46   ` [virtio-comment] " Michael S. Tsirkin
2023-10-18  8:24     ` [virtio-comment] " Parav Pandit
2023-10-18 10:26       ` [virtio-comment] " Michael S. Tsirkin
2023-10-18 10:41         ` [virtio-comment] " Parav Pandit
2023-10-08 11:25 ` [virtio-comment] [PATCH v1 5/8] admin: Add requirements of device migration commands Parav Pandit
2023-10-08 11:25 ` [virtio-comment] [PATCH v1 6/8] admin: Add theory of operation for write recording commands Parav Pandit
2023-10-08 11:25 ` [virtio-comment] [PATCH v1 7/8] admin: Add " Parav Pandit
2023-10-08 11:52   ` [virtio-comment] " Michael S. Tsirkin
2023-10-09  4:14     ` [virtio-comment] " Parav Pandit
2023-10-09 10:57       ` [virtio-comment] " Michael S. Tsirkin
2023-10-09 11:48         ` Parav Pandit
2023-10-09 16:15           ` Michael S. Tsirkin
2023-10-09 17:22             ` Parav Pandit
2023-10-08 11:25 ` [virtio-comment] [PATCH v1 8/8] admin: Add requirements of write reporting commands Parav Pandit
