All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 1/1] virtio-gpu: get p2pdma distance
@ 2025-02-26  9:27 Julia Zhang
  2025-02-26 10:44 ` Matias Ezequiel Vara Larsen
  2025-02-26 10:51 ` Parav Pandit
  0 siblings, 2 replies; 21+ messages in thread
From: Julia Zhang @ 2025-02-26  9:27 UTC (permalink / raw)
  To: virtio-comment
  Cc: Michael S . Tsirkin, Chen Jiqian, Zhu Lingshan, Huang Rui,
	Julia Zhang, Matias Ezequiel Vara Larsen

PCI peer-to-peer DMA transaction may be used in guest for some scenes.
For example, dGPU prime feature will let virtio-iGPU access to
passthrough dGPU buffer.

To support P2P DMA transaction in guest, virtio-gpu needs to check the
compatibility which is represented by p2pdma_distance. This defines a
command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to allow guest send virtual pci
notation of two devices to host and get host physical p2pdma_distance
of these two PCI devices.

Fixes:https://github.com/oasis-tcs/virtio-spec/issues/217
Signed-off-by: Julia Zhang <julia.zhang@amd.com>
Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
---
Hi all,

We are trying to implement dGPU prime feature in guest which will let
virtio-iGPU read rendered data of passthrough dGPU. Before that, virtio gpu
driver needs to get p2pdma_distance to check if P2P DMA transaction is
possible or not.

To implement getting p2pdma_distance, QEMU needs to handle the command from
guest with virtual pci notations of two PCI devices and send it to host
kernel and return host physical distance back to guest.

So this defines the new command follow the suggestion in https:
//lore.kernel.org/all/20241207105537.542441-4-julia.zhang@amd.com/

changes from v1:
 - add issue link to commit msg

Regards,
Julia
---
 device-types/gpu/description.tex | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/device-types/gpu/description.tex b/device-types/gpu/description.tex
index 4435248..9d0f30b 100644
--- a/device-types/gpu/description.tex
+++ b/device-types/gpu/description.tex
@@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request header}\label{sec:Device Types / GPU De
         VIRTIO_GPU_CMD_SUBMIT_3D,
         VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
         VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
+        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
 
         /* cursor commands */
         VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300,
@@ -236,6 +237,7 @@ \subsubsection{Device Operation: Request header}\label{sec:Device Types / GPU De
         VIRTIO_GPU_RESP_OK_EDID,
         VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
         VIRTIO_GPU_RESP_OK_MAP_INFO,
+        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
 
         /* error responses */
         VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200,
@@ -791,6 +793,33 @@ \subsubsection{Device Operation: controlq (3d)}\label{sec:Device Types / GPU Dev
 };
 \end{lstlisting}
 
+\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the cumulative distance
+  between two devices before Peer-to-Peer(P2P) transaction. Request data
+  is \field{struct virtio_gpu_device_p2pdma_distance}. Response type is
+  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
+  \field{virtio_gpu_resp_distance}.
+
+\begin{lstlisting}
+struct virtio_gpu_device_p2pdma_distance {
+        struct virtio_gpu_ctrl_hdr hdr;
+        le32 provider_bus;
+        le32 provider_slot;
+        le32 provider_func;
+        le32 client_bus;
+        le32 client_slot;
+        le32 client_func;
+};
+
+struct virtio_gpu_resp_distance {
+        struct virtio_gpu_ctrl_hdr hdr;
+        le32 distance;
+}
+\end{lstlisting}
+
+The request data contains guest virtual pci notations of p2pdma provider
+and client. The response contains the distance between provider and client,
+which represents the compatibility of these two PCI devices.
+
 \end{description}
 
 \subsubsection{Device Operation: cursorq}\label{sec:Device Types / GPU Device / Device Operation / Device Operation: cursorq}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-02-26  9:27 [PATCH v2 1/1] virtio-gpu: get p2pdma distance Julia Zhang
@ 2025-02-26 10:44 ` Matias Ezequiel Vara Larsen
  2025-02-26 10:51 ` Parav Pandit
  1 sibling, 0 replies; 21+ messages in thread
From: Matias Ezequiel Vara Larsen @ 2025-02-26 10:44 UTC (permalink / raw)
  To: Julia Zhang
  Cc: virtio-comment, Michael S . Tsirkin, Chen Jiqian, Zhu Lingshan,
	Huang Rui

On Wed, Feb 26, 2025 at 05:27:54PM +0800, Julia Zhang wrote:
> PCI peer-to-peer DMA transaction may be used in guest for some scenes.
> For example, dGPU prime feature will let virtio-iGPU access to
> passthrough dGPU buffer.
> 
> To support P2P DMA transaction in guest, virtio-gpu needs to check the
> compatibility which is represented by p2pdma_distance. This defines a
> command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to allow guest send virtual pci
> notation of two devices to host and get host physical p2pdma_distance
> of these two PCI devices.
> 
> Fixes:https://github.com/oasis-tcs/virtio-spec/issues/217
> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
> ---
> Hi all,
> 
> We are trying to implement dGPU prime feature in guest which will let
> virtio-iGPU read rendered data of passthrough dGPU. Before that, virtio gpu
> driver needs to get p2pdma_distance to check if P2P DMA transaction is
> possible or not.
> 
> To implement getting p2pdma_distance, QEMU needs to handle the command from
> guest with virtual pci notations of two PCI devices and send it to host
> kernel and return host physical distance back to guest.
> 
> So this defines the new command follow the suggestion in https:
> //lore.kernel.org/all/20241207105537.542441-4-julia.zhang@amd.com/
> 
> changes from v1:
>  - add issue link to commit msg
> 
> Regards,
> Julia
> ---

I think what is missing is to explicitly request a vote as is explained
in 2) at
https://github.com/oasis-tcs/virtio-spec/blob/master/README.md#use-of-github-issues.

Thanks, Matias


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-02-26  9:27 [PATCH v2 1/1] virtio-gpu: get p2pdma distance Julia Zhang
  2025-02-26 10:44 ` Matias Ezequiel Vara Larsen
@ 2025-02-26 10:51 ` Parav Pandit
  2025-03-05 10:41   ` Zhang, Julia
  1 sibling, 1 reply; 21+ messages in thread
From: Parav Pandit @ 2025-02-26 10:51 UTC (permalink / raw)
  To: Julia Zhang, virtio-comment@lists.linux.dev
  Cc: Michael S . Tsirkin, Chen Jiqian, Zhu Lingshan, Huang Rui,
	Matias Ezequiel Vara Larsen


> From: Julia Zhang <julia.zhang@amd.com>
> Sent: Wednesday, February 26, 2025 2:58 PM
> To: virtio-comment@lists.linux.dev
> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>; Huang
> Rui <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>; Matias
> Ezequiel Vara Larsen <mvaralar@redhat.com>
> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
> 
> PCI peer-to-peer DMA transaction may be used in guest for some scenes.
> For example, dGPU prime feature will let virtio-iGPU access to passthrough
> dGPU buffer.
> 
> To support P2P DMA transaction in guest, virtio-gpu needs to check the
> compatibility which is represented by p2pdma_distance. This defines a
> command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to allow guest send virtual
> pci notation of two devices to host and get host physical p2pdma_distance of
> these two PCI devices.
> 
> Fixes:https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2F
> github.com%2Foasis-tcs%2Fvirtio-
> spec%2Fissues%2F217&data=05%7C02%7Cparav%40nvidia.com%7C86ac1a37
> e844437318e308dd56480fa2%7C43083d15727340c1b7db39efd9ccc17a%7C0
> %7C0%7C638761589695627211%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0e
> U1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCI
> sIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=IHhSmc4IGcvfStZERt0tv6hp5Se
> Ifsm1M5xIb37T9Tg%3D&reserved=0
> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
> ---
> Hi all,
> 
> We are trying to implement dGPU prime feature in guest which will let virtio-
> iGPU read rendered data of passthrough dGPU. Before that, virtio gpu driver
> needs to get p2pdma_distance to check if P2P DMA transaction is possible or
> not.
> 
> To implement getting p2pdma_distance, QEMU needs to handle the
> command from guest with virtual pci notations of two PCI devices and send it
> to host kernel and return host physical distance back to guest.
> 
> So this defines the new command follow the suggestion in https:
> //lore.kernel.org/all/20241207105537.542441-4-julia.zhang@amd.com/
> 
> changes from v1:
>  - add issue link to commit msg
> 
> Regards,
> Julia
> ---
>  device-types/gpu/description.tex | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/device-types/gpu/description.tex b/device-
> types/gpu/description.tex
> index 4435248..9d0f30b 100644
> --- a/device-types/gpu/description.tex
> +++ b/device-types/gpu/description.tex
> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
> header}\label{sec:Device Types / GPU De
>          VIRTIO_GPU_CMD_SUBMIT_3D,
>          VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>          VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
> 
>          /* cursor commands */
>          VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6 +237,7 @@
> \subsubsection{Device Operation: Request header}\label{sec:Device Types /
> GPU De
>          VIRTIO_GPU_RESP_OK_EDID,
>          VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>          VIRTIO_GPU_RESP_OK_MAP_INFO,
> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
> 
>          /* error responses */
>          VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6 +793,33 @@
> \subsubsection{Device Operation: controlq (3d)}\label{sec:Device Types / GPU
> Dev  };  \end{lstlisting}
> 
> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the cumulative
> +distance
> +  between two devices before Peer-to-Peer(P2P) transaction. Request
> +data
> +  is \field{struct virtio_gpu_device_p2pdma_distance}. Response type is
> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
> +  \field{virtio_gpu_resp_distance}.
> +
> +\begin{lstlisting}
> +struct virtio_gpu_device_p2pdma_distance {
> +        struct virtio_gpu_ctrl_hdr hdr;
> +        le32 provider_bus;
> +        le32 provider_slot;
> +        le32 provider_func;
> +        le32 client_bus;
> +        le32 client_slot;
> +        le32 client_func;
> +};
> +
I assume above bdf information is what is visible to the guest?
Why cannot it see this distance in the guest VM itself?

> +struct virtio_gpu_resp_distance {
> +        struct virtio_gpu_ctrl_hdr hdr;
> +        le32 distance;
> +}
> +\end{lstlisting}
I am not gpu expert so didn't pay attention to details.
I believe you are missing basic information in the patch is what is the unit of distance?
Can you please add?

Can you please wait for few days to raise the vote and sort out this detail?

> +
> +The request data contains guest virtual pci notations of p2pdma
> +provider and client. The response contains the distance between
> +provider and client, which represents the compatibility of these two PCI
> devices.
> +
>  \end{description}
> 
>  \subsubsection{Device Operation: cursorq}\label{sec:Device Types / GPU
> Device / Device Operation / Device Operation: cursorq}
> --
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-02-26 10:51 ` Parav Pandit
@ 2025-03-05 10:41   ` Zhang, Julia
  2025-03-06  4:29     ` Parav Pandit
  0 siblings, 1 reply; 21+ messages in thread
From: Zhang, Julia @ 2025-03-05 10:41 UTC (permalink / raw)
  To: Parav Pandit, Zhang, Julia, virtio-comment@lists.linux.dev
  Cc: Michael S . Tsirkin, Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen



On 2025/2/26 18:51, Parav Pandit wrote:
> 
>> From: Julia Zhang <julia.zhang@amd.com>
>> Sent: Wednesday, February 26, 2025 2:58 PM
>> To: virtio-comment@lists.linux.dev
>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
>> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>; Huang
>> Rui <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>; Matias
>> Ezequiel Vara Larsen <mvaralar@redhat.com>
>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
>>
>> PCI peer-to-peer DMA transaction may be used in guest for some scenes.
>> For example, dGPU prime feature will let virtio-iGPU access to passthrough
>> dGPU buffer.
>>
>> To support P2P DMA transaction in guest, virtio-gpu needs to check the
>> compatibility which is represented by p2pdma_distance. This defines a
>> command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to allow guest send virtual
>> pci notation of two devices to host and get host physical p2pdma_distance of
>> these two PCI devices.
>>
>> Fixes:https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2F
>> github.com%2Foasis-tcs%2Fvirtio-
>> spec%2Fissues%2F217&data=05%7C02%7Cparav%40nvidia.com%7C86ac1a37
>> e844437318e308dd56480fa2%7C43083d15727340c1b7db39efd9ccc17a%7C0
>> %7C0%7C638761589695627211%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0e
>> U1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCI
>> sIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=IHhSmc4IGcvfStZERt0tv6hp5Se
>> Ifsm1M5xIb37T9Tg%3D&reserved=0
>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
>> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
>> ---
>> Hi all,
>>
>> We are trying to implement dGPU prime feature in guest which will let virtio-
>> iGPU read rendered data of passthrough dGPU. Before that, virtio gpu driver
>> needs to get p2pdma_distance to check if P2P DMA transaction is possible or
>> not.
>>
>> To implement getting p2pdma_distance, QEMU needs to handle the
>> command from guest with virtual pci notations of two PCI devices and send it
>> to host kernel and return host physical distance back to guest.
>>
>> So this defines the new command follow the suggestion in https:
>> //lore.kernel.org/all/20241207105537.542441-4-julia.zhang@amd.com/
>>
>> changes from v1:
>>   - add issue link to commit msg
>>
>> Regards,
>> Julia
>> ---
>>   device-types/gpu/description.tex | 29 +++++++++++++++++++++++++++++
>>   1 file changed, 29 insertions(+)
>>
>> diff --git a/device-types/gpu/description.tex b/device-
>> types/gpu/description.tex
>> index 4435248..9d0f30b 100644
>> --- a/device-types/gpu/description.tex
>> +++ b/device-types/gpu/description.tex
>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
>> header}\label{sec:Device Types / GPU De
>>           VIRTIO_GPU_CMD_SUBMIT_3D,
>>           VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>>           VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
>>
>>           /* cursor commands */
>>           VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6 +237,7 @@
>> \subsubsection{Device Operation: Request header}\label{sec:Device Types /
>> GPU De
>>           VIRTIO_GPU_RESP_OK_EDID,
>>           VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>>           VIRTIO_GPU_RESP_OK_MAP_INFO,
>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
>>
>>           /* error responses */
>>           VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6 +793,33 @@
>> \subsubsection{Device Operation: controlq (3d)}\label{sec:Device Types / GPU
>> Dev  };  \end{lstlisting}
>>
>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the cumulative
>> +distance
>> +  between two devices before Peer-to-Peer(P2P) transaction. Request
>> +data
>> +  is \field{struct virtio_gpu_device_p2pdma_distance}. Response type is
>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
>> +  \field{virtio_gpu_resp_distance}.
>> +
>> +\begin{lstlisting}
>> +struct virtio_gpu_device_p2pdma_distance {
>> +        struct virtio_gpu_ctrl_hdr hdr;
>> +        le32 provider_bus;
>> +        le32 provider_slot;
>> +        le32 provider_func;
>> +        le32 client_bus;
>> +        le32 client_slot;
>> +        le32 client_func;
>> +};
>> +
> I assume above bdf information is what is visible to the guest?
> Why cannot it see this distance in the guest VM itself?

Guest can see virtual bdf information which is not the real bdf 
information of the host. So guest can't get the correct distance. That's 
why I need to implement a new command to get and pass the real distance.

Please check previous discussion here:
https://lore.kernel.org/all/bfe6a8b7-adcf-40e2-b7a2-4e64dc96862d@amd.com/

> 
>> +struct virtio_gpu_resp_distance {
>> +        struct virtio_gpu_ctrl_hdr hdr;
>> +        le32 distance;
>> +}
>> +\end{lstlisting}
> I am not gpu expert so didn't pay attention to details.
> I believe you are missing basic information in the patch is what is the unit of distance?
> Can you please add?

The definition of distance is explained in kernel file:  driver/pci/p2pdma.c:

/*
 * Calculate the P2PDMA mapping type and distance between two PCI devices.
 *
 * If the two devices are the same PCI function, return
 * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
 *
 * If they are two functions of the same device, return
 * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up to the bridge,
 * then one hop back down to another function of the same device).
 *
 * In the case where two devices are connected to the same PCIe switch,
 * return a distance of 4. This corresponds to the following PCI tree:
 *
 *     -+  Root Port
 *      \+ Switch Upstream Port
 *       +-+ Switch Downstream Port 0
 *       + \- Device A
 *       \-+ Switch Downstream Port 1
 *         \- Device B
 *
 * The distance is 4 because we traverse from Device A to Downstream Port 0
 * to the common Switch Upstream Port, back down to Downstream Port 1 and
 * then to Device B. The mapping type returned depends on the ACS
 * redirection setting of the ports along the path.
 *
 * If ACS redirect is set on any port in the path, traffic between the
 * devices will go through the host bridge, so return
 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
 * PCI_P2PDMA_MAP_BUS_ADDR.
 *
 * Any two devices that have a data path that goes through the host bridge
 * will consult a whitelist. If the host bridge is in the whitelist, return
 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set to the number of
 * ports per above. If the device is not in the whitelist, return
 * PCI_P2PDMA_MAP_NOT_SUPPORTED.
 */

I didn’t change the unit, do you mean that I should add above information to this patch?

Regards,
Julia

> 
> Can you please wait for few days to raise the vote and sort out this detail?
> 
>> +
>> +The request data contains guest virtual pci notations of p2pdma
>> +provider and client. The response contains the distance between
>> +provider and client, which represents the compatibility of these two PCI
>> devices.
>> +
>>   \end{description}
>>
>>   \subsubsection{Device Operation: cursorq}\label{sec:Device Types / GPU
>> Device / Device Operation / Device Operation: cursorq}
>> --
>> 2.34.1
>>
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-05 10:41   ` Zhang, Julia
@ 2025-03-06  4:29     ` Parav Pandit
  2025-03-06  7:16       ` Zhang, Julia
  0 siblings, 1 reply; 21+ messages in thread
From: Parav Pandit @ 2025-03-06  4:29 UTC (permalink / raw)
  To: Zhang, Julia, virtio-comment@lists.linux.dev
  Cc: Michael S . Tsirkin, Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen

> From: Zhang, Julia <Julia.Zhang@amd.com>
> Sent: Wednesday, March 5, 2025 4:12 PM
> 
> On 2025/2/26 18:51, Parav Pandit wrote:
> >
> >> From: Julia Zhang <julia.zhang@amd.com>
> >> Sent: Wednesday, February 26, 2025 2:58 PM
> >> To: virtio-comment@lists.linux.dev
> >> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
> >> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>; Huang
> Rui
> >> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>; Matias
> >> Ezequiel Vara Larsen <mvaralar@redhat.com>
> >> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
> >>
> >> PCI peer-to-peer DMA transaction may be used in guest for some scenes.
> >> For example, dGPU prime feature will let virtio-iGPU access to
> >> passthrough dGPU buffer.
Virtio specification does not describe dGPU and iGPU.
So either describe it in the description or please drop that and explain the motivation without referring to those terms.

> >>
> >> To support P2P DMA transaction in guest, virtio-gpu needs to check
> >> the compatibility which is represented by p2pdma_distance. This
> >> defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to allow guest
> send
> >> virtual pci notation of two devices to host and get host physical
> >> p2pdma_distance of these two PCI devices.
> >>
[..]
> >> Ifsm1M5xIb37T9Tg%3D&reserved=0
> >> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
> >> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
> >> ---
> >> Hi all,
> >>
> >> We are trying to implement dGPU prime feature in guest which will let
> >> virtio- iGPU read rendered data of passthrough dGPU. Before that,
> >> virtio gpu driver needs to get p2pdma_distance to check if P2P DMA
> >> transaction is possible or not.
> >>
> >> To implement getting p2pdma_distance, QEMU needs to handle the
> >> command from guest with virtual pci notations of two PCI devices and
> >> send it to host kernel and return host physical distance back to guest.
> >>
> >> So this defines the new command follow the suggestion in https:
> >> //lore.kernel.org/all/20241207105537.542441-4-julia.zhang@amd.com/
> >>
> >> changes from v1:
> >>   - add issue link to commit msg
> >>
> >> Regards,
> >> Julia
> >> ---
> >>   device-types/gpu/description.tex | 29
> +++++++++++++++++++++++++++++
> >>   1 file changed, 29 insertions(+)
> >>
> >> diff --git a/device-types/gpu/description.tex b/device-
> >> types/gpu/description.tex index 4435248..9d0f30b 100644
> >> --- a/device-types/gpu/description.tex
> >> +++ b/device-types/gpu/description.tex
> >> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
> >> header}\label{sec:Device Types / GPU De
> >>           VIRTIO_GPU_CMD_SUBMIT_3D,
> >>           VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
> >>           VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
> >> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
> >>
> >>           /* cursor commands */
> >>           VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6 +237,7
> @@
> >> \subsubsection{Device Operation: Request header}\label{sec:Device
> >> Types / GPU De
> >>           VIRTIO_GPU_RESP_OK_EDID,
> >>           VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
> >>           VIRTIO_GPU_RESP_OK_MAP_INFO,
> >> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
> >>
> >>           /* error responses */
> >>           VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6 +793,33 @@
> >> \subsubsection{Device Operation: controlq (3d)}\label{sec:Device
> >> Types / GPU Dev  };  \end{lstlisting}
> >>
> >> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the cumulative
> >> +distance
> >> +  between two devices before Peer-to-Peer(P2P) transaction. Request
> >> +data
> >> +  is \field{struct virtio_gpu_device_p2pdma_distance}. Response type
> >> +is
> >> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
> >> +  \field{virtio_gpu_resp_distance}.
> >> +
> >> +\begin{lstlisting}
> >> +struct virtio_gpu_device_p2pdma_distance {
> >> +        struct virtio_gpu_ctrl_hdr hdr;
> >> +        le32 provider_bus;
> >> +        le32 provider_slot;
> >> +        le32 provider_func;
> >> +        le32 client_bus;
> >> +        le32 client_slot;
> >> +        le32 client_func;
> >> +};
The requester device's PCI RID is anyway known to the responder. So no need to pass it here.
It is unclear if its provider or the client.

You only need to supply the other device RID.
In PCI spec the RID is 16-bit. Considering other advancements, RID of 32-bit would be ok too.

So please remove client_*.
Replace provider_X with le32 other_device_pci_RID.

Mention that other_device_RID follows PCI specification RID notation.

> >> +
> > I assume above bdf information is what is visible to the guest?
> > Why cannot it see this distance in the guest VM itself?
> 
> Guest can see virtual bdf information which is not the real bdf information of
> the host. So guest can't get the correct distance. That's why I need to
> implement a new command to get and pass the real distance.
> 
[..]

> 
> >
> >> +struct virtio_gpu_resp_distance {
> >> +        struct virtio_gpu_ctrl_hdr hdr;
> >> +        le32 distance;
> >> +}
> >> +\end{lstlisting}
> > I am not gpu expert so didn't pay attention to details.
> > I believe you are missing basic information in the patch is what is the unit of
> distance?
> > Can you please add?
> 
> The definition of distance is explained in kernel file:  driver/pci/p2pdma.c:
>
Description is good. Either it needs to be in the include/Linux/uapi/foo.h file and refer to it here.
Or please add the description as part of this patch.
Reference to kernel internal files can change and hence cannot be taken as reference for virtio net spec.
 
> /*
>  * Calculate the P2PDMA mapping type and distance between two PCI
> devices.
>  *
>  * If the two devices are the same PCI function, return
>  * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
>  *
>  * If they are two functions of the same device, return
>  * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up to the
> bridge,
>  * then one hop back down to another function of the same device).
>  *
>  * In the case where two devices are connected to the same PCIe switch,
>  * return a distance of 4. This corresponds to the following PCI tree:
>  *
>  *     -+  Root Port
>  *      \+ Switch Upstream Port
>  *       +-+ Switch Downstream Port 0
>  *       + \- Device A
>  *       \-+ Switch Downstream Port 1
>  *         \- Device B
>  *
>  * The distance is 4 because we traverse from Device A to Downstream Port 0
>  * to the common Switch Upstream Port, back down to Downstream Port 1
> and
>  * then to Device B. The mapping type returned depends on the ACS
>  * redirection setting of the ports along the path.
>  *
>  * If ACS redirect is set on any port in the path, traffic between the
>  * devices will go through the host bridge, so return
>  * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
>  * PCI_P2PDMA_MAP_BUS_ADDR.
>  *
>  * Any two devices that have a data path that goes through the host bridge
>  * will consult a whitelist. If the host bridge is in the whitelist, return
>  * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set to the
> number of
>  * ports per above. If the device is not in the whitelist, return
>  * PCI_P2PDMA_MAP_NOT_SUPPORTED.
>  */
> 
> I didn't change the unit, do you mean that I should add above information to
> this patch?
> 
Yes. please see above.

> Regards,
> Julia
> 
> >
> > Can you please wait for few days to raise the vote and sort out this detail?
> >
> >> +
> >> +The request data contains guest virtual pci notations of p2pdma
> >> +provider and client. The response contains the distance between
> >> +provider and client, which represents the compatibility of these two
> >> +PCI
> >> devices.
> >> +
> >>   \end{description}
> >>
> >>   \subsubsection{Device Operation: cursorq}\label{sec:Device Types /
> >> GPU Device / Device Operation / Device Operation: cursorq}
> >> --
> >> 2.34.1
> >>
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-06  4:29     ` Parav Pandit
@ 2025-03-06  7:16       ` Zhang, Julia
  2025-03-10  3:48         ` Parav Pandit
  0 siblings, 1 reply; 21+ messages in thread
From: Zhang, Julia @ 2025-03-06  7:16 UTC (permalink / raw)
  To: Parav Pandit, Zhang, Julia, virtio-comment@lists.linux.dev
  Cc: Michael S . Tsirkin, Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen



On 2025/3/6 12:29, Parav Pandit wrote:
>> From: Zhang, Julia <Julia.Zhang@amd.com>
>> Sent: Wednesday, March 5, 2025 4:12 PM
>>
>> On 2025/2/26 18:51, Parav Pandit wrote:
>>>
>>>> From: Julia Zhang <julia.zhang@amd.com>
>>>> Sent: Wednesday, February 26, 2025 2:58 PM
>>>> To: virtio-comment@lists.linux.dev
>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>; Huang
>> Rui
>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>; Matias
>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
>>>>
>>>> PCI peer-to-peer DMA transaction may be used in guest for some scenes.
>>>> For example, dGPU prime feature will let virtio-iGPU access to
>>>> passthrough dGPU buffer.
> Virtio specification does not describe dGPU and iGPU.
> So either describe it in the description or please drop that and explain the motivation without referring to those terms.

OK I see.

> 
>>>>
>>>> To support P2P DMA transaction in guest, virtio-gpu needs to check
>>>> the compatibility which is represented by p2pdma_distance. This
>>>> defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to allow guest
>> send
>>>> virtual pci notation of two devices to host and get host physical
>>>> p2pdma_distance of these two PCI devices.
>>>>
> [..]
>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
>>>> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>> ---
>>>> Hi all,
>>>>
>>>> We are trying to implement dGPU prime feature in guest which will let
>>>> virtio- iGPU read rendered data of passthrough dGPU. Before that,
>>>> virtio gpu driver needs to get p2pdma_distance to check if P2P DMA
>>>> transaction is possible or not.
>>>>
>>>> To implement getting p2pdma_distance, QEMU needs to handle the
>>>> command from guest with virtual pci notations of two PCI devices and
>>>> send it to host kernel and return host physical distance back to guest.
>>>>
>>>> So this defines the new command follow the suggestion in https:
>>>> //lore.kernel.org/all/20241207105537.542441-4-julia.zhang@amd.com/
>>>>
>>>> changes from v1:
>>>>    - add issue link to commit msg
>>>>
>>>> Regards,
>>>> Julia
>>>> ---
>>>>    device-types/gpu/description.tex | 29
>> +++++++++++++++++++++++++++++
>>>>    1 file changed, 29 insertions(+)
>>>>
>>>> diff --git a/device-types/gpu/description.tex b/device-
>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
>>>> --- a/device-types/gpu/description.tex
>>>> +++ b/device-types/gpu/description.tex
>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
>>>> header}\label{sec:Device Types / GPU De
>>>>            VIRTIO_GPU_CMD_SUBMIT_3D,
>>>>            VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>>>>            VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
>>>>
>>>>            /* cursor commands */
>>>>            VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6 +237,7
>> @@
>>>> \subsubsection{Device Operation: Request header}\label{sec:Device
>>>> Types / GPU De
>>>>            VIRTIO_GPU_RESP_OK_EDID,
>>>>            VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>>>>            VIRTIO_GPU_RESP_OK_MAP_INFO,
>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
>>>>
>>>>            /* error responses */
>>>>            VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6 +793,33 @@
>>>> \subsubsection{Device Operation: controlq (3d)}\label{sec:Device
>>>> Types / GPU Dev  };  \end{lstlisting}
>>>>
>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the cumulative
>>>> +distance
>>>> +  between two devices before Peer-to-Peer(P2P) transaction. Request
>>>> +data
>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}. Response type
>>>> +is
>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
>>>> +  \field{virtio_gpu_resp_distance}.
>>>> +
>>>> +\begin{lstlisting}
>>>> +struct virtio_gpu_device_p2pdma_distance {
>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>> +        le32 provider_bus;
>>>> +        le32 provider_slot;
>>>> +        le32 provider_func;
>>>> +        le32 client_bus;
>>>> +        le32 client_slot;
>>>> +        le32 client_func;
>>>> +};
> The requester device's PCI RID is anyway known to the responder. So no need to pass it here.
> It is unclear if its provider or the client.

I don't think requester device's PCI RID is known to the responder, 
since the existing function pci_p2pdma_distance() also require provider 
and client information.

Do you mean that I should consider virtio GPU as provider by default? 
Then this command will only work for this specific use case instead of 
being used as a common function.

> 
> You only need to supply the other device RID.
> In PCI spec the RID is 16-bit. Considering other advancements, RID of 32-bit would be ok too.
> 
> So please remove client_*.
> Replace provider_X with le32 other_device_pci_RID.
> 
> Mention that other_device_RID follows PCI specification RID notation.

OK I will modify it.

> 
>>>> +
>>> I assume above bdf information is what is visible to the guest?
>>> Why cannot it see this distance in the guest VM itself?
>>
>> Guest can see virtual bdf information which is not the real bdf information of
>> the host. So guest can't get the correct distance. That's why I need to
>> implement a new command to get and pass the real distance.
>>
> [..]
> 
>>
>>>
>>>> +struct virtio_gpu_resp_distance {
>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>> +        le32 distance;
>>>> +}
>>>> +\end{lstlisting}
>>> I am not gpu expert so didn't pay attention to details.
>>> I believe you are missing basic information in the patch is what is the unit of
>> distance?
>>> Can you please add?
>>
>> The definition of distance is explained in kernel file:  driver/pci/p2pdma.c:
>>
> Description is good. Either it needs to be in the include/Linux/uapi/foo.h file and refer to it here.
> Or please add the description as part of this patch.
> Reference to kernel internal files can change and hence cannot be taken as reference for virtio net spec.
>   
>> /*
>>   * Calculate the P2PDMA mapping type and distance between two PCI
>> devices.
>>   *
>>   * If the two devices are the same PCI function, return
>>   * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
>>   *
>>   * If they are two functions of the same device, return
>>   * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up to the
>> bridge,
>>   * then one hop back down to another function of the same device).
>>   *
>>   * In the case where two devices are connected to the same PCIe switch,
>>   * return a distance of 4. This corresponds to the following PCI tree:
>>   *
>>   *     -+  Root Port
>>   *      \+ Switch Upstream Port
>>   *       +-+ Switch Downstream Port 0
>>   *       + \- Device A
>>   *       \-+ Switch Downstream Port 1
>>   *         \- Device B
>>   *
>>   * The distance is 4 because we traverse from Device A to Downstream Port 0
>>   * to the common Switch Upstream Port, back down to Downstream Port 1
>> and
>>   * then to Device B. The mapping type returned depends on the ACS
>>   * redirection setting of the ports along the path.
>>   *
>>   * If ACS redirect is set on any port in the path, traffic between the
>>   * devices will go through the host bridge, so return
>>   * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
>>   * PCI_P2PDMA_MAP_BUS_ADDR.
>>   *
>>   * Any two devices that have a data path that goes through the host bridge
>>   * will consult a whitelist. If the host bridge is in the whitelist, return
>>   * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set to the
>> number of
>>   * ports per above. If the device is not in the whitelist, return
>>   * PCI_P2PDMA_MAP_NOT_SUPPORTED.
>>   */
>>
>> I didn't change the unit, do you mean that I should add above information to
>> this patch?
>>
> Yes. please see above.

OK, will do.

Julia

> 
>> Regards,
>> Julia
>>
>>>
>>> Can you please wait for few days to raise the vote and sort out this detail?
>>>
>>>> +
>>>> +The request data contains guest virtual pci notations of p2pdma
>>>> +provider and client. The response contains the distance between
>>>> +provider and client, which represents the compatibility of these two
>>>> +PCI
>>>> devices.
>>>> +
>>>>    \end{description}
>>>>
>>>>    \subsubsection{Device Operation: cursorq}\label{sec:Device Types /
>>>> GPU Device / Device Operation / Device Operation: cursorq}
>>>> --
>>>> 2.34.1
>>>>
>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-06  7:16       ` Zhang, Julia
@ 2025-03-10  3:48         ` Parav Pandit
  2025-03-10  7:37           ` Zhang, Julia
  0 siblings, 1 reply; 21+ messages in thread
From: Parav Pandit @ 2025-03-10  3:48 UTC (permalink / raw)
  To: Zhang, Julia, virtio-comment@lists.linux.dev
  Cc: Michael S . Tsirkin, Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen



> From: Zhang, Julia <Julia.Zhang@amd.com>
> Sent: Thursday, March 6, 2025 12:47 PM
> 
> On 2025/3/6 12:29, Parav Pandit wrote:
> >> From: Zhang, Julia <Julia.Zhang@amd.com>
> >> Sent: Wednesday, March 5, 2025 4:12 PM
> >>
> >> On 2025/2/26 18:51, Parav Pandit wrote:
> >>>
> >>>> From: Julia Zhang <julia.zhang@amd.com>
> >>>> Sent: Wednesday, February 26, 2025 2:58 PM
> >>>> To: virtio-comment@lists.linux.dev
> >>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
> >>>> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>;
> Huang
> >> Rui
> >>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>; Matias
> >>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
> >>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
> >>>>
> >>>> PCI peer-to-peer DMA transaction may be used in guest for some
> scenes.
> >>>> For example, dGPU prime feature will let virtio-iGPU access to
> >>>> passthrough dGPU buffer.
> > Virtio specification does not describe dGPU and iGPU.
> > So either describe it in the description or please drop that and explain the
> motivation without referring to those terms.
> 
> OK I see.
> 
> >
> >>>>
> >>>> To support P2P DMA transaction in guest, virtio-gpu needs to check
> >>>> the compatibility which is represented by p2pdma_distance. This
> >>>> defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to allow
> guest
> >> send
> >>>> virtual pci notation of two devices to host and get host physical
> >>>> p2pdma_distance of these two PCI devices.
> >>>>
> > [..]
> >>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
> >>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
> >>>> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
> >>>> ---
> >>>> Hi all,
> >>>>
> >>>> We are trying to implement dGPU prime feature in guest which will
> >>>> let
> >>>> virtio- iGPU read rendered data of passthrough dGPU. Before that,
> >>>> virtio gpu driver needs to get p2pdma_distance to check if P2P DMA
> >>>> transaction is possible or not.
> >>>>
> >>>> To implement getting p2pdma_distance, QEMU needs to handle the
> >>>> command from guest with virtual pci notations of two PCI devices
> >>>> and send it to host kernel and return host physical distance back to
> guest.
> >>>>
> >>>> So this defines the new command follow the suggestion in https:
> >>>> //lore.kernel.org/all/20241207105537.542441-4-julia.zhang@amd.com/
> >>>>
> >>>> changes from v1:
> >>>>    - add issue link to commit msg
> >>>>
> >>>> Regards,
> >>>> Julia
> >>>> ---
> >>>>    device-types/gpu/description.tex | 29
> >> +++++++++++++++++++++++++++++
> >>>>    1 file changed, 29 insertions(+)
> >>>>
> >>>> diff --git a/device-types/gpu/description.tex b/device-
> >>>> types/gpu/description.tex index 4435248..9d0f30b 100644
> >>>> --- a/device-types/gpu/description.tex
> >>>> +++ b/device-types/gpu/description.tex
> >>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
> >>>> header}\label{sec:Device Types / GPU De
> >>>>            VIRTIO_GPU_CMD_SUBMIT_3D,
> >>>>            VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
> >>>>            VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
> >>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
> >>>>
> >>>>            /* cursor commands */
> >>>>            VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6
> +237,7
> >> @@
> >>>> \subsubsection{Device Operation: Request header}\label{sec:Device
> >>>> Types / GPU De
> >>>>            VIRTIO_GPU_RESP_OK_EDID,
> >>>>            VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
> >>>>            VIRTIO_GPU_RESP_OK_MAP_INFO,
> >>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
> >>>>
> >>>>            /* error responses */
> >>>>            VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6 +793,33
> >>>> @@ \subsubsection{Device Operation: controlq (3d)}\label{sec:Device
> >>>> Types / GPU Dev  };  \end{lstlisting}
> >>>>
> >>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
> cumulative
> >>>> +distance
> >>>> +  between two devices before Peer-to-Peer(P2P) transaction.
> >>>> +Request data
> >>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}. Response
> >>>> +type is
> >>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
> >>>> +  \field{virtio_gpu_resp_distance}.
> >>>> +
> >>>> +\begin{lstlisting}
> >>>> +struct virtio_gpu_device_p2pdma_distance {
> >>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>> +        le32 provider_bus;
> >>>> +        le32 provider_slot;
> >>>> +        le32 provider_func;
> >>>> +        le32 client_bus;
> >>>> +        le32 client_slot;
> >>>> +        le32 client_func;
> >>>> +};
> > The requester device's PCI RID is anyway known to the responder. So no
> need to pass it here.
> > It is unclear if its provider or the client.
> 
> I don't think requester device's PCI RID is known to the responder, since the
> existing function pci_p2pdma_distance() also require provider and client
> information.
>
The existing function takes the provider pci device as input at C code level.
And proposed extension (without provider BDF) will also operate on the provider device on which the request/response transaction is happening.

> Do you mean that I should consider virtio GPU as provider by default?
Yes.

> Then this command will only work for this specific use case instead of being
> used as a common function.
> 
Linux kernel API is operating on the device. Similarly proposed extension will also be on the device level.
If needed, an higher-level software will iterate over the interesting PCI devices and invoke the pci_p2pdma_distance() API.
> >
> > You only need to supply the other device RID.
> > In PCI spec the RID is 16-bit. Considering other advancements, RID of 32-bit
> would be ok too.
> >
> > So please remove client_*.
> > Replace provider_X with le32 other_device_pci_RID.
> >
> > Mention that other_device_RID follows PCI specification RID notation.
> 
> OK I will modify it.
> 
> >
> >>>> +
> >>> I assume above bdf information is what is visible to the guest?
> >>> Why cannot it see this distance in the guest VM itself?
> >>
> >> Guest can see virtual bdf information which is not the real bdf
> >> information of the host. So guest can't get the correct distance.
> >> That's why I need to implement a new command to get and pass the real
> distance.
> >>
> > [..]
> >
> >>
> >>>
> >>>> +struct virtio_gpu_resp_distance {
> >>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>> +        le32 distance;
> >>>> +}
> >>>> +\end{lstlisting}
> >>> I am not gpu expert so didn't pay attention to details.
> >>> I believe you are missing basic information in the patch is what is
> >>> the unit of
> >> distance?
> >>> Can you please add?
> >>
> >> The definition of distance is explained in kernel file:  driver/pci/p2pdma.c:
> >>
> > Description is good. Either it needs to be in the include/Linux/uapi/foo.h file
> and refer to it here.
> > Or please add the description as part of this patch.
> > Reference to kernel internal files can change and hence cannot be taken as
> reference for virtio net spec.
> >
> >> /*
> >>   * Calculate the P2PDMA mapping type and distance between two PCI
> >> devices.
> >>   *
> >>   * If the two devices are the same PCI function, return
> >>   * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
> >>   *
> >>   * If they are two functions of the same device, return
> >>   * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up to the
> >> bridge,
> >>   * then one hop back down to another function of the same device).
> >>   *
> >>   * In the case where two devices are connected to the same PCIe switch,
> >>   * return a distance of 4. This corresponds to the following PCI tree:
> >>   *
> >>   *     -+  Root Port
> >>   *      \+ Switch Upstream Port
> >>   *       +-+ Switch Downstream Port 0
> >>   *       + \- Device A
> >>   *       \-+ Switch Downstream Port 1
> >>   *         \- Device B
> >>   *
> >>   * The distance is 4 because we traverse from Device A to Downstream
> Port 0
> >>   * to the common Switch Upstream Port, back down to Downstream Port
> >> 1 and
> >>   * then to Device B. The mapping type returned depends on the ACS
> >>   * redirection setting of the ports along the path.
> >>   *
> >>   * If ACS redirect is set on any port in the path, traffic between the
> >>   * devices will go through the host bridge, so return
> >>   * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
> >>   * PCI_P2PDMA_MAP_BUS_ADDR.
> >>   *
> >>   * Any two devices that have a data path that goes through the host bridge
> >>   * will consult a whitelist. If the host bridge is in the whitelist, return
> >>   * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set to the
> >> number of
> >>   * ports per above. If the device is not in the whitelist, return
> >>   * PCI_P2PDMA_MAP_NOT_SUPPORTED.
> >>   */
> >>
> >> I didn't change the unit, do you mean that I should add above
> >> information to this patch?
> >>
> > Yes. please see above.
> 
> OK, will do.
> 
> Julia
> 
> >
> >> Regards,
> >> Julia
> >>
> >>>
> >>> Can you please wait for few days to raise the vote and sort out this detail?
> >>>
> >>>> +
> >>>> +The request data contains guest virtual pci notations of p2pdma
> >>>> +provider and client. The response contains the distance between
> >>>> +provider and client, which represents the compatibility of these
> >>>> +two PCI
> >>>> devices.
> >>>> +
> >>>>    \end{description}
> >>>>
> >>>>    \subsubsection{Device Operation: cursorq}\label{sec:Device Types
> >>>> / GPU Device / Device Operation / Device Operation: cursorq}
> >>>> --
> >>>> 2.34.1
> >>>>
> >>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-10  3:48         ` Parav Pandit
@ 2025-03-10  7:37           ` Zhang, Julia
  2025-03-10 10:18             ` Parav Pandit
  0 siblings, 1 reply; 21+ messages in thread
From: Zhang, Julia @ 2025-03-10  7:37 UTC (permalink / raw)
  To: Parav Pandit, Zhang, Julia, virtio-comment@lists.linux.dev
  Cc: Michael S . Tsirkin, Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen



On 2025/3/10 11:48, Parav Pandit wrote:
> 
> 
>> From: Zhang, Julia <Julia.Zhang@amd.com>
>> Sent: Thursday, March 6, 2025 12:47 PM
>>
>> On 2025/3/6 12:29, Parav Pandit wrote:
>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>> Sent: Wednesday, March 5, 2025 4:12 PM
>>>>
>>>> On 2025/2/26 18:51, Parav Pandit wrote:
>>>>>
>>>>>> From: Julia Zhang <julia.zhang@amd.com>
>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
>>>>>> To: virtio-comment@lists.linux.dev
>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>;
>> Huang
>>>> Rui
>>>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>; Matias
>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
>>>>>>
>>>>>> PCI peer-to-peer DMA transaction may be used in guest for some
>> scenes.
>>>>>> For example, dGPU prime feature will let virtio-iGPU access to
>>>>>> passthrough dGPU buffer.
>>> Virtio specification does not describe dGPU and iGPU.
>>> So either describe it in the description or please drop that and explain the
>> motivation without referring to those terms.
>>
>> OK I see.
>>
>>>
>>>>>>
>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs to check
>>>>>> the compatibility which is represented by p2pdma_distance. This
>>>>>> defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to allow
>> guest
>>>> send
>>>>>> virtual pci notation of two devices to host and get host physical
>>>>>> p2pdma_distance of these two PCI devices.
>>>>>>
>>> [..]
>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>> ---
>>>>>> Hi all,
>>>>>>
>>>>>> We are trying to implement dGPU prime feature in guest which will
>>>>>> let
>>>>>> virtio- iGPU read rendered data of passthrough dGPU. Before that,
>>>>>> virtio gpu driver needs to get p2pdma_distance to check if P2P DMA
>>>>>> transaction is possible or not.
>>>>>>
>>>>>> To implement getting p2pdma_distance, QEMU needs to handle the
>>>>>> command from guest with virtual pci notations of two PCI devices
>>>>>> and send it to host kernel and return host physical distance back to
>> guest.
>>>>>>
>>>>>> So this defines the new command follow the suggestion in https:
>>>>>> //lore.kernel.org/all/20241207105537.542441-4-julia.zhang@amd.com/
>>>>>>
>>>>>> changes from v1:
>>>>>>     - add issue link to commit msg
>>>>>>
>>>>>> Regards,
>>>>>> Julia
>>>>>> ---
>>>>>>     device-types/gpu/description.tex | 29
>>>> +++++++++++++++++++++++++++++
>>>>>>     1 file changed, 29 insertions(+)
>>>>>>
>>>>>> diff --git a/device-types/gpu/description.tex b/device-
>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
>>>>>> --- a/device-types/gpu/description.tex
>>>>>> +++ b/device-types/gpu/description.tex
>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>             VIRTIO_GPU_CMD_SUBMIT_3D,
>>>>>>             VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>>>>>>             VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
>>>>>>
>>>>>>             /* cursor commands */
>>>>>>             VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6
>> +237,7
>>>> @@
>>>>>> \subsubsection{Device Operation: Request header}\label{sec:Device
>>>>>> Types / GPU De
>>>>>>             VIRTIO_GPU_RESP_OK_EDID,
>>>>>>             VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>>>>>>             VIRTIO_GPU_RESP_OK_MAP_INFO,
>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
>>>>>>
>>>>>>             /* error responses */
>>>>>>             VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6 +793,33
>>>>>> @@ \subsubsection{Device Operation: controlq (3d)}\label{sec:Device
>>>>>> Types / GPU Dev  };  \end{lstlisting}
>>>>>>
>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
>> cumulative
>>>>>> +distance
>>>>>> +  between two devices before Peer-to-Peer(P2P) transaction.
>>>>>> +Request data
>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}. Response
>>>>>> +type is
>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
>>>>>> +  \field{virtio_gpu_resp_distance}.
>>>>>> +
>>>>>> +\begin{lstlisting}
>>>>>> +struct virtio_gpu_device_p2pdma_distance {
>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>> +        le32 provider_bus;
>>>>>> +        le32 provider_slot;
>>>>>> +        le32 provider_func;
>>>>>> +        le32 client_bus;
>>>>>> +        le32 client_slot;
>>>>>> +        le32 client_func;
>>>>>> +};
>>> The requester device's PCI RID is anyway known to the responder. So no
>> need to pass it here.
>>> It is unclear if its provider or the client.
>>
>> I don't think requester device's PCI RID is known to the responder, since the
>> existing function pci_p2pdma_distance() also require provider and client
>> information.
>>
> The existing function takes the provider pci device as input at C code level.
> And proposed extension (without provider BDF) will also operate on the provider device on which the request/response transaction is happening.

Well at first, I just want virtio GPU to handle the command to get 
p2pdma_distance of two PCI devices(maybe none of them is virtio GPU). So 
when guest needs to calculate distance of any two PCI devices, it can 
reuse this command. But indeed, we only have this one use case at this 
moment, I guess considering virtio GPU as provider and remove provider 
PCI RID of this struct is fine.

But there's another problem: how does host kernel get virtio GPU RID and 
calculate distance between virtio GPU and dGPU? If I remove provider 
information of this struct, then QEMU still needs to pass virtio GPU RID 
to host kernel side.

> 
>> Do you mean that I should consider virtio GPU as provider by default?
> Yes.
> 
>> Then this command will only work for this specific use case instead of being
>> used as a common function.
>>
> Linux kernel API is operating on the device. Similarly proposed extension will also be on the device level.
> If needed, an higher-level software will iterate over the interesting PCI devices and invoke the pci_p2pdma_distance() API.

The reason of passing this command from guest to host is that in our 
case, we need to calculate p2pdma_distance of virtio GPU and dGPU on 
host side. So my original design is quiet simple:
1. Pass rid of virtio GPU(provider) and dGPU(client) to host side
2. Host kernel calculate real p2pdma_distance of these devices.
3. Return p2pdma_distance to guest side.

And I think the higher-level software should know the exact RID of two 
devices and then invoke pci_p2pdma_distance() to calculate distance. So 
I still think we should keep provider information in this struct and 
following the existing logic of getting p2pdma_distance.

Regards,
Julia

>>>
>>> You only need to supply the other device RID.
>>> In PCI spec the RID is 16-bit. Considering other advancements, RID of 32-bit
>> would be ok too.
>>>
>>> So please remove client_*.
>>> Replace provider_X with le32 other_device_pci_RID.
>>>
>>> Mention that other_device_RID follows PCI specification RID notation.
>>
>> OK I will modify it.
>>
>>>
>>>>>> +
>>>>> I assume above bdf information is what is visible to the guest?
>>>>> Why cannot it see this distance in the guest VM itself?
>>>>
>>>> Guest can see virtual bdf information which is not the real bdf
>>>> information of the host. So guest can't get the correct distance.
>>>> That's why I need to implement a new command to get and pass the real
>> distance.
>>>>
>>> [..]
>>>
>>>>
>>>>>
>>>>>> +struct virtio_gpu_resp_distance {
>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>> +        le32 distance;
>>>>>> +}
>>>>>> +\end{lstlisting}
>>>>> I am not gpu expert so didn't pay attention to details.
>>>>> I believe you are missing basic information in the patch is what is
>>>>> the unit of
>>>> distance?
>>>>> Can you please add?
>>>>
>>>> The definition of distance is explained in kernel file:  driver/pci/p2pdma.c:
>>>>
>>> Description is good. Either it needs to be in the include/Linux/uapi/foo.h file
>> and refer to it here.
>>> Or please add the description as part of this patch.
>>> Reference to kernel internal files can change and hence cannot be taken as
>> reference for virtio net spec.
>>>
>>>> /*
>>>>    * Calculate the P2PDMA mapping type and distance between two PCI
>>>> devices.
>>>>    *
>>>>    * If the two devices are the same PCI function, return
>>>>    * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
>>>>    *
>>>>    * If they are two functions of the same device, return
>>>>    * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up to the
>>>> bridge,
>>>>    * then one hop back down to another function of the same device).
>>>>    *
>>>>    * In the case where two devices are connected to the same PCIe switch,
>>>>    * return a distance of 4. This corresponds to the following PCI tree:
>>>>    *
>>>>    *     -+  Root Port
>>>>    *      \+ Switch Upstream Port
>>>>    *       +-+ Switch Downstream Port 0
>>>>    *       + \- Device A
>>>>    *       \-+ Switch Downstream Port 1
>>>>    *         \- Device B
>>>>    *
>>>>    * The distance is 4 because we traverse from Device A to Downstream
>> Port 0
>>>>    * to the common Switch Upstream Port, back down to Downstream Port
>>>> 1 and
>>>>    * then to Device B. The mapping type returned depends on the ACS
>>>>    * redirection setting of the ports along the path.
>>>>    *
>>>>    * If ACS redirect is set on any port in the path, traffic between the
>>>>    * devices will go through the host bridge, so return
>>>>    * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
>>>>    * PCI_P2PDMA_MAP_BUS_ADDR.
>>>>    *
>>>>    * Any two devices that have a data path that goes through the host bridge
>>>>    * will consult a whitelist. If the host bridge is in the whitelist, return
>>>>    * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set to the
>>>> number of
>>>>    * ports per above. If the device is not in the whitelist, return
>>>>    * PCI_P2PDMA_MAP_NOT_SUPPORTED.
>>>>    */
>>>>
>>>> I didn't change the unit, do you mean that I should add above
>>>> information to this patch?
>>>>
>>> Yes. please see above.
>>
>> OK, will do.
>>
>> Julia
>>
>>>
>>>> Regards,
>>>> Julia
>>>>
>>>>>
>>>>> Can you please wait for few days to raise the vote and sort out this detail?
>>>>>
>>>>>> +
>>>>>> +The request data contains guest virtual pci notations of p2pdma
>>>>>> +provider and client. The response contains the distance between
>>>>>> +provider and client, which represents the compatibility of these
>>>>>> +two PCI
>>>>>> devices.
>>>>>> +
>>>>>>     \end{description}
>>>>>>
>>>>>>     \subsubsection{Device Operation: cursorq}\label{sec:Device Types
>>>>>> / GPU Device / Device Operation / Device Operation: cursorq}
>>>>>> --
>>>>>> 2.34.1
>>>>>>
>>>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-10  7:37           ` Zhang, Julia
@ 2025-03-10 10:18             ` Parav Pandit
  2025-03-11  3:27               ` Zhang, Julia
  0 siblings, 1 reply; 21+ messages in thread
From: Parav Pandit @ 2025-03-10 10:18 UTC (permalink / raw)
  To: Zhang, Julia, virtio-comment@lists.linux.dev
  Cc: Michael S . Tsirkin, Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen


> From: Zhang, Julia <Julia.Zhang@amd.com>
> Sent: Monday, March 10, 2025 1:07 PM
> 
> On 2025/3/10 11:48, Parav Pandit wrote:
> >
> >
> >> From: Zhang, Julia <Julia.Zhang@amd.com>
> >> Sent: Thursday, March 6, 2025 12:47 PM
> >>
> >> On 2025/3/6 12:29, Parav Pandit wrote:
> >>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>> Sent: Wednesday, March 5, 2025 4:12 PM
> >>>>
> >>>> On 2025/2/26 18:51, Parav Pandit wrote:
> >>>>>
> >>>>>> From: Julia Zhang <julia.zhang@amd.com>
> >>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
> >>>>>> To: virtio-comment@lists.linux.dev
> >>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
> >>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>;
> >> Huang
> >>>> Rui
> >>>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>; Matias
> >>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
> >>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
> >>>>>>
> >>>>>> PCI peer-to-peer DMA transaction may be used in guest for some
> >> scenes.
> >>>>>> For example, dGPU prime feature will let virtio-iGPU access to
> >>>>>> passthrough dGPU buffer.
> >>> Virtio specification does not describe dGPU and iGPU.
> >>> So either describe it in the description or please drop that and
> >>> explain the
> >> motivation without referring to those terms.
> >>
> >> OK I see.
> >>
> >>>
> >>>>>>
> >>>>>> To support P2P DMA transaction in guest, virtio-gpu needs to
> >>>>>> check the compatibility which is represented by p2pdma_distance.
> >>>>>> This defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to
> allow
> >> guest
> >>>> send
> >>>>>> virtual pci notation of two devices to host and get host physical
> >>>>>> p2pdma_distance of these two PCI devices.
> >>>>>>
> >>> [..]
> >>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
> >>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
> >>>>>> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
> >>>>>> ---
> >>>>>> Hi all,
> >>>>>>
> >>>>>> We are trying to implement dGPU prime feature in guest which will
> >>>>>> let
> >>>>>> virtio- iGPU read rendered data of passthrough dGPU. Before that,
> >>>>>> virtio gpu driver needs to get p2pdma_distance to check if P2P
> >>>>>> DMA transaction is possible or not.
> >>>>>>
> >>>>>> To implement getting p2pdma_distance, QEMU needs to handle the
> >>>>>> command from guest with virtual pci notations of two PCI devices
> >>>>>> and send it to host kernel and return host physical distance back
> >>>>>> to
> >> guest.
> >>>>>>
> >>>>>> So this defines the new command follow the suggestion in https:
> >>>>>> //lore.kernel.org/all/20241207105537.542441-4-
> julia.zhang@amd.com
> >>>>>> /
> >>>>>>
> >>>>>> changes from v1:
> >>>>>>     - add issue link to commit msg
> >>>>>>
> >>>>>> Regards,
> >>>>>> Julia
> >>>>>> ---
> >>>>>>     device-types/gpu/description.tex | 29
> >>>> +++++++++++++++++++++++++++++
> >>>>>>     1 file changed, 29 insertions(+)
> >>>>>>
> >>>>>> diff --git a/device-types/gpu/description.tex b/device-
> >>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
> >>>>>> --- a/device-types/gpu/description.tex
> >>>>>> +++ b/device-types/gpu/description.tex
> >>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
> >>>>>> header}\label{sec:Device Types / GPU De
> >>>>>>             VIRTIO_GPU_CMD_SUBMIT_3D,
> >>>>>>             VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
> >>>>>>             VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
> >>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
> >>>>>>
> >>>>>>             /* cursor commands */
> >>>>>>             VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6
> >> +237,7
> >>>> @@
> >>>>>> \subsubsection{Device Operation: Request header}\label{sec:Device
> >>>>>> Types / GPU De
> >>>>>>             VIRTIO_GPU_RESP_OK_EDID,
> >>>>>>             VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
> >>>>>>             VIRTIO_GPU_RESP_OK_MAP_INFO,
> >>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
> >>>>>>
> >>>>>>             /* error responses */
> >>>>>>             VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6
> >>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
> >>>>>> (3d)}\label{sec:Device Types / GPU Dev  };  \end{lstlisting}
> >>>>>>
> >>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
> >> cumulative
> >>>>>> +distance
> >>>>>> +  between two devices before Peer-to-Peer(P2P) transaction.
> >>>>>> +Request data
> >>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}. Response
> >>>>>> +type is
> >>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
> >>>>>> +  \field{virtio_gpu_resp_distance}.
> >>>>>> +
> >>>>>> +\begin{lstlisting}
> >>>>>> +struct virtio_gpu_device_p2pdma_distance {
> >>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>>>> +        le32 provider_bus;
> >>>>>> +        le32 provider_slot;
> >>>>>> +        le32 provider_func;
> >>>>>> +        le32 client_bus;
> >>>>>> +        le32 client_slot;
> >>>>>> +        le32 client_func;
> >>>>>> +};
> >>> The requester device's PCI RID is anyway known to the responder. So
> >>> no
> >> need to pass it here.
> >>> It is unclear if its provider or the client.
> >>
> >> I don't think requester device's PCI RID is known to the responder,
> >> since the existing function pci_p2pdma_distance() also require
> >> provider and client information.
> >>
> > The existing function takes the provider pci device as input at C code level.
> > And proposed extension (without provider BDF) will also operate on the
> provider device on which the request/response transaction is happening.
> 
> Well at first, I just want virtio GPU to handle the command to get
> p2pdma_distance of two PCI devices(maybe none of them is virtio GPU). 
Using virtio gpu to find the distance between two arbitrary devices which are not virtio is certainly outside the scope and looks wrong.
It should be done at some PCI topology level who exposed this devices.

> So
> when guest needs to calculate distance of any two PCI devices, it can reuse
> this command. But indeed, we only have this one use case at this moment, I
> guess considering virtio GPU as provider and remove provider PCI RID of this
> struct is fine.
Right. A driver can issue the command to device without mentioning its own device identifier.

> 
> But there's another problem: how does host kernel get virtio GPU RID and
> calculate distance between virtio GPU and dGPU? If I remove provider
> information of this struct, then QEMU still needs to pass virtio GPU RID to
> host kernel side.
The device implementation side can get to know about this as it has plugged in and built the topology.
The RID we are discussing are the vRID seen by the guest driver.
I probably missed to clarify this.
> 
> >
> >> Do you mean that I should consider virtio GPU as provider by default?
> > Yes.
> >
> >> Then this command will only work for this specific use case instead of
> being
> >> used as a common function.
> >>
> > Linux kernel API is operating on the device. Similarly proposed extension
> will also be on the device level.
> > If needed, an higher-level software will iterate over the interesting PCI
> devices and invoke the pci_p2pdma_distance() API.
> 
> The reason of passing this command from guest to host is that in our
> case, we need to calculate p2pdma_distance of virtio GPU and dGPU on
> host side. So my original design is quiet simple:
> 1. Pass rid of virtio GPU(provider) and dGPU(client) to host side
> 2. Host kernel calculate real p2pdma_distance of these devices.
> 3. Return p2pdma_distance to guest side.
> 
The virtio GPU is asking information about some other non virtio device in the system does not look correct to me.
I believe this is the PCIe topology level should be able to provide the info without encapsulating this via some arbitrary device type like virtio gpu.

> And I think the higher-level software should know the exact RID of two
> devices and then invoke pci_p2pdma_distance() to calculate distance. 
Not sure. if this can be discovered in physical level, it can be discovered at virtual guest VM level too as long as QEMU builds it in right way.

> So
> I still think we should keep provider information in this struct and
> following the existing logic of getting p2pdma_distance.
> 
> Regards,
> Julia
> 
> >>>
> >>> You only need to supply the other device RID.
> >>> In PCI spec the RID is 16-bit. Considering other advancements, RID of 32-
> bit
> >> would be ok too.
> >>>
> >>> So please remove client_*.
> >>> Replace provider_X with le32 other_device_pci_RID.
> >>>
> >>> Mention that other_device_RID follows PCI specification RID notation.
> >>
> >> OK I will modify it.
> >>
> >>>
> >>>>>> +
> >>>>> I assume above bdf information is what is visible to the guest?
> >>>>> Why cannot it see this distance in the guest VM itself?
> >>>>
> >>>> Guest can see virtual bdf information which is not the real bdf
> >>>> information of the host. So guest can't get the correct distance.
> >>>> That's why I need to implement a new command to get and pass the
> real
> >> distance.
> >>>>
> >>> [..]
> >>>
> >>>>
> >>>>>
> >>>>>> +struct virtio_gpu_resp_distance {
> >>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>>>> +        le32 distance;
> >>>>>> +}
> >>>>>> +\end{lstlisting}
> >>>>> I am not gpu expert so didn't pay attention to details.
> >>>>> I believe you are missing basic information in the patch is what is
> >>>>> the unit of
> >>>> distance?
> >>>>> Can you please add?
> >>>>
> >>>> The definition of distance is explained in kernel file:
> driver/pci/p2pdma.c:
> >>>>
> >>> Description is good. Either it needs to be in the include/Linux/uapi/foo.h
> file
> >> and refer to it here.
> >>> Or please add the description as part of this patch.
> >>> Reference to kernel internal files can change and hence cannot be taken
> as
> >> reference for virtio net spec.
> >>>
> >>>> /*
> >>>>    * Calculate the P2PDMA mapping type and distance between two PCI
> >>>> devices.
> >>>>    *
> >>>>    * If the two devices are the same PCI function, return
> >>>>    * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
> >>>>    *
> >>>>    * If they are two functions of the same device, return
> >>>>    * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up to
> the
> >>>> bridge,
> >>>>    * then one hop back down to another function of the same device).
> >>>>    *
> >>>>    * In the case where two devices are connected to the same PCIe
> switch,
> >>>>    * return a distance of 4. This corresponds to the following PCI tree:
> >>>>    *
> >>>>    *     -+  Root Port
> >>>>    *      \+ Switch Upstream Port
> >>>>    *       +-+ Switch Downstream Port 0
> >>>>    *       + \- Device A
> >>>>    *       \-+ Switch Downstream Port 1
> >>>>    *         \- Device B
> >>>>    *
> >>>>    * The distance is 4 because we traverse from Device A to Downstream
> >> Port 0
> >>>>    * to the common Switch Upstream Port, back down to Downstream
> Port
> >>>> 1 and
> >>>>    * then to Device B. The mapping type returned depends on the ACS
> >>>>    * redirection setting of the ports along the path.
> >>>>    *
> >>>>    * If ACS redirect is set on any port in the path, traffic between the
> >>>>    * devices will go through the host bridge, so return
> >>>>    * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
> >>>>    * PCI_P2PDMA_MAP_BUS_ADDR.
> >>>>    *
> >>>>    * Any two devices that have a data path that goes through the host
> bridge
> >>>>    * will consult a whitelist. If the host bridge is in the whitelist, return
> >>>>    * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set to
> the
> >>>> number of
> >>>>    * ports per above. If the device is not in the whitelist, return
> >>>>    * PCI_P2PDMA_MAP_NOT_SUPPORTED.
> >>>>    */
> >>>>
> >>>> I didn't change the unit, do you mean that I should add above
> >>>> information to this patch?
> >>>>
> >>> Yes. please see above.
> >>
> >> OK, will do.
> >>
> >> Julia
> >>
> >>>
> >>>> Regards,
> >>>> Julia
> >>>>
> >>>>>
> >>>>> Can you please wait for few days to raise the vote and sort out this
> detail?
> >>>>>
> >>>>>> +
> >>>>>> +The request data contains guest virtual pci notations of p2pdma
> >>>>>> +provider and client. The response contains the distance between
> >>>>>> +provider and client, which represents the compatibility of these
> >>>>>> +two PCI
> >>>>>> devices.
> >>>>>> +
> >>>>>>     \end{description}
> >>>>>>
> >>>>>>     \subsubsection{Device Operation: cursorq}\label{sec:Device Types
> >>>>>> / GPU Device / Device Operation / Device Operation: cursorq}
> >>>>>> --
> >>>>>> 2.34.1
> >>>>>>
> >>>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-10 10:18             ` Parav Pandit
@ 2025-03-11  3:27               ` Zhang, Julia
  2025-03-11  4:00                 ` Parav Pandit
  0 siblings, 1 reply; 21+ messages in thread
From: Zhang, Julia @ 2025-03-11  3:27 UTC (permalink / raw)
  To: Parav Pandit, Zhang, Julia, virtio-comment@lists.linux.dev,
	Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen



On 2025/3/10 18:18, Parav Pandit wrote:
> 
>> From: Zhang, Julia <Julia.Zhang@amd.com>
>> Sent: Monday, March 10, 2025 1:07 PM
>>
>> On 2025/3/10 11:48, Parav Pandit wrote:
>>>
>>>
>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>> Sent: Thursday, March 6, 2025 12:47 PM
>>>>
>>>> On 2025/3/6 12:29, Parav Pandit wrote:
>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
>>>>>>
>>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
>>>>>>>
>>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
>>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
>>>>>>>> To: virtio-comment@lists.linux.dev
>>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
>>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>;
>>>> Huang
>>>>>> Rui
>>>>>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>; Matias
>>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
>>>>>>>>
>>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for some
>>>> scenes.
>>>>>>>> For example, dGPU prime feature will let virtio-iGPU access to
>>>>>>>> passthrough dGPU buffer.
>>>>> Virtio specification does not describe dGPU and iGPU.
>>>>> So either describe it in the description or please drop that and
>>>>> explain the
>>>> motivation without referring to those terms.
>>>>
>>>> OK I see.
>>>>
>>>>>
>>>>>>>>
>>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs to
>>>>>>>> check the compatibility which is represented by p2pdma_distance.
>>>>>>>> This defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to
>> allow
>>>> guest
>>>>>> send
>>>>>>>> virtual pci notation of two devices to host and get host physical
>>>>>>>> p2pdma_distance of these two PCI devices.
>>>>>>>>
>>>>> [..]
>>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
>>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
>>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>>>> ---
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We are trying to implement dGPU prime feature in guest which will
>>>>>>>> let
>>>>>>>> virtio- iGPU read rendered data of passthrough dGPU. Before that,
>>>>>>>> virtio gpu driver needs to get p2pdma_distance to check if P2P
>>>>>>>> DMA transaction is possible or not.
>>>>>>>>
>>>>>>>> To implement getting p2pdma_distance, QEMU needs to handle the
>>>>>>>> command from guest with virtual pci notations of two PCI devices
>>>>>>>> and send it to host kernel and return host physical distance back
>>>>>>>> to
>>>> guest.
>>>>>>>>
>>>>>>>> So this defines the new command follow the suggestion in https:
>>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
>> julia.zhang@amd.com
>>>>>>>> /
>>>>>>>>
>>>>>>>> changes from v1:
>>>>>>>>      - add issue link to commit msg
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Julia
>>>>>>>> ---
>>>>>>>>      device-types/gpu/description.tex | 29
>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>      1 file changed, 29 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
>>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
>>>>>>>> --- a/device-types/gpu/description.tex
>>>>>>>> +++ b/device-types/gpu/description.tex
>>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>              VIRTIO_GPU_CMD_SUBMIT_3D,
>>>>>>>>              VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>>>>>>>>              VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
>>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
>>>>>>>>
>>>>>>>>              /* cursor commands */
>>>>>>>>              VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6
>>>> +237,7
>>>>>> @@
>>>>>>>> \subsubsection{Device Operation: Request header}\label{sec:Device
>>>>>>>> Types / GPU De
>>>>>>>>              VIRTIO_GPU_RESP_OK_EDID,
>>>>>>>>              VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>>>>>>>>              VIRTIO_GPU_RESP_OK_MAP_INFO,
>>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
>>>>>>>>
>>>>>>>>              /* error responses */
>>>>>>>>              VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6
>>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
>>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };  \end{lstlisting}
>>>>>>>>
>>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
>>>> cumulative
>>>>>>>> +distance
>>>>>>>> +  between two devices before Peer-to-Peer(P2P) transaction.
>>>>>>>> +Request data
>>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}. Response
>>>>>>>> +type is
>>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
>>>>>>>> +  \field{virtio_gpu_resp_distance}.
>>>>>>>> +
>>>>>>>> +\begin{lstlisting}
>>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>>>> +        le32 provider_bus;
>>>>>>>> +        le32 provider_slot;
>>>>>>>> +        le32 provider_func;
>>>>>>>> +        le32 client_bus;
>>>>>>>> +        le32 client_slot;
>>>>>>>> +        le32 client_func;
>>>>>>>> +};
>>>>> The requester device's PCI RID is anyway known to the responder. So
>>>>> no
>>>> need to pass it here.
>>>>> It is unclear if its provider or the client.
>>>>
>>>> I don't think requester device's PCI RID is known to the responder,
>>>> since the existing function pci_p2pdma_distance() also require
>>>> provider and client information.
>>>>
>>> The existing function takes the provider pci device as input at C code level.
>>> And proposed extension (without provider BDF) will also operate on the
>> provider device on which the request/response transaction is happening.
>>
>> Well at first, I just want virtio GPU to handle the command to get
>> p2pdma_distance of two PCI devices(maybe none of them is virtio GPU).
> Using virtio gpu to find the distance between two arbitrary devices which are not virtio is certainly outside the scope and looks wrong.
> It should be done at some PCI topology level who exposed this devices.
For our use case, virtio GPU is the provider, we are using virtio GPU to 
calculate distance between virtio GPU and dGPU, which are not arbitrary 
devices.
> 
>> So
>> when guest needs to calculate distance of any two PCI devices, it can reuse
>> this command. But indeed, we only have this one use case at this moment, I
>> guess considering virtio GPU as provider and remove provider PCI RID of this
>> struct is fine.
> Right. A driver can issue the command to device without mentioning its own device identifier.
Since calling function pci_p2pdma_distance() still needs to pass its own 
RID, I think at least QEMU has to tell host kernel the RID of virito GPU.
> 
>>
>> But there's another problem: how does host kernel get virtio GPU RID and
>> calculate distance between virtio GPU and dGPU? If I remove provider
>> information of this struct, then QEMU still needs to pass virtio GPU RID to
>> host kernel side.
> The device implementation side can get to know about this as it has plugged in and built the topology.
> The RID we are discussing are the vRID seen by the guest driver.
> I probably missed to clarify this.
But only getting the distance on the virtual PCI inside the guest is not 
enough to figure out if P2P is possible or not. We need to get the real 
distance from host side. Please take a look at the discussion here: 
https://lore.kernel.org/all/bfe6a8b7-adcf-40e2-b7a2-4e64dc96862d@amd.com/

Hi MST,
We are talking about how to get the correct p2pdma_distance from host 
side and pass it to guest side. Do you have any suggestions?

Julia

>>
>>>
>>>> Do you mean that I should consider virtio GPU as provider by default?
>>> Yes.
>>>
>>>> Then this command will only work for this specific use case instead of
>> being
>>>> used as a common function.
>>>>
>>> Linux kernel API is operating on the device. Similarly proposed extension
>> will also be on the device level.
>>> If needed, an higher-level software will iterate over the interesting PCI
>> devices and invoke the pci_p2pdma_distance() API.
>>
>> The reason of passing this command from guest to host is that in our
>> case, we need to calculate p2pdma_distance of virtio GPU and dGPU on
>> host side. So my original design is quiet simple:
>> 1. Pass rid of virtio GPU(provider) and dGPU(client) to host side
>> 2. Host kernel calculate real p2pdma_distance of these devices.
>> 3. Return p2pdma_distance to guest side.
>>
> The virtio GPU is asking information about some other non virtio device in the system does not look correct to me.
> I believe this is the PCIe topology level should be able to provide the info without encapsulating this via some arbitrary device type like virtio gpu.
> 
>> And I think the higher-level software should know the exact RID of two
>> devices and then invoke pci_p2pdma_distance() to calculate distance.
> Not sure. if this can be discovered in physical level, it can be discovered at virtual guest VM level too as long as QEMU builds it in right way.
> 
>> So
>> I still think we should keep provider information in this struct and
>> following the existing logic of getting p2pdma_distance.
>>
>> Regards,
>> Julia
>>
>>>>>
>>>>> You only need to supply the other device RID.
>>>>> In PCI spec the RID is 16-bit. Considering other advancements, RID of 32-
>> bit
>>>> would be ok too.
>>>>>
>>>>> So please remove client_*.
>>>>> Replace provider_X with le32 other_device_pci_RID.
>>>>>
>>>>> Mention that other_device_RID follows PCI specification RID notation.
>>>>
>>>> OK I will modify it.
>>>>
>>>>>
>>>>>>>> +
>>>>>>> I assume above bdf information is what is visible to the guest?
>>>>>>> Why cannot it see this distance in the guest VM itself?
>>>>>>
>>>>>> Guest can see virtual bdf information which is not the real bdf
>>>>>> information of the host. So guest can't get the correct distance.
>>>>>> That's why I need to implement a new command to get and pass the
>> real
>>>> distance.
>>>>>>
>>>>> [..]
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +struct virtio_gpu_resp_distance {
>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>>>> +        le32 distance;
>>>>>>>> +}
>>>>>>>> +\end{lstlisting}
>>>>>>> I am not gpu expert so didn't pay attention to details.
>>>>>>> I believe you are missing basic information in the patch is what is
>>>>>>> the unit of
>>>>>> distance?
>>>>>>> Can you please add?
>>>>>>
>>>>>> The definition of distance is explained in kernel file:
>> driver/pci/p2pdma.c:
>>>>>>
>>>>> Description is good. Either it needs to be in the include/Linux/uapi/foo.h
>> file
>>>> and refer to it here.
>>>>> Or please add the description as part of this patch.
>>>>> Reference to kernel internal files can change and hence cannot be taken
>> as
>>>> reference for virtio net spec.
>>>>>
>>>>>> /*
>>>>>>     * Calculate the P2PDMA mapping type and distance between two PCI
>>>>>> devices.
>>>>>>     *
>>>>>>     * If the two devices are the same PCI function, return
>>>>>>     * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
>>>>>>     *
>>>>>>     * If they are two functions of the same device, return
>>>>>>     * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up to
>> the
>>>>>> bridge,
>>>>>>     * then one hop back down to another function of the same device).
>>>>>>     *
>>>>>>     * In the case where two devices are connected to the same PCIe
>> switch,
>>>>>>     * return a distance of 4. This corresponds to the following PCI tree:
>>>>>>     *
>>>>>>     *     -+  Root Port
>>>>>>     *      \+ Switch Upstream Port
>>>>>>     *       +-+ Switch Downstream Port 0
>>>>>>     *       + \- Device A
>>>>>>     *       \-+ Switch Downstream Port 1
>>>>>>     *         \- Device B
>>>>>>     *
>>>>>>     * The distance is 4 because we traverse from Device A to Downstream
>>>> Port 0
>>>>>>     * to the common Switch Upstream Port, back down to Downstream
>> Port
>>>>>> 1 and
>>>>>>     * then to Device B. The mapping type returned depends on the ACS
>>>>>>     * redirection setting of the ports along the path.
>>>>>>     *
>>>>>>     * If ACS redirect is set on any port in the path, traffic between the
>>>>>>     * devices will go through the host bridge, so return
>>>>>>     * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
>>>>>>     * PCI_P2PDMA_MAP_BUS_ADDR.
>>>>>>     *
>>>>>>     * Any two devices that have a data path that goes through the host
>> bridge
>>>>>>     * will consult a whitelist. If the host bridge is in the whitelist, return
>>>>>>     * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set to
>> the
>>>>>> number of
>>>>>>     * ports per above. If the device is not in the whitelist, return
>>>>>>     * PCI_P2PDMA_MAP_NOT_SUPPORTED.
>>>>>>     */
>>>>>>
>>>>>> I didn't change the unit, do you mean that I should add above
>>>>>> information to this patch?
>>>>>>
>>>>> Yes. please see above.
>>>>
>>>> OK, will do.
>>>>
>>>> Julia
>>>>
>>>>>
>>>>>> Regards,
>>>>>> Julia
>>>>>>
>>>>>>>
>>>>>>> Can you please wait for few days to raise the vote and sort out this
>> detail?
>>>>>>>
>>>>>>>> +
>>>>>>>> +The request data contains guest virtual pci notations of p2pdma
>>>>>>>> +provider and client. The response contains the distance between
>>>>>>>> +provider and client, which represents the compatibility of these
>>>>>>>> +two PCI
>>>>>>>> devices.
>>>>>>>> +
>>>>>>>>      \end{description}
>>>>>>>>
>>>>>>>>      \subsubsection{Device Operation: cursorq}\label{sec:Device Types
>>>>>>>> / GPU Device / Device Operation / Device Operation: cursorq}
>>>>>>>> --
>>>>>>>> 2.34.1
>>>>>>>>
>>>>>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-11  3:27               ` Zhang, Julia
@ 2025-03-11  4:00                 ` Parav Pandit
  2025-03-11  7:29                   ` Zhang, Julia
  0 siblings, 1 reply; 21+ messages in thread
From: Parav Pandit @ 2025-03-11  4:00 UTC (permalink / raw)
  To: Zhang, Julia, virtio-comment@lists.linux.dev, Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen



> From: Zhang, Julia <Julia.Zhang@amd.com>
> Sent: Tuesday, March 11, 2025 8:58 AM
> 
> On 2025/3/10 18:18, Parav Pandit wrote:
> >
> >> From: Zhang, Julia <Julia.Zhang@amd.com>
> >> Sent: Monday, March 10, 2025 1:07 PM
> >>
> >> On 2025/3/10 11:48, Parav Pandit wrote:
> >>>
> >>>
> >>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>> Sent: Thursday, March 6, 2025 12:47 PM
> >>>>
> >>>> On 2025/3/6 12:29, Parav Pandit wrote:
> >>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
> >>>>>>
> >>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
> >>>>>>>
> >>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
> >>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
> >>>>>>>> To: virtio-comment@lists.linux.dev
> >>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
> >>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>;
> >>>> Huang
> >>>>>> Rui
> >>>>>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>;
> Matias
> >>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
> >>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
> >>>>>>>>
> >>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for some
> >>>> scenes.
> >>>>>>>> For example, dGPU prime feature will let virtio-iGPU access to
> >>>>>>>> passthrough dGPU buffer.
> >>>>> Virtio specification does not describe dGPU and iGPU.
> >>>>> So either describe it in the description or please drop that and
> >>>>> explain the
> >>>> motivation without referring to those terms.
> >>>>
> >>>> OK I see.
> >>>>
> >>>>>
> >>>>>>>>
> >>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs to
> >>>>>>>> check the compatibility which is represented by p2pdma_distance.
> >>>>>>>> This defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to
> >> allow
> >>>> guest
> >>>>>> send
> >>>>>>>> virtual pci notation of two devices to host and get host
> >>>>>>>> physical p2pdma_distance of these two PCI devices.
> >>>>>>>>
> >>>>> [..]
> >>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
> >>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
> >>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
> >>>>>>>> ---
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> We are trying to implement dGPU prime feature in guest which
> >>>>>>>> will let
> >>>>>>>> virtio- iGPU read rendered data of passthrough dGPU. Before
> >>>>>>>> that, virtio gpu driver needs to get p2pdma_distance to check
> >>>>>>>> if P2P DMA transaction is possible or not.
> >>>>>>>>
> >>>>>>>> To implement getting p2pdma_distance, QEMU needs to handle the
> >>>>>>>> command from guest with virtual pci notations of two PCI
> >>>>>>>> devices and send it to host kernel and return host physical
> >>>>>>>> distance back to
> >>>> guest.
> >>>>>>>>
> >>>>>>>> So this defines the new command follow the suggestion in https:
> >>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
> >> julia.zhang@amd.com
> >>>>>>>> /
> >>>>>>>>
> >>>>>>>> changes from v1:
> >>>>>>>>      - add issue link to commit msg
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Julia
> >>>>>>>> ---
> >>>>>>>>      device-types/gpu/description.tex | 29
> >>>>>> +++++++++++++++++++++++++++++
> >>>>>>>>      1 file changed, 29 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
> >>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
> >>>>>>>> --- a/device-types/gpu/description.tex
> >>>>>>>> +++ b/device-types/gpu/description.tex
> >>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
> >>>>>>>> header}\label{sec:Device Types / GPU De
> >>>>>>>>              VIRTIO_GPU_CMD_SUBMIT_3D,
> >>>>>>>>              VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
> >>>>>>>>              VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
> >>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
> >>>>>>>>
> >>>>>>>>              /* cursor commands */
> >>>>>>>>              VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6
> >>>> +237,7
> >>>>>> @@
> >>>>>>>> \subsubsection{Device Operation: Request
> >>>>>>>> header}\label{sec:Device Types / GPU De
> >>>>>>>>              VIRTIO_GPU_RESP_OK_EDID,
> >>>>>>>>              VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
> >>>>>>>>              VIRTIO_GPU_RESP_OK_MAP_INFO,
> >>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
> >>>>>>>>
> >>>>>>>>              /* error responses */
> >>>>>>>>              VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6
> >>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
> >>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };  \end{lstlisting}
> >>>>>>>>
> >>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
> >>>> cumulative
> >>>>>>>> +distance
> >>>>>>>> +  between two devices before Peer-to-Peer(P2P) transaction.
> >>>>>>>> +Request data
> >>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}.
> >>>>>>>> +Response type is
> >>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
> >>>>>>>> +  \field{virtio_gpu_resp_distance}.
> >>>>>>>> +
> >>>>>>>> +\begin{lstlisting}
> >>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
> >>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>>>>>> +        le32 provider_bus;
> >>>>>>>> +        le32 provider_slot;
> >>>>>>>> +        le32 provider_func;
> >>>>>>>> +        le32 client_bus;
> >>>>>>>> +        le32 client_slot;
> >>>>>>>> +        le32 client_func;
> >>>>>>>> +};
> >>>>> The requester device's PCI RID is anyway known to the responder.
> >>>>> So no
> >>>> need to pass it here.
> >>>>> It is unclear if its provider or the client.
> >>>>
> >>>> I don't think requester device's PCI RID is known to the responder,
> >>>> since the existing function pci_p2pdma_distance() also require
> >>>> provider and client information.
> >>>>
> >>> The existing function takes the provider pci device as input at C code level.
> >>> And proposed extension (without provider BDF) will also operate on
> >>> the
> >> provider device on which the request/response transaction is happening.
> >>
> >> Well at first, I just want virtio GPU to handle the command to get
> >> p2pdma_distance of two PCI devices(maybe none of them is virtio GPU).
> > Using virtio gpu to find the distance between two arbitrary devices which
> are not virtio is certainly outside the scope and looks wrong.
> > It should be done at some PCI topology level who exposed this devices.
> For our use case, virtio GPU is the provider, we are using virtio GPU to
> calculate distance between virtio GPU and dGPU, which are not arbitrary
> devices.
The fact that in physical world this distance is calculated without involving the individual provider and client EP PCI devices, in virtual world doing this via virtual PCI EP devices does not look correct abstraction to me.
I tend to imagine that one needs to build a right virtual topology in QEMU or somewhere else who is providing these PCI devices such that guest kernel can calculate it accurately without depending on the EP PCI devices.


> >
> >> So
> >> when guest needs to calculate distance of any two PCI devices, it can
> >> reuse this command. But indeed, we only have this one use case at
> >> this moment, I guess considering virtio GPU as provider and remove
> >> provider PCI RID of this struct is fine.
> > Right. A driver can issue the command to device without mentioning its own
> device identifier.
> Since calling function pci_p2pdma_distance() still needs to pass its own RID, I
> think at least QEMU has to tell host kernel the RID of virito GPU.
Probably yes, would be interesting to see this UAPI on the host kernel too, (unrelated to this patch).

> >
> >>
> >> But there's another problem: how does host kernel get virtio GPU RID
> >> and calculate distance between virtio GPU and dGPU? If I remove
> >> provider information of this struct, then QEMU still needs to pass
> >> virtio GPU RID to host kernel side.
> > The device implementation side can get to know about this as it has plugged
> in and built the topology.
> > The RID we are discussing are the vRID seen by the guest driver.
> > I probably missed to clarify this.
> But only getting the distance on the virtual PCI inside the guest is not enough
> to figure out if P2P is possible or not. We need to get the real distance from
> host side. Please take a look at the discussion here:
> https://lore.kernel.org/all/bfe6a8b7-adcf-40e2-b7a2-4e64dc96862d@amd.com/

Finding out the distance between two PCI devices in physical world is role of the PCI bridge/switch topology builder.
So why should it change for the guest VM?
Why virtualized topology cannot be build in a way such that it matches to the physical world to get the accurate distance?
At least to me, doing this in the pci RP domain (and not PCI EP domain) seems the right way to go forward.

> 
> Hi MST,
> We are talking about how to get the correct p2pdma_distance from host side
> and pass it to guest side. Do you have any suggestions?
> 
> Julia
> 
> >>
> >>>
> >>>> Do you mean that I should consider virtio GPU as provider by default?
> >>> Yes.
> >>>
> >>>> Then this command will only work for this specific use case instead
> >>>> of
> >> being
> >>>> used as a common function.
> >>>>
> >>> Linux kernel API is operating on the device. Similarly proposed
> >>> extension
> >> will also be on the device level.
> >>> If needed, an higher-level software will iterate over the
> >>> interesting PCI
> >> devices and invoke the pci_p2pdma_distance() API.
> >>
> >> The reason of passing this command from guest to host is that in our
> >> case, we need to calculate p2pdma_distance of virtio GPU and dGPU on
> >> host side. So my original design is quiet simple:
> >> 1. Pass rid of virtio GPU(provider) and dGPU(client) to host side 2.
> >> Host kernel calculate real p2pdma_distance of these devices.
> >> 3. Return p2pdma_distance to guest side.
> >>
> > The virtio GPU is asking information about some other non virtio device in
> the system does not look correct to me.
> > I believe this is the PCIe topology level should be able to provide the info
> without encapsulating this via some arbitrary device type like virtio gpu.
> >
> >> And I think the higher-level software should know the exact RID of
> >> two devices and then invoke pci_p2pdma_distance() to calculate distance.
> > Not sure. if this can be discovered in physical level, it can be discovered at
> virtual guest VM level too as long as QEMU builds it in right way.
> >
> >> So
> >> I still think we should keep provider information in this struct and
> >> following the existing logic of getting p2pdma_distance.
> >>
> >> Regards,
> >> Julia
> >>
> >>>>>
> >>>>> You only need to supply the other device RID.
> >>>>> In PCI spec the RID is 16-bit. Considering other advancements, RID
> >>>>> of 32-
> >> bit
> >>>> would be ok too.
> >>>>>
> >>>>> So please remove client_*.
> >>>>> Replace provider_X with le32 other_device_pci_RID.
> >>>>>
> >>>>> Mention that other_device_RID follows PCI specification RID notation.
> >>>>
> >>>> OK I will modify it.
> >>>>
> >>>>>
> >>>>>>>> +
> >>>>>>> I assume above bdf information is what is visible to the guest?
> >>>>>>> Why cannot it see this distance in the guest VM itself?
> >>>>>>
> >>>>>> Guest can see virtual bdf information which is not the real bdf
> >>>>>> information of the host. So guest can't get the correct distance.
> >>>>>> That's why I need to implement a new command to get and pass the
> >> real
> >>>> distance.
> >>>>>>
> >>>>> [..]
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> +struct virtio_gpu_resp_distance {
> >>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>>>>>> +        le32 distance;
> >>>>>>>> +}
> >>>>>>>> +\end{lstlisting}
> >>>>>>> I am not gpu expert so didn't pay attention to details.
> >>>>>>> I believe you are missing basic information in the patch is what
> >>>>>>> is the unit of
> >>>>>> distance?
> >>>>>>> Can you please add?
> >>>>>>
> >>>>>> The definition of distance is explained in kernel file:
> >> driver/pci/p2pdma.c:
> >>>>>>
> >>>>> Description is good. Either it needs to be in the
> >>>>> include/Linux/uapi/foo.h
> >> file
> >>>> and refer to it here.
> >>>>> Or please add the description as part of this patch.
> >>>>> Reference to kernel internal files can change and hence cannot be
> >>>>> taken
> >> as
> >>>> reference for virtio net spec.
> >>>>>
> >>>>>> /*
> >>>>>>     * Calculate the P2PDMA mapping type and distance between two
> >>>>>> PCI devices.
> >>>>>>     *
> >>>>>>     * If the two devices are the same PCI function, return
> >>>>>>     * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
> >>>>>>     *
> >>>>>>     * If they are two functions of the same device, return
> >>>>>>     * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up
> to
> >> the
> >>>>>> bridge,
> >>>>>>     * then one hop back down to another function of the same device).
> >>>>>>     *
> >>>>>>     * In the case where two devices are connected to the same
> >>>>>> PCIe
> >> switch,
> >>>>>>     * return a distance of 4. This corresponds to the following PCI tree:
> >>>>>>     *
> >>>>>>     *     -+  Root Port
> >>>>>>     *      \+ Switch Upstream Port
> >>>>>>     *       +-+ Switch Downstream Port 0
> >>>>>>     *       + \- Device A
> >>>>>>     *       \-+ Switch Downstream Port 1
> >>>>>>     *         \- Device B
> >>>>>>     *
> >>>>>>     * The distance is 4 because we traverse from Device A to
> >>>>>> Downstream
> >>>> Port 0
> >>>>>>     * to the common Switch Upstream Port, back down to Downstream
> >> Port
> >>>>>> 1 and
> >>>>>>     * then to Device B. The mapping type returned depends on the ACS
> >>>>>>     * redirection setting of the ports along the path.
> >>>>>>     *
> >>>>>>     * If ACS redirect is set on any port in the path, traffic between the
> >>>>>>     * devices will go through the host bridge, so return
> >>>>>>     * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
> >>>>>>     * PCI_P2PDMA_MAP_BUS_ADDR.
> >>>>>>     *
> >>>>>>     * Any two devices that have a data path that goes through the
> >>>>>> host
> >> bridge
> >>>>>>     * will consult a whitelist. If the host bridge is in the whitelist, return
> >>>>>>     * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set to
> >> the
> >>>>>> number of
> >>>>>>     * ports per above. If the device is not in the whitelist, return
> >>>>>>     * PCI_P2PDMA_MAP_NOT_SUPPORTED.
> >>>>>>     */
> >>>>>>
> >>>>>> I didn't change the unit, do you mean that I should add above
> >>>>>> information to this patch?
> >>>>>>
> >>>>> Yes. please see above.
> >>>>
> >>>> OK, will do.
> >>>>
> >>>> Julia
> >>>>
> >>>>>
> >>>>>> Regards,
> >>>>>> Julia
> >>>>>>
> >>>>>>>
> >>>>>>> Can you please wait for few days to raise the vote and sort out
> >>>>>>> this
> >> detail?
> >>>>>>>
> >>>>>>>> +
> >>>>>>>> +The request data contains guest virtual pci notations of
> >>>>>>>> +p2pdma provider and client. The response contains the distance
> >>>>>>>> +between provider and client, which represents the
> >>>>>>>> +compatibility of these two PCI
> >>>>>>>> devices.
> >>>>>>>> +
> >>>>>>>>      \end{description}
> >>>>>>>>
> >>>>>>>>      \subsubsection{Device Operation: cursorq}\label{sec:Device
> >>>>>>>> Types / GPU Device / Device Operation / Device Operation:
> >>>>>>>> cursorq}
> >>>>>>>> --
> >>>>>>>> 2.34.1
> >>>>>>>>
> >>>>>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-11  4:00                 ` Parav Pandit
@ 2025-03-11  7:29                   ` Zhang, Julia
  2025-03-11 15:14                     ` Parav Pandit
  0 siblings, 1 reply; 21+ messages in thread
From: Zhang, Julia @ 2025-03-11  7:29 UTC (permalink / raw)
  To: Parav Pandit, Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev,
	Zhang, Julia



On 2025/3/11 12:00, Parav Pandit wrote:
> 
> 
>> From: Zhang, Julia <Julia.Zhang@amd.com>
>> Sent: Tuesday, March 11, 2025 8:58 AM
>>
>> On 2025/3/10 18:18, Parav Pandit wrote:
>>>
>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>> Sent: Monday, March 10, 2025 1:07 PM
>>>>
>>>> On 2025/3/10 11:48, Parav Pandit wrote:
>>>>>
>>>>>
>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>> Sent: Thursday, March 6, 2025 12:47 PM
>>>>>>
>>>>>> On 2025/3/6 12:29, Parav Pandit wrote:
>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
>>>>>>>>
>>>>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
>>>>>>>>>
>>>>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
>>>>>>>>>> To: virtio-comment@lists.linux.dev
>>>>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
>>>>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan <lingshan.zhu@amd.com>;
>>>>>> Huang
>>>>>>>> Rui
>>>>>>>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>;
>> Matias
>>>>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
>>>>>>>>>>
>>>>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for some
>>>>>> scenes.
>>>>>>>>>> For example, dGPU prime feature will let virtio-iGPU access to
>>>>>>>>>> passthrough dGPU buffer.
>>>>>>> Virtio specification does not describe dGPU and iGPU.
>>>>>>> So either describe it in the description or please drop that and
>>>>>>> explain the
>>>>>> motivation without referring to those terms.
>>>>>>
>>>>>> OK I see.
>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs to
>>>>>>>>>> check the compatibility which is represented by p2pdma_distance.
>>>>>>>>>> This defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE to
>>>> allow
>>>>>> guest
>>>>>>>> send
>>>>>>>>>> virtual pci notation of two devices to host and get host
>>>>>>>>>> physical p2pdma_distance of these two PCI devices.
>>>>>>>>>>
>>>>>>> [..]
>>>>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
>>>>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> We are trying to implement dGPU prime feature in guest which
>>>>>>>>>> will let
>>>>>>>>>> virtio- iGPU read rendered data of passthrough dGPU. Before
>>>>>>>>>> that, virtio gpu driver needs to get p2pdma_distance to check
>>>>>>>>>> if P2P DMA transaction is possible or not.
>>>>>>>>>>
>>>>>>>>>> To implement getting p2pdma_distance, QEMU needs to handle the
>>>>>>>>>> command from guest with virtual pci notations of two PCI
>>>>>>>>>> devices and send it to host kernel and return host physical
>>>>>>>>>> distance back to
>>>>>> guest.
>>>>>>>>>>
>>>>>>>>>> So this defines the new command follow the suggestion in https:
>>>>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
>>>> julia.zhang@amd.com
>>>>>>>>>> /
>>>>>>>>>>
>>>>>>>>>> changes from v1:
>>>>>>>>>>       - add issue link to commit msg
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Julia
>>>>>>>>>> ---
>>>>>>>>>>       device-types/gpu/description.tex | 29
>>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>       1 file changed, 29 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
>>>>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
>>>>>>>>>> --- a/device-types/gpu/description.tex
>>>>>>>>>> +++ b/device-types/gpu/description.tex
>>>>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation: Request
>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>               VIRTIO_GPU_CMD_SUBMIT_3D,
>>>>>>>>>>               VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>>>>>>>>>>               VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
>>>>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
>>>>>>>>>>
>>>>>>>>>>               /* cursor commands */
>>>>>>>>>>               VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@ -236,6
>>>>>> +237,7
>>>>>>>> @@
>>>>>>>>>> \subsubsection{Device Operation: Request
>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>               VIRTIO_GPU_RESP_OK_EDID,
>>>>>>>>>>               VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>>>>>>>>>>               VIRTIO_GPU_RESP_OK_MAP_INFO,
>>>>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
>>>>>>>>>>
>>>>>>>>>>               /* error responses */
>>>>>>>>>>               VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6
>>>>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
>>>>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };  \end{lstlisting}
>>>>>>>>>>
>>>>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
>>>>>> cumulative
>>>>>>>>>> +distance
>>>>>>>>>> +  between two devices before Peer-to-Peer(P2P) transaction.
>>>>>>>>>> +Request data
>>>>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}.
>>>>>>>>>> +Response type is
>>>>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
>>>>>>>>>> +  \field{virtio_gpu_resp_distance}.
>>>>>>>>>> +
>>>>>>>>>> +\begin{lstlisting}
>>>>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
>>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>>>>>> +        le32 provider_bus;
>>>>>>>>>> +        le32 provider_slot;
>>>>>>>>>> +        le32 provider_func;
>>>>>>>>>> +        le32 client_bus;
>>>>>>>>>> +        le32 client_slot;
>>>>>>>>>> +        le32 client_func;
>>>>>>>>>> +};
>>>>>>> The requester device's PCI RID is anyway known to the responder.
>>>>>>> So no
>>>>>> need to pass it here.
>>>>>>> It is unclear if its provider or the client.
>>>>>>
>>>>>> I don't think requester device's PCI RID is known to the responder,
>>>>>> since the existing function pci_p2pdma_distance() also require
>>>>>> provider and client information.
>>>>>>
>>>>> The existing function takes the provider pci device as input at C code level.
>>>>> And proposed extension (without provider BDF) will also operate on
>>>>> the
>>>> provider device on which the request/response transaction is happening.
>>>>
>>>> Well at first, I just want virtio GPU to handle the command to get
>>>> p2pdma_distance of two PCI devices(maybe none of them is virtio GPU).
>>> Using virtio gpu to find the distance between two arbitrary devices which
>> are not virtio is certainly outside the scope and looks wrong.
>>> It should be done at some PCI topology level who exposed this devices.
>> For our use case, virtio GPU is the provider, we are using virtio GPU to
>> calculate distance between virtio GPU and dGPU, which are not arbitrary
>> devices.
> The fact that in physical world this distance is calculated without involving the individual provider and client EP PCI devices, in virtual world doing this via virtual PCI EP devices does not look correct abstraction to me.

In physical world, this distance is calculated by calling existing 
function pci_p2pdma_distance(), which will pass the the provider and 
client. I understand that you mean virtio GPU should be provider by 
default, but my point is that when I call pci_p2pdma_distance() to 
calculate distance between virtio GPU and dGPU, I still need to pass the 
RID of them to host kernel side. And because host kernel doesn't know 
which one GPU is the virtio GPU, this struct still needs to include 
provider information.

> I tend to imagine that one needs to build a right virtual topology in QEMU or somewhere else who is providing these PCI devices such that guest kernel can calculate it accurately without depending on the EP PCI devices.

But the virtio GPU BDF in guest is different from host side, guest 
kernel cannot calculate the correct physical distance from guest side.
> 
> 
>>>
>>>> So
>>>> when guest needs to calculate distance of any two PCI devices, it can
>>>> reuse this command. But indeed, we only have this one use case at
>>>> this moment, I guess considering virtio GPU as provider and remove
>>>> provider PCI RID of this struct is fine.
>>> Right. A driver can issue the command to device without mentioning its own
>> device identifier.
>> Since calling function pci_p2pdma_distance() still needs to pass its own RID, I
>> think at least QEMU has to tell host kernel the RID of virito GPU.
> Probably yes, would be interesting to see this UAPI on the host kernel too, (unrelated to this patch).
> 
>>>
>>>>
>>>> But there's another problem: how does host kernel get virtio GPU RID
>>>> and calculate distance between virtio GPU and dGPU? If I remove
>>>> provider information of this struct, then QEMU still needs to pass
>>>> virtio GPU RID to host kernel side.
>>> The device implementation side can get to know about this as it has plugged
>> in and built the topology.
>>> The RID we are discussing are the vRID seen by the guest driver.
>>> I probably missed to clarify this.
>> But only getting the distance on the virtual PCI inside the guest is not enough
>> to figure out if P2P is possible or not. We need to get the real distance from
>> host side. Please take a look at the discussion here:
>> https://lore.kernel.org/all/bfe6a8b7-adcf-40e2-b7a2-4e64dc96862d@amd.com/
> 
> Finding out the distance between two PCI devices in physical world is role of the PCI bridge/switch topology builder.
> So why should it change for the guest VM?

Because as discussed in above link, before letting virtio GPU and dGPU 
do P2P, we have to check if these two devices can communicate with each 
other by calculating the distance in physical world.
It's not just the role of PCI bridge/switch topology builder. Checking 
this distance is a necessary step before P2P like amdgpu does:
amdgpu_dma_buf_attach
->pci_p2pdma_distance

> Why virtualized topology cannot be build in a way such that it matches to the physical world to get the accurate distance?
> At least to me, doing this in the pci RP domain (and not PCI EP domain) seems the right way to go forward.

Because the BDF of virtio GPU and passthrough dGPU on virtual PCI bridge 
is different from the physical world, which makes guest kernel cannot 
get the accurate distance.

Julia
> 
>>
>> Hi MST,
>> We are talking about how to get the correct p2pdma_distance from host side
>> and pass it to guest side. Do you have any suggestions?
>>
>> Julia
>>
>>>>
>>>>>
>>>>>> Do you mean that I should consider virtio GPU as provider by default?
>>>>> Yes.
>>>>>
>>>>>> Then this command will only work for this specific use case instead
>>>>>> of
>>>> being
>>>>>> used as a common function.
>>>>>>
>>>>> Linux kernel API is operating on the device. Similarly proposed
>>>>> extension
>>>> will also be on the device level.
>>>>> If needed, an higher-level software will iterate over the
>>>>> interesting PCI
>>>> devices and invoke the pci_p2pdma_distance() API.
>>>>
>>>> The reason of passing this command from guest to host is that in our
>>>> case, we need to calculate p2pdma_distance of virtio GPU and dGPU on
>>>> host side. So my original design is quiet simple:
>>>> 1. Pass rid of virtio GPU(provider) and dGPU(client) to host side 2.
>>>> Host kernel calculate real p2pdma_distance of these devices.
>>>> 3. Return p2pdma_distance to guest side.
>>>>
>>> The virtio GPU is asking information about some other non virtio device in
>> the system does not look correct to me.
>>> I believe this is the PCIe topology level should be able to provide the info
>> without encapsulating this via some arbitrary device type like virtio gpu.
>>>
>>>> And I think the higher-level software should know the exact RID of
>>>> two devices and then invoke pci_p2pdma_distance() to calculate distance.
>>> Not sure. if this can be discovered in physical level, it can be discovered at
>> virtual guest VM level too as long as QEMU builds it in right way.
>>>
>>>> So
>>>> I still think we should keep provider information in this struct and
>>>> following the existing logic of getting p2pdma_distance.
>>>>
>>>> Regards,
>>>> Julia
>>>>
>>>>>>>
>>>>>>> You only need to supply the other device RID.
>>>>>>> In PCI spec the RID is 16-bit. Considering other advancements, RID
>>>>>>> of 32-
>>>> bit
>>>>>> would be ok too.
>>>>>>>
>>>>>>> So please remove client_*.
>>>>>>> Replace provider_X with le32 other_device_pci_RID.
>>>>>>>
>>>>>>> Mention that other_device_RID follows PCI specification RID notation.
>>>>>>
>>>>>> OK I will modify it.
>>>>>>
>>>>>>>
>>>>>>>>>> +
>>>>>>>>> I assume above bdf information is what is visible to the guest?
>>>>>>>>> Why cannot it see this distance in the guest VM itself?
>>>>>>>>
>>>>>>>> Guest can see virtual bdf information which is not the real bdf
>>>>>>>> information of the host. So guest can't get the correct distance.
>>>>>>>> That's why I need to implement a new command to get and pass the
>>>> real
>>>>>> distance.
>>>>>>>>
>>>>>>> [..]
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +struct virtio_gpu_resp_distance {
>>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>>>>>> +        le32 distance;
>>>>>>>>>> +}
>>>>>>>>>> +\end{lstlisting}
>>>>>>>>> I am not gpu expert so didn't pay attention to details.
>>>>>>>>> I believe you are missing basic information in the patch is what
>>>>>>>>> is the unit of
>>>>>>>> distance?
>>>>>>>>> Can you please add?
>>>>>>>>
>>>>>>>> The definition of distance is explained in kernel file:
>>>> driver/pci/p2pdma.c:
>>>>>>>>
>>>>>>> Description is good. Either it needs to be in the
>>>>>>> include/Linux/uapi/foo.h
>>>> file
>>>>>> and refer to it here.
>>>>>>> Or please add the description as part of this patch.
>>>>>>> Reference to kernel internal files can change and hence cannot be
>>>>>>> taken
>>>> as
>>>>>> reference for virtio net spec.
>>>>>>>
>>>>>>>> /*
>>>>>>>>      * Calculate the P2PDMA mapping type and distance between two
>>>>>>>> PCI devices.
>>>>>>>>      *
>>>>>>>>      * If the two devices are the same PCI function, return
>>>>>>>>      * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
>>>>>>>>      *
>>>>>>>>      * If they are two functions of the same device, return
>>>>>>>>      * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up
>> to
>>>> the
>>>>>>>> bridge,
>>>>>>>>      * then one hop back down to another function of the same device).
>>>>>>>>      *
>>>>>>>>      * In the case where two devices are connected to the same
>>>>>>>> PCIe
>>>> switch,
>>>>>>>>      * return a distance of 4. This corresponds to the following PCI tree:
>>>>>>>>      *
>>>>>>>>      *     -+  Root Port
>>>>>>>>      *      \+ Switch Upstream Port
>>>>>>>>      *       +-+ Switch Downstream Port 0
>>>>>>>>      *       + \- Device A
>>>>>>>>      *       \-+ Switch Downstream Port 1
>>>>>>>>      *         \- Device B
>>>>>>>>      *
>>>>>>>>      * The distance is 4 because we traverse from Device A to
>>>>>>>> Downstream
>>>>>> Port 0
>>>>>>>>      * to the common Switch Upstream Port, back down to Downstream
>>>> Port
>>>>>>>> 1 and
>>>>>>>>      * then to Device B. The mapping type returned depends on the ACS
>>>>>>>>      * redirection setting of the ports along the path.
>>>>>>>>      *
>>>>>>>>      * If ACS redirect is set on any port in the path, traffic between the
>>>>>>>>      * devices will go through the host bridge, so return
>>>>>>>>      * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
>>>>>>>>      * PCI_P2PDMA_MAP_BUS_ADDR.
>>>>>>>>      *
>>>>>>>>      * Any two devices that have a data path that goes through the
>>>>>>>> host
>>>> bridge
>>>>>>>>      * will consult a whitelist. If the host bridge is in the whitelist, return
>>>>>>>>      * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set to
>>>> the
>>>>>>>> number of
>>>>>>>>      * ports per above. If the device is not in the whitelist, return
>>>>>>>>      * PCI_P2PDMA_MAP_NOT_SUPPORTED.
>>>>>>>>      */
>>>>>>>>
>>>>>>>> I didn't change the unit, do you mean that I should add above
>>>>>>>> information to this patch?
>>>>>>>>
>>>>>>> Yes. please see above.
>>>>>>
>>>>>> OK, will do.
>>>>>>
>>>>>> Julia
>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Julia
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Can you please wait for few days to raise the vote and sort out
>>>>>>>>> this
>>>> detail?
>>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> +The request data contains guest virtual pci notations of
>>>>>>>>>> +p2pdma provider and client. The response contains the distance
>>>>>>>>>> +between provider and client, which represents the
>>>>>>>>>> +compatibility of these two PCI
>>>>>>>>>> devices.
>>>>>>>>>> +
>>>>>>>>>>       \end{description}
>>>>>>>>>>
>>>>>>>>>>       \subsubsection{Device Operation: cursorq}\label{sec:Device
>>>>>>>>>> Types / GPU Device / Device Operation / Device Operation:
>>>>>>>>>> cursorq}
>>>>>>>>>> --
>>>>>>>>>> 2.34.1
>>>>>>>>>>
>>>>>>>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-11  7:29                   ` Zhang, Julia
@ 2025-03-11 15:14                     ` Parav Pandit
  2025-03-12  6:20                       ` Zhang, Julia
  0 siblings, 1 reply; 21+ messages in thread
From: Parav Pandit @ 2025-03-11 15:14 UTC (permalink / raw)
  To: Zhang, Julia, Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev


> From: Zhang, Julia <Julia.Zhang@amd.com>
> Sent: Tuesday, March 11, 2025 12:59 PM
> 
> On 2025/3/11 12:00, Parav Pandit wrote:
> >
> >
> >> From: Zhang, Julia <Julia.Zhang@amd.com>
> >> Sent: Tuesday, March 11, 2025 8:58 AM
> >>
> >> On 2025/3/10 18:18, Parav Pandit wrote:
> >>>
> >>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>> Sent: Monday, March 10, 2025 1:07 PM
> >>>>
> >>>> On 2025/3/10 11:48, Parav Pandit wrote:
> >>>>>
> >>>>>
> >>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>> Sent: Thursday, March 6, 2025 12:47 PM
> >>>>>>
> >>>>>> On 2025/3/6 12:29, Parav Pandit wrote:
> >>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
> >>>>>>>>
> >>>>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
> >>>>>>>>>
> >>>>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
> >>>>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
> >>>>>>>>>> To: virtio-comment@lists.linux.dev
> >>>>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
> >>>>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan
> <lingshan.zhu@amd.com>;
> >>>>>> Huang
> >>>>>>>> Rui
> >>>>>>>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>;
> >> Matias
> >>>>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
> >>>>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
> >>>>>>>>>>
> >>>>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for
> >>>>>>>>>> some
> >>>>>> scenes.
> >>>>>>>>>> For example, dGPU prime feature will let virtio-iGPU access
> >>>>>>>>>> to passthrough dGPU buffer.
> >>>>>>> Virtio specification does not describe dGPU and iGPU.
> >>>>>>> So either describe it in the description or please drop that and
> >>>>>>> explain the
> >>>>>> motivation without referring to those terms.
> >>>>>>
> >>>>>> OK I see.
> >>>>>>
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs to
> >>>>>>>>>> check the compatibility which is represented by
> p2pdma_distance.
> >>>>>>>>>> This defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE
> to
> >>>> allow
> >>>>>> guest
> >>>>>>>> send
> >>>>>>>>>> virtual pci notation of two devices to host and get host
> >>>>>>>>>> physical p2pdma_distance of these two PCI devices.
> >>>>>>>>>>
> >>>>>>> [..]
> >>>>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
> >>>>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
> >>>>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen
> >>>>>>>>>> <mvaralar@redhat.com>
> >>>>>>>>>> ---
> >>>>>>>>>> Hi all,
> >>>>>>>>>>
> >>>>>>>>>> We are trying to implement dGPU prime feature in guest which
> >>>>>>>>>> will let
> >>>>>>>>>> virtio- iGPU read rendered data of passthrough dGPU. Before
> >>>>>>>>>> that, virtio gpu driver needs to get p2pdma_distance to check
> >>>>>>>>>> if P2P DMA transaction is possible or not.
> >>>>>>>>>>
> >>>>>>>>>> To implement getting p2pdma_distance, QEMU needs to handle
> >>>>>>>>>> the command from guest with virtual pci notations of two PCI
> >>>>>>>>>> devices and send it to host kernel and return host physical
> >>>>>>>>>> distance back to
> >>>>>> guest.
> >>>>>>>>>>
> >>>>>>>>>> So this defines the new command follow the suggestion in https:
> >>>>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
> >>>> julia.zhang@amd.com
> >>>>>>>>>> /
> >>>>>>>>>>
> >>>>>>>>>> changes from v1:
> >>>>>>>>>>       - add issue link to commit msg
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Julia
> >>>>>>>>>> ---
> >>>>>>>>>>       device-types/gpu/description.tex | 29
> >>>>>>>> +++++++++++++++++++++++++++++
> >>>>>>>>>>       1 file changed, 29 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
> >>>>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
> >>>>>>>>>> --- a/device-types/gpu/description.tex
> >>>>>>>>>> +++ b/device-types/gpu/description.tex
> >>>>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation:
> Request
> >>>>>>>>>> header}\label{sec:Device Types / GPU De
> >>>>>>>>>>               VIRTIO_GPU_CMD_SUBMIT_3D,
> >>>>>>>>>>               VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
> >>>>>>>>>>               VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
> >>>>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
> >>>>>>>>>>
> >>>>>>>>>>               /* cursor commands */
> >>>>>>>>>>               VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@
> >>>>>>>>>> -236,6
> >>>>>> +237,7
> >>>>>>>> @@
> >>>>>>>>>> \subsubsection{Device Operation: Request
> >>>>>>>>>> header}\label{sec:Device Types / GPU De
> >>>>>>>>>>               VIRTIO_GPU_RESP_OK_EDID,
> >>>>>>>>>>               VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
> >>>>>>>>>>               VIRTIO_GPU_RESP_OK_MAP_INFO,
> >>>>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
> >>>>>>>>>>
> >>>>>>>>>>               /* error responses */
> >>>>>>>>>>               VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6
> >>>>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
> >>>>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };  \end{lstlisting}
> >>>>>>>>>>
> >>>>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
> >>>>>> cumulative
> >>>>>>>>>> +distance
> >>>>>>>>>> +  between two devices before Peer-to-Peer(P2P) transaction.
> >>>>>>>>>> +Request data
> >>>>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}.
> >>>>>>>>>> +Response type is
> >>>>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
> >>>>>>>>>> +  \field{virtio_gpu_resp_distance}.
> >>>>>>>>>> +
> >>>>>>>>>> +\begin{lstlisting}
> >>>>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
> >>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>>>>>>>> +        le32 provider_bus;
> >>>>>>>>>> +        le32 provider_slot;
> >>>>>>>>>> +        le32 provider_func;
> >>>>>>>>>> +        le32 client_bus;
> >>>>>>>>>> +        le32 client_slot;
> >>>>>>>>>> +        le32 client_func;
> >>>>>>>>>> +};
> >>>>>>> The requester device's PCI RID is anyway known to the responder.
> >>>>>>> So no
> >>>>>> need to pass it here.
> >>>>>>> It is unclear if its provider or the client.
> >>>>>>
> >>>>>> I don't think requester device's PCI RID is known to the
> >>>>>> responder, since the existing function pci_p2pdma_distance() also
> >>>>>> require provider and client information.
> >>>>>>
> >>>>> The existing function takes the provider pci device as input at C code
> level.
> >>>>> And proposed extension (without provider BDF) will also operate on
> >>>>> the
> >>>> provider device on which the request/response transaction is
> happening.
> >>>>
> >>>> Well at first, I just want virtio GPU to handle the command to get
> >>>> p2pdma_distance of two PCI devices(maybe none of them is virtio
> GPU).
> >>> Using virtio gpu to find the distance between two arbitrary devices
> >>> which
> >> are not virtio is certainly outside the scope and looks wrong.
> >>> It should be done at some PCI topology level who exposed this devices.
> >> For our use case, virtio GPU is the provider, we are using virtio GPU
> >> to calculate distance between virtio GPU and dGPU, which are not
> >> arbitrary devices.
> > The fact that in physical world this distance is calculated without involving
> the individual provider and client EP PCI devices, in virtual world doing this via
> virtual PCI EP devices does not look correct abstraction to me.
> 
> In physical world, this distance is calculated by calling existing function
> pci_p2pdma_distance(), which will pass the the provider and client. I
> understand that you mean virtio GPU should be provider by default, but my
> point is that when I call pci_p2pdma_distance() to calculate distance between
> virtio GPU and dGPU, I still need to pass the RID of them to host kernel side.
> And because host kernel doesn't know which one GPU is the virtio GPU, this
> struct still needs to include provider information.
> 
> > I tend to imagine that one needs to build a right virtual topology in QEMU
> or somewhere else who is providing these PCI devices such that guest kernel
> can calculate it accurately without depending on the EP PCI devices.
> 
> But the virtio GPU BDF in guest is different from host side, guest kernel cannot
> calculate the correct physical distance from guest side.
Sure virtio GPU BDF is likely vRID.
Guest should request the right distance calculator entity, possibly the one that is providing the connectivity (and distance), which likely is pci RC/bridge/switch etc.

Imagine a virtio PCI driver asking to to the virtio PCI device, how far is it from the vCPU for NUMA awareness.
It seems wrong to me.
Because it is unable to follow the same flow between virtual and physical world.

> >
> >
> >>>
> >>>> So
> >>>> when guest needs to calculate distance of any two PCI devices, it
> >>>> can reuse this command. But indeed, we only have this one use case
> >>>> at this moment, I guess considering virtio GPU as provider and
> >>>> remove provider PCI RID of this struct is fine.
> >>> Right. A driver can issue the command to device without mentioning
> >>> its own
> >> device identifier.
> >> Since calling function pci_p2pdma_distance() still needs to pass its
> >> own RID, I think at least QEMU has to tell host kernel the RID of virito GPU.
> > Probably yes, would be interesting to see this UAPI on the host kernel too,
> (unrelated to this patch).
> >
> >>>
> >>>>
> >>>> But there's another problem: how does host kernel get virtio GPU
> >>>> RID and calculate distance between virtio GPU and dGPU? If I remove
> >>>> provider information of this struct, then QEMU still needs to pass
> >>>> virtio GPU RID to host kernel side.
> >>> The device implementation side can get to know about this as it has
> >>> plugged
> >> in and built the topology.
> >>> The RID we are discussing are the vRID seen by the guest driver.
> >>> I probably missed to clarify this.
> >> But only getting the distance on the virtual PCI inside the guest is
> >> not enough to figure out if P2P is possible or not. We need to get
> >> the real distance from host side. Please take a look at the discussion here:
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flor
> >> e.kernel.org%2Fall%2Fbfe6a8b7-adcf-40e2-b7a2-
> 4e64dc96862d%40amd.com%2
> >>
> F&data=05%7C02%7Cparav%40nvidia.com%7Ccf5112991dc145cd786508dd60
> 6e772
> >>
> 9%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638772749796637
> 911%7CU
> >>
> nknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAw
> MCIsIlAi
> >>
> OiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=
> 7wrTh
> >> 4PogFwvK6YzRgBQi2GeOus0CarX0%2Bu5DerCQWs%3D&reserved=0
> >
> > Finding out the distance between two PCI devices in physical world is role of
> the PCI bridge/switch topology builder.
> > So why should it change for the guest VM?
> 
> Because as discussed in above link, before letting virtio GPU and dGPU do
> P2P, we have to check if these two devices can communicate with each other
> by calculating the distance in physical world.
Ok, so that distance should be provided by the same way in physical and virtual world.
The hypercall should be in PCI RC subsystem, not in virtio arbitrary device.

> It's not just the role of PCI bridge/switch topology builder. 
Well, I think it is. Because this is how the physical world able to provide.

> Checking this
> distance is a necessary step before P2P like amdgpu does:
> amdgpu_dma_buf_attach
> ->pci_p2pdma_distance
> 
> > Why virtualized topology cannot be build in a way such that it matches to
> the physical world to get the accurate distance?
> > At least to me, doing this in the pci RP domain (and not PCI EP domain)
> seems the right way to go forward.
> 
> Because the BDF of virtio GPU and passthrough dGPU on virtual PCI bridge is
> different from the physical world, which makes guest kernel cannot get the
> accurate distance.
I suggested one option is QEMU to build right topology as that of the physical world so that physical and virtual systems can run same code.

> 
> Julia
> >
> >>
> >> Hi MST,
> >> We are talking about how to get the correct p2pdma_distance from host
> >> side and pass it to guest side. Do you have any suggestions?
> >>
> >> Julia
> >>
> >>>>
> >>>>>
> >>>>>> Do you mean that I should consider virtio GPU as provider by default?
> >>>>> Yes.
> >>>>>
> >>>>>> Then this command will only work for this specific use case
> >>>>>> instead of
> >>>> being
> >>>>>> used as a common function.
> >>>>>>
> >>>>> Linux kernel API is operating on the device. Similarly proposed
> >>>>> extension
> >>>> will also be on the device level.
> >>>>> If needed, an higher-level software will iterate over the
> >>>>> interesting PCI
> >>>> devices and invoke the pci_p2pdma_distance() API.
> >>>>
> >>>> The reason of passing this command from guest to host is that in
> >>>> our case, we need to calculate p2pdma_distance of virtio GPU and
> >>>> dGPU on host side. So my original design is quiet simple:
> >>>> 1. Pass rid of virtio GPU(provider) and dGPU(client) to host side 2.
> >>>> Host kernel calculate real p2pdma_distance of these devices.
> >>>> 3. Return p2pdma_distance to guest side.
> >>>>
> >>> The virtio GPU is asking information about some other non virtio
> >>> device in
> >> the system does not look correct to me.
> >>> I believe this is the PCIe topology level should be able to provide
> >>> the info
> >> without encapsulating this via some arbitrary device type like virtio gpu.
> >>>
> >>>> And I think the higher-level software should know the exact RID of
> >>>> two devices and then invoke pci_p2pdma_distance() to calculate
> distance.
> >>> Not sure. if this can be discovered in physical level, it can be
> >>> discovered at
> >> virtual guest VM level too as long as QEMU builds it in right way.
> >>>
> >>>> So
> >>>> I still think we should keep provider information in this struct
> >>>> and following the existing logic of getting p2pdma_distance.
> >>>>
> >>>> Regards,
> >>>> Julia
> >>>>
> >>>>>>>
> >>>>>>> You only need to supply the other device RID.
> >>>>>>> In PCI spec the RID is 16-bit. Considering other advancements,
> >>>>>>> RID of 32-
> >>>> bit
> >>>>>> would be ok too.
> >>>>>>>
> >>>>>>> So please remove client_*.
> >>>>>>> Replace provider_X with le32 other_device_pci_RID.
> >>>>>>>
> >>>>>>> Mention that other_device_RID follows PCI specification RID
> notation.
> >>>>>>
> >>>>>> OK I will modify it.
> >>>>>>
> >>>>>>>
> >>>>>>>>>> +
> >>>>>>>>> I assume above bdf information is what is visible to the guest?
> >>>>>>>>> Why cannot it see this distance in the guest VM itself?
> >>>>>>>>
> >>>>>>>> Guest can see virtual bdf information which is not the real bdf
> >>>>>>>> information of the host. So guest can't get the correct distance.
> >>>>>>>> That's why I need to implement a new command to get and pass
> >>>>>>>> the
> >>>> real
> >>>>>> distance.
> >>>>>>>>
> >>>>>>> [..]
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> +struct virtio_gpu_resp_distance {
> >>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>>>>>>>> +        le32 distance;
> >>>>>>>>>> +}
> >>>>>>>>>> +\end{lstlisting}
> >>>>>>>>> I am not gpu expert so didn't pay attention to details.
> >>>>>>>>> I believe you are missing basic information in the patch is
> >>>>>>>>> what is the unit of
> >>>>>>>> distance?
> >>>>>>>>> Can you please add?
> >>>>>>>>
> >>>>>>>> The definition of distance is explained in kernel file:
> >>>> driver/pci/p2pdma.c:
> >>>>>>>>
> >>>>>>> Description is good. Either it needs to be in the
> >>>>>>> include/Linux/uapi/foo.h
> >>>> file
> >>>>>> and refer to it here.
> >>>>>>> Or please add the description as part of this patch.
> >>>>>>> Reference to kernel internal files can change and hence cannot
> >>>>>>> be taken
> >>>> as
> >>>>>> reference for virtio net spec.
> >>>>>>>
> >>>>>>>> /*
> >>>>>>>>      * Calculate the P2PDMA mapping type and distance between
> >>>>>>>> two PCI devices.
> >>>>>>>>      *
> >>>>>>>>      * If the two devices are the same PCI function, return
> >>>>>>>>      * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
> >>>>>>>>      *
> >>>>>>>>      * If they are two functions of the same device, return
> >>>>>>>>      * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop
> up
> >> to
> >>>> the
> >>>>>>>> bridge,
> >>>>>>>>      * then one hop back down to another function of the same
> device).
> >>>>>>>>      *
> >>>>>>>>      * In the case where two devices are connected to the same
> >>>>>>>> PCIe
> >>>> switch,
> >>>>>>>>      * return a distance of 4. This corresponds to the following PCI
> tree:
> >>>>>>>>      *
> >>>>>>>>      *     -+  Root Port
> >>>>>>>>      *      \+ Switch Upstream Port
> >>>>>>>>      *       +-+ Switch Downstream Port 0
> >>>>>>>>      *       + \- Device A
> >>>>>>>>      *       \-+ Switch Downstream Port 1
> >>>>>>>>      *         \- Device B
> >>>>>>>>      *
> >>>>>>>>      * The distance is 4 because we traverse from Device A to
> >>>>>>>> Downstream
> >>>>>> Port 0
> >>>>>>>>      * to the common Switch Upstream Port, back down to
> >>>>>>>> Downstream
> >>>> Port
> >>>>>>>> 1 and
> >>>>>>>>      * then to Device B. The mapping type returned depends on the
> ACS
> >>>>>>>>      * redirection setting of the ports along the path.
> >>>>>>>>      *
> >>>>>>>>      * If ACS redirect is set on any port in the path, traffic between
> the
> >>>>>>>>      * devices will go through the host bridge, so return
> >>>>>>>>      * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
> >>>>>>>>      * PCI_P2PDMA_MAP_BUS_ADDR.
> >>>>>>>>      *
> >>>>>>>>      * Any two devices that have a data path that goes through
> >>>>>>>> the host
> >>>> bridge
> >>>>>>>>      * will consult a whitelist. If the host bridge is in the whitelist,
> return
> >>>>>>>>      * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set
> to
> >>>> the
> >>>>>>>> number of
> >>>>>>>>      * ports per above. If the device is not in the whitelist, return
> >>>>>>>>      * PCI_P2PDMA_MAP_NOT_SUPPORTED.
> >>>>>>>>      */
> >>>>>>>>
> >>>>>>>> I didn't change the unit, do you mean that I should add above
> >>>>>>>> information to this patch?
> >>>>>>>>
> >>>>>>> Yes. please see above.
> >>>>>>
> >>>>>> OK, will do.
> >>>>>>
> >>>>>> Julia
> >>>>>>
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Julia
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Can you please wait for few days to raise the vote and sort
> >>>>>>>>> out this
> >>>> detail?
> >>>>>>>>>
> >>>>>>>>>> +
> >>>>>>>>>> +The request data contains guest virtual pci notations of
> >>>>>>>>>> +p2pdma provider and client. The response contains the
> >>>>>>>>>> +distance between provider and client, which represents the
> >>>>>>>>>> +compatibility of these two PCI
> >>>>>>>>>> devices.
> >>>>>>>>>> +
> >>>>>>>>>>       \end{description}
> >>>>>>>>>>
> >>>>>>>>>>       \subsubsection{Device Operation:
> >>>>>>>>>> cursorq}\label{sec:Device Types / GPU Device / Device Operation
> / Device Operation:
> >>>>>>>>>> cursorq}
> >>>>>>>>>> --
> >>>>>>>>>> 2.34.1
> >>>>>>>>>>
> >>>>>>>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-11 15:14                     ` Parav Pandit
@ 2025-03-12  6:20                       ` Zhang, Julia
  2025-03-12 10:11                         ` Parav Pandit
  0 siblings, 1 reply; 21+ messages in thread
From: Zhang, Julia @ 2025-03-12  6:20 UTC (permalink / raw)
  To: Parav Pandit, Zhang, Julia, Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev



On 2025/3/11 23:14, Parav Pandit wrote:
> 
>> From: Zhang, Julia <Julia.Zhang@amd.com>
>> Sent: Tuesday, March 11, 2025 12:59 PM
>>
>> On 2025/3/11 12:00, Parav Pandit wrote:
>>>
>>>
>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>> Sent: Tuesday, March 11, 2025 8:58 AM
>>>>
>>>> On 2025/3/10 18:18, Parav Pandit wrote:
>>>>>
>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>> Sent: Monday, March 10, 2025 1:07 PM
>>>>>>
>>>>>> On 2025/3/10 11:48, Parav Pandit wrote:
>>>>>>>
>>>>>>>
>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>> Sent: Thursday, March 6, 2025 12:47 PM
>>>>>>>>
>>>>>>>> On 2025/3/6 12:29, Parav Pandit wrote:
>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
>>>>>>>>>>
>>>>>>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
>>>>>>>>>>>
>>>>>>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
>>>>>>>>>>>> To: virtio-comment@lists.linux.dev
>>>>>>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
>>>>>>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan
>> <lingshan.zhu@amd.com>;
>>>>>>>> Huang
>>>>>>>>>> Rui
>>>>>>>>>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>;
>>>> Matias
>>>>>>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
>>>>>>>>>>>>
>>>>>>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for
>>>>>>>>>>>> some
>>>>>>>> scenes.
>>>>>>>>>>>> For example, dGPU prime feature will let virtio-iGPU access
>>>>>>>>>>>> to passthrough dGPU buffer.
>>>>>>>>> Virtio specification does not describe dGPU and iGPU.
>>>>>>>>> So either describe it in the description or please drop that and
>>>>>>>>> explain the
>>>>>>>> motivation without referring to those terms.
>>>>>>>>
>>>>>>>> OK I see.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs to
>>>>>>>>>>>> check the compatibility which is represented by
>> p2pdma_distance.
>>>>>>>>>>>> This defines a command VIRTIO_GPU_CMD_P2PDMA_DISTANCE
>> to
>>>>>> allow
>>>>>>>> guest
>>>>>>>>>> send
>>>>>>>>>>>> virtual pci notation of two devices to host and get host
>>>>>>>>>>>> physical p2pdma_distance of these two PCI devices.
>>>>>>>>>>>>
>>>>>>>>> [..]
>>>>>>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
>>>>>>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen
>>>>>>>>>>>> <mvaralar@redhat.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> We are trying to implement dGPU prime feature in guest which
>>>>>>>>>>>> will let
>>>>>>>>>>>> virtio- iGPU read rendered data of passthrough dGPU. Before
>>>>>>>>>>>> that, virtio gpu driver needs to get p2pdma_distance to check
>>>>>>>>>>>> if P2P DMA transaction is possible or not.
>>>>>>>>>>>>
>>>>>>>>>>>> To implement getting p2pdma_distance, QEMU needs to handle
>>>>>>>>>>>> the command from guest with virtual pci notations of two PCI
>>>>>>>>>>>> devices and send it to host kernel and return host physical
>>>>>>>>>>>> distance back to
>>>>>>>> guest.
>>>>>>>>>>>>
>>>>>>>>>>>> So this defines the new command follow the suggestion in https:
>>>>>>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
>>>>>> julia.zhang@amd.com
>>>>>>>>>>>> /
>>>>>>>>>>>>
>>>>>>>>>>>> changes from v1:
>>>>>>>>>>>>        - add issue link to commit msg
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Julia
>>>>>>>>>>>> ---
>>>>>>>>>>>>        device-types/gpu/description.tex | 29
>>>>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>>>        1 file changed, 29 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
>>>>>>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
>>>>>>>>>>>> --- a/device-types/gpu/description.tex
>>>>>>>>>>>> +++ b/device-types/gpu/description.tex
>>>>>>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation:
>> Request
>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>>>                VIRTIO_GPU_CMD_SUBMIT_3D,
>>>>>>>>>>>>                VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>>>>>>>>>>>>                VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
>>>>>>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
>>>>>>>>>>>>
>>>>>>>>>>>>                /* cursor commands */
>>>>>>>>>>>>                VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@
>>>>>>>>>>>> -236,6
>>>>>>>> +237,7
>>>>>>>>>> @@
>>>>>>>>>>>> \subsubsection{Device Operation: Request
>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>>>                VIRTIO_GPU_RESP_OK_EDID,
>>>>>>>>>>>>                VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>>>>>>>>>>>>                VIRTIO_GPU_RESP_OK_MAP_INFO,
>>>>>>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
>>>>>>>>>>>>
>>>>>>>>>>>>                /* error responses */
>>>>>>>>>>>>                VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@ -791,6
>>>>>>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
>>>>>>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };  \end{lstlisting}
>>>>>>>>>>>>
>>>>>>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
>>>>>>>> cumulative
>>>>>>>>>>>> +distance
>>>>>>>>>>>> +  between two devices before Peer-to-Peer(P2P) transaction.
>>>>>>>>>>>> +Request data
>>>>>>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}.
>>>>>>>>>>>> +Response type is
>>>>>>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data is
>>>>>>>>>>>> +  \field{virtio_gpu_resp_distance}.
>>>>>>>>>>>> +
>>>>>>>>>>>> +\begin{lstlisting}
>>>>>>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
>>>>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>>>>>>>> +        le32 provider_bus;
>>>>>>>>>>>> +        le32 provider_slot;
>>>>>>>>>>>> +        le32 provider_func;
>>>>>>>>>>>> +        le32 client_bus;
>>>>>>>>>>>> +        le32 client_slot;
>>>>>>>>>>>> +        le32 client_func;
>>>>>>>>>>>> +};
>>>>>>>>> The requester device's PCI RID is anyway known to the responder.
>>>>>>>>> So no
>>>>>>>> need to pass it here.
>>>>>>>>> It is unclear if its provider or the client.
>>>>>>>>
>>>>>>>> I don't think requester device's PCI RID is known to the
>>>>>>>> responder, since the existing function pci_p2pdma_distance() also
>>>>>>>> require provider and client information.
>>>>>>>>
>>>>>>> The existing function takes the provider pci device as input at C code
>> level.
>>>>>>> And proposed extension (without provider BDF) will also operate on
>>>>>>> the
>>>>>> provider device on which the request/response transaction is
>> happening.
>>>>>>
>>>>>> Well at first, I just want virtio GPU to handle the command to get
>>>>>> p2pdma_distance of two PCI devices(maybe none of them is virtio
>> GPU).
>>>>> Using virtio gpu to find the distance between two arbitrary devices
>>>>> which
>>>> are not virtio is certainly outside the scope and looks wrong.
>>>>> It should be done at some PCI topology level who exposed this devices.
>>>> For our use case, virtio GPU is the provider, we are using virtio GPU
>>>> to calculate distance between virtio GPU and dGPU, which are not
>>>> arbitrary devices.
>>> The fact that in physical world this distance is calculated without involving
>> the individual provider and client EP PCI devices, in virtual world doing this via
>> virtual PCI EP devices does not look correct abstraction to me.
>>
>> In physical world, this distance is calculated by calling existing function
>> pci_p2pdma_distance(), which will pass the the provider and client. I
>> understand that you mean virtio GPU should be provider by default, but my
>> point is that when I call pci_p2pdma_distance() to calculate distance between
>> virtio GPU and dGPU, I still need to pass the RID of them to host kernel side.
>> And because host kernel doesn't know which one GPU is the virtio GPU, this
>> struct still needs to include provider information.
>>
>>> I tend to imagine that one needs to build a right virtual topology in QEMU
>> or somewhere else who is providing these PCI devices such that guest kernel
>> can calculate it accurately without depending on the EP PCI devices.
>>
>> But the virtio GPU BDF in guest is different from host side, guest kernel cannot
>> calculate the correct physical distance from guest side.
> Sure virtio GPU BDF is likely vRID.
> Guest should request the right distance calculator entity, possibly the one that is providing the connectivity (and distance), which likely is pci RC/bridge/switch etc.

Yeah, the problem here is pci RC/bridge/switch of guest is emulated, 
which is not going to provide the exact same connectivity as host does.

> 
> Imagine a virtio PCI driver asking to to the virtio PCI device, how far is it from the vCPU for NUMA awareness.
> It seems wrong to me.

It's not asking virtio PCI device to calculate the distance, it's asking 
virtio PCI device to pass the command. Calculating the distance is still 
going to request the right calculator entity, which is host PCI 
RC/bridge/switch.

> Because it is unable to follow the same flow between virtual and physical world.
> 
>>>
>>>
>>>>>
>>>>>> So
>>>>>> when guest needs to calculate distance of any two PCI devices, it
>>>>>> can reuse this command. But indeed, we only have this one use case
>>>>>> at this moment, I guess considering virtio GPU as provider and
>>>>>> remove provider PCI RID of this struct is fine.
>>>>> Right. A driver can issue the command to device without mentioning
>>>>> its own
>>>> device identifier.
>>>> Since calling function pci_p2pdma_distance() still needs to pass its
>>>> own RID, I think at least QEMU has to tell host kernel the RID of virito GPU.
>>> Probably yes, would be interesting to see this UAPI on the host kernel too,
>> (unrelated to this patch).
>>>
>>>>>
>>>>>>
>>>>>> But there's another problem: how does host kernel get virtio GPU
>>>>>> RID and calculate distance between virtio GPU and dGPU? If I remove
>>>>>> provider information of this struct, then QEMU still needs to pass
>>>>>> virtio GPU RID to host kernel side.
>>>>> The device implementation side can get to know about this as it has
>>>>> plugged
>>>> in and built the topology.
>>>>> The RID we are discussing are the vRID seen by the guest driver.
>>>>> I probably missed to clarify this.
>>>> But only getting the distance on the virtual PCI inside the guest is
>>>> not enough to figure out if P2P is possible or not. We need to get
>>>> the real distance from host side. Please take a look at the discussion here:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flor
>>>> e.kernel.org%2Fall%2Fbfe6a8b7-adcf-40e2-b7a2-
>> 4e64dc96862d%40amd.com%2
>>>>
>> F&data=05%7C02%7Cparav%40nvidia.com%7Ccf5112991dc145cd786508dd60
>> 6e772
>>>>
>> 9%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638772749796637
>> 911%7CU
>>>>
>> nknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAw
>> MCIsIlAi
>>>>
>> OiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=
>> 7wrTh
>>>> 4PogFwvK6YzRgBQi2GeOus0CarX0%2Bu5DerCQWs%3D&reserved=0
>>>
>>> Finding out the distance between two PCI devices in physical world is role of
>> the PCI bridge/switch topology builder.
>>> So why should it change for the guest VM?
>>
>> Because as discussed in above link, before letting virtio GPU and dGPU do
>> P2P, we have to check if these two devices can communicate with each other
>> by calculating the distance in physical world.
> Ok, so that distance should be provided by the same way in physical and virtual world.
> The hypercall should be in PCI RC subsystem, not in virtio arbitrary device.
> 
>> It's not just the role of PCI bridge/switch topology builder.
> Well, I think it is. Because this is how the physical world able to provide.
> 
>> Checking this
>> distance is a necessary step before P2P like amdgpu does:
>> amdgpu_dma_buf_attach
>> ->pci_p2pdma_distance
>>
>>> Why virtualized topology cannot be build in a way such that it matches to
>> the physical world to get the accurate distance?
>>> At least to me, doing this in the pci RP domain (and not PCI EP domain)
>> seems the right way to go forward.
>>
>> Because the BDF of virtio GPU and passthrough dGPU on virtual PCI bridge is
>> different from the physical world, which makes guest kernel cannot get the
>> accurate distance.
> I suggested one option is QEMU to build right topology as that of the physical world so that physical and virtual systems can run same code.

Changing the way QEMU emulate PCI topology doesn't make any sense. There 
are so many differences about PCI devices between host and guest. I 
don't think this is a good option.

Julia
> 
>>
>> Julia
>>>
>>>>
>>>> Hi MST,
>>>> We are talking about how to get the correct p2pdma_distance from host
>>>> side and pass it to guest side. Do you have any suggestions?
>>>>
>>>> Julia
>>>>
>>>>>>
>>>>>>>
>>>>>>>> Do you mean that I should consider virtio GPU as provider by default?
>>>>>>> Yes.
>>>>>>>
>>>>>>>> Then this command will only work for this specific use case
>>>>>>>> instead of
>>>>>> being
>>>>>>>> used as a common function.
>>>>>>>>
>>>>>>> Linux kernel API is operating on the device. Similarly proposed
>>>>>>> extension
>>>>>> will also be on the device level.
>>>>>>> If needed, an higher-level software will iterate over the
>>>>>>> interesting PCI
>>>>>> devices and invoke the pci_p2pdma_distance() API.
>>>>>>
>>>>>> The reason of passing this command from guest to host is that in
>>>>>> our case, we need to calculate p2pdma_distance of virtio GPU and
>>>>>> dGPU on host side. So my original design is quiet simple:
>>>>>> 1. Pass rid of virtio GPU(provider) and dGPU(client) to host side 2.
>>>>>> Host kernel calculate real p2pdma_distance of these devices.
>>>>>> 3. Return p2pdma_distance to guest side.
>>>>>>
>>>>> The virtio GPU is asking information about some other non virtio
>>>>> device in
>>>> the system does not look correct to me.
>>>>> I believe this is the PCIe topology level should be able to provide
>>>>> the info
>>>> without encapsulating this via some arbitrary device type like virtio gpu.
>>>>>
>>>>>> And I think the higher-level software should know the exact RID of
>>>>>> two devices and then invoke pci_p2pdma_distance() to calculate
>> distance.
>>>>> Not sure. if this can be discovered in physical level, it can be
>>>>> discovered at
>>>> virtual guest VM level too as long as QEMU builds it in right way.
>>>>>
>>>>>> So
>>>>>> I still think we should keep provider information in this struct
>>>>>> and following the existing logic of getting p2pdma_distance.
>>>>>>
>>>>>> Regards,
>>>>>> Julia
>>>>>>
>>>>>>>>>
>>>>>>>>> You only need to supply the other device RID.
>>>>>>>>> In PCI spec the RID is 16-bit. Considering other advancements,
>>>>>>>>> RID of 32-
>>>>>> bit
>>>>>>>> would be ok too.
>>>>>>>>>
>>>>>>>>> So please remove client_*.
>>>>>>>>> Replace provider_X with le32 other_device_pci_RID.
>>>>>>>>>
>>>>>>>>> Mention that other_device_RID follows PCI specification RID
>> notation.
>>>>>>>>
>>>>>>>> OK I will modify it.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> +
>>>>>>>>>>> I assume above bdf information is what is visible to the guest?
>>>>>>>>>>> Why cannot it see this distance in the guest VM itself?
>>>>>>>>>>
>>>>>>>>>> Guest can see virtual bdf information which is not the real bdf
>>>>>>>>>> information of the host. So guest can't get the correct distance.
>>>>>>>>>> That's why I need to implement a new command to get and pass
>>>>>>>>>> the
>>>>>> real
>>>>>>>> distance.
>>>>>>>>>>
>>>>>>>>> [..]
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +struct virtio_gpu_resp_distance {
>>>>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>>>>>>>> +        le32 distance;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +\end{lstlisting}
>>>>>>>>>>> I am not gpu expert so didn't pay attention to details.
>>>>>>>>>>> I believe you are missing basic information in the patch is
>>>>>>>>>>> what is the unit of
>>>>>>>>>> distance?
>>>>>>>>>>> Can you please add?
>>>>>>>>>>
>>>>>>>>>> The definition of distance is explained in kernel file:
>>>>>> driver/pci/p2pdma.c:
>>>>>>>>>>
>>>>>>>>> Description is good. Either it needs to be in the
>>>>>>>>> include/Linux/uapi/foo.h
>>>>>> file
>>>>>>>> and refer to it here.
>>>>>>>>> Or please add the description as part of this patch.
>>>>>>>>> Reference to kernel internal files can change and hence cannot
>>>>>>>>> be taken
>>>>>> as
>>>>>>>> reference for virtio net spec.
>>>>>>>>>
>>>>>>>>>> /*
>>>>>>>>>>       * Calculate the P2PDMA mapping type and distance between
>>>>>>>>>> two PCI devices.
>>>>>>>>>>       *
>>>>>>>>>>       * If the two devices are the same PCI function, return
>>>>>>>>>>       * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
>>>>>>>>>>       *
>>>>>>>>>>       * If they are two functions of the same device, return
>>>>>>>>>>       * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop
>> up
>>>> to
>>>>>> the
>>>>>>>>>> bridge,
>>>>>>>>>>       * then one hop back down to another function of the same
>> device).
>>>>>>>>>>       *
>>>>>>>>>>       * In the case where two devices are connected to the same
>>>>>>>>>> PCIe
>>>>>> switch,
>>>>>>>>>>       * return a distance of 4. This corresponds to the following PCI
>> tree:
>>>>>>>>>>       *
>>>>>>>>>>       *     -+  Root Port
>>>>>>>>>>       *      \+ Switch Upstream Port
>>>>>>>>>>       *       +-+ Switch Downstream Port 0
>>>>>>>>>>       *       + \- Device A
>>>>>>>>>>       *       \-+ Switch Downstream Port 1
>>>>>>>>>>       *         \- Device B
>>>>>>>>>>       *
>>>>>>>>>>       * The distance is 4 because we traverse from Device A to
>>>>>>>>>> Downstream
>>>>>>>> Port 0
>>>>>>>>>>       * to the common Switch Upstream Port, back down to
>>>>>>>>>> Downstream
>>>>>> Port
>>>>>>>>>> 1 and
>>>>>>>>>>       * then to Device B. The mapping type returned depends on the
>> ACS
>>>>>>>>>>       * redirection setting of the ports along the path.
>>>>>>>>>>       *
>>>>>>>>>>       * If ACS redirect is set on any port in the path, traffic between
>> the
>>>>>>>>>>       * devices will go through the host bridge, so return
>>>>>>>>>>       * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE; otherwise return
>>>>>>>>>>       * PCI_P2PDMA_MAP_BUS_ADDR.
>>>>>>>>>>       *
>>>>>>>>>>       * Any two devices that have a data path that goes through
>>>>>>>>>> the host
>>>>>> bridge
>>>>>>>>>>       * will consult a whitelist. If the host bridge is in the whitelist,
>> return
>>>>>>>>>>       * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE with the distance set
>> to
>>>>>> the
>>>>>>>>>> number of
>>>>>>>>>>       * ports per above. If the device is not in the whitelist, return
>>>>>>>>>>       * PCI_P2PDMA_MAP_NOT_SUPPORTED.
>>>>>>>>>>       */
>>>>>>>>>>
>>>>>>>>>> I didn't change the unit, do you mean that I should add above
>>>>>>>>>> information to this patch?
>>>>>>>>>>
>>>>>>>>> Yes. please see above.
>>>>>>>>
>>>>>>>> OK, will do.
>>>>>>>>
>>>>>>>> Julia
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Julia
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Can you please wait for few days to raise the vote and sort
>>>>>>>>>>> out this
>>>>>> detail?
>>>>>>>>>>>
>>>>>>>>>>>> +
>>>>>>>>>>>> +The request data contains guest virtual pci notations of
>>>>>>>>>>>> +p2pdma provider and client. The response contains the
>>>>>>>>>>>> +distance between provider and client, which represents the
>>>>>>>>>>>> +compatibility of these two PCI
>>>>>>>>>>>> devices.
>>>>>>>>>>>> +
>>>>>>>>>>>>        \end{description}
>>>>>>>>>>>>
>>>>>>>>>>>>        \subsubsection{Device Operation:
>>>>>>>>>>>> cursorq}\label{sec:Device Types / GPU Device / Device Operation
>> / Device Operation:
>>>>>>>>>>>> cursorq}
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.34.1
>>>>>>>>>>>>
>>>>>>>>>>>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-12  6:20                       ` Zhang, Julia
@ 2025-03-12 10:11                         ` Parav Pandit
  2025-03-13  3:31                           ` Zhang, Julia
  0 siblings, 1 reply; 21+ messages in thread
From: Parav Pandit @ 2025-03-12 10:11 UTC (permalink / raw)
  To: Zhang, Julia, Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev


> From: Zhang, Julia <Julia.Zhang@amd.com>
> Sent: Wednesday, March 12, 2025 11:50 AM
> 
> On 2025/3/11 23:14, Parav Pandit wrote:
> >
> >> From: Zhang, Julia <Julia.Zhang@amd.com>
> >> Sent: Tuesday, March 11, 2025 12:59 PM
> >>
> >> On 2025/3/11 12:00, Parav Pandit wrote:
> >>>
> >>>
> >>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>> Sent: Tuesday, March 11, 2025 8:58 AM
> >>>>
> >>>> On 2025/3/10 18:18, Parav Pandit wrote:
> >>>>>
> >>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>> Sent: Monday, March 10, 2025 1:07 PM
> >>>>>>
> >>>>>> On 2025/3/10 11:48, Parav Pandit wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>>>> Sent: Thursday, March 6, 2025 12:47 PM
> >>>>>>>>
> >>>>>>>> On 2025/3/6 12:29, Parav Pandit wrote:
> >>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
> >>>>>>>>>>
> >>>>>>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
> >>>>>>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
> >>>>>>>>>>>> To: virtio-comment@lists.linux.dev
> >>>>>>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
> >>>>>>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan
> >> <lingshan.zhu@amd.com>;
> >>>>>>>> Huang
> >>>>>>>>>> Rui
> >>>>>>>>>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>;
> >>>> Matias
> >>>>>>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
> >>>>>>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
> >>>>>>>>>>>>
> >>>>>>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for
> >>>>>>>>>>>> some
> >>>>>>>> scenes.
> >>>>>>>>>>>> For example, dGPU prime feature will let virtio-iGPU access
> >>>>>>>>>>>> to passthrough dGPU buffer.
> >>>>>>>>> Virtio specification does not describe dGPU and iGPU.
> >>>>>>>>> So either describe it in the description or please drop that
> >>>>>>>>> and explain the
> >>>>>>>> motivation without referring to those terms.
> >>>>>>>>
> >>>>>>>> OK I see.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs
> >>>>>>>>>>>> to check the compatibility which is represented by
> >> p2pdma_distance.
> >>>>>>>>>>>> This defines a command
> VIRTIO_GPU_CMD_P2PDMA_DISTANCE
> >> to
> >>>>>> allow
> >>>>>>>> guest
> >>>>>>>>>> send
> >>>>>>>>>>>> virtual pci notation of two devices to host and get host
> >>>>>>>>>>>> physical p2pdma_distance of these two PCI devices.
> >>>>>>>>>>>>
> >>>>>>>>> [..]
> >>>>>>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
> >>>>>>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
> >>>>>>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen
> >>>>>>>>>>>> <mvaralar@redhat.com>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> We are trying to implement dGPU prime feature in guest
> >>>>>>>>>>>> which will let
> >>>>>>>>>>>> virtio- iGPU read rendered data of passthrough dGPU. Before
> >>>>>>>>>>>> that, virtio gpu driver needs to get p2pdma_distance to
> >>>>>>>>>>>> check if P2P DMA transaction is possible or not.
> >>>>>>>>>>>>
> >>>>>>>>>>>> To implement getting p2pdma_distance, QEMU needs to
> handle
> >>>>>>>>>>>> the command from guest with virtual pci notations of two
> >>>>>>>>>>>> PCI devices and send it to host kernel and return host
> >>>>>>>>>>>> physical distance back to
> >>>>>>>> guest.
> >>>>>>>>>>>>
> >>>>>>>>>>>> So this defines the new command follow the suggestion in
> https:
> >>>>>>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
> >>>>>> julia.zhang@amd.com
> >>>>>>>>>>>> /
> >>>>>>>>>>>>
> >>>>>>>>>>>> changes from v1:
> >>>>>>>>>>>>        - add issue link to commit msg
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Julia
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>        device-types/gpu/description.tex | 29
> >>>>>>>>>> +++++++++++++++++++++++++++++
> >>>>>>>>>>>>        1 file changed, 29 insertions(+)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
> >>>>>>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
> >>>>>>>>>>>> --- a/device-types/gpu/description.tex
> >>>>>>>>>>>> +++ b/device-types/gpu/description.tex
> >>>>>>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation:
> >> Request
> >>>>>>>>>>>> header}\label{sec:Device Types / GPU De
> >>>>>>>>>>>>                VIRTIO_GPU_CMD_SUBMIT_3D,
> >>>>>>>>>>>>                VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
> >>>>>>>>>>>>                VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
> >>>>>>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
> >>>>>>>>>>>>
> >>>>>>>>>>>>                /* cursor commands */
> >>>>>>>>>>>>                VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@
> >>>>>>>>>>>> -236,6
> >>>>>>>> +237,7
> >>>>>>>>>> @@
> >>>>>>>>>>>> \subsubsection{Device Operation: Request
> >>>>>>>>>>>> header}\label{sec:Device Types / GPU De
> >>>>>>>>>>>>                VIRTIO_GPU_RESP_OK_EDID,
> >>>>>>>>>>>>                VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
> >>>>>>>>>>>>                VIRTIO_GPU_RESP_OK_MAP_INFO,
> >>>>>>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
> >>>>>>>>>>>>
> >>>>>>>>>>>>                /* error responses */
> >>>>>>>>>>>>                VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@
> >>>>>>>>>>>> -791,6
> >>>>>>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
> >>>>>>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };
> >>>>>>>>>>>> \end{lstlisting}
> >>>>>>>>>>>>
> >>>>>>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
> >>>>>>>> cumulative
> >>>>>>>>>>>> +distance
> >>>>>>>>>>>> +  between two devices before Peer-to-Peer(P2P) transaction.
> >>>>>>>>>>>> +Request data
> >>>>>>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}.
> >>>>>>>>>>>> +Response type is
> >>>>>>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data
> is
> >>>>>>>>>>>> +  \field{virtio_gpu_resp_distance}.
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +\begin{lstlisting}
> >>>>>>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
> >>>>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>>>>>>>>>> +        le32 provider_bus;
> >>>>>>>>>>>> +        le32 provider_slot;
> >>>>>>>>>>>> +        le32 provider_func;
> >>>>>>>>>>>> +        le32 client_bus;
> >>>>>>>>>>>> +        le32 client_slot;
> >>>>>>>>>>>> +        le32 client_func;
> >>>>>>>>>>>> +};
> >>>>>>>>> The requester device's PCI RID is anyway known to the responder.
> >>>>>>>>> So no
> >>>>>>>> need to pass it here.
> >>>>>>>>> It is unclear if its provider or the client.
> >>>>>>>>
> >>>>>>>> I don't think requester device's PCI RID is known to the
> >>>>>>>> responder, since the existing function pci_p2pdma_distance()
> >>>>>>>> also require provider and client information.
> >>>>>>>>
> >>>>>>> The existing function takes the provider pci device as input at
> >>>>>>> C code
> >> level.
> >>>>>>> And proposed extension (without provider BDF) will also operate
> >>>>>>> on the
> >>>>>> provider device on which the request/response transaction is
> >> happening.
> >>>>>>
> >>>>>> Well at first, I just want virtio GPU to handle the command to
> >>>>>> get p2pdma_distance of two PCI devices(maybe none of them is
> >>>>>> virtio
> >> GPU).
> >>>>> Using virtio gpu to find the distance between two arbitrary
> >>>>> devices which
> >>>> are not virtio is certainly outside the scope and looks wrong.
> >>>>> It should be done at some PCI topology level who exposed this
> devices.
> >>>> For our use case, virtio GPU is the provider, we are using virtio
> >>>> GPU to calculate distance between virtio GPU and dGPU, which are
> >>>> not arbitrary devices.
> >>> The fact that in physical world this distance is calculated without
> >>> involving
> >> the individual provider and client EP PCI devices, in virtual world
> >> doing this via virtual PCI EP devices does not look correct abstraction to
> me.
> >>
> >> In physical world, this distance is calculated by calling existing
> >> function pci_p2pdma_distance(), which will pass the the provider and
> >> client. I understand that you mean virtio GPU should be provider by
> >> default, but my point is that when I call pci_p2pdma_distance() to
> >> calculate distance between virtio GPU and dGPU, I still need to pass the
> RID of them to host kernel side.
> >> And because host kernel doesn't know which one GPU is the virtio GPU,
> >> this struct still needs to include provider information.
> >>
> >>> I tend to imagine that one needs to build a right virtual topology
> >>> in QEMU
> >> or somewhere else who is providing these PCI devices such that guest
> >> kernel can calculate it accurately without depending on the EP PCI devices.
> >>
> >> But the virtio GPU BDF in guest is different from host side, guest
> >> kernel cannot calculate the correct physical distance from guest side.
> > Sure virtio GPU BDF is likely vRID.
> > Guest should request the right distance calculator entity, possibly the one
> that is providing the connectivity (and distance), which likely is pci
> RC/bridge/switch etc.
> 
> Yeah, the problem here is pci RC/bridge/switch of guest is emulated, which is
> not going to provide the exact same connectivity as host does.
> 
It is the role of the emulated entity to provide topology such that guest PCI driver can discover that (and not arbitrary virtio gpu driver).

> >
> > Imagine a virtio PCI driver asking to to the virtio PCI device, how far is it
> from the vCPU for NUMA awareness.
> > It seems wrong to me.
> 
> It's not asking virtio PCI device to calculate the distance, it's asking virtio PCI
> device to pass the command. 
Pass the command is same thing as virtio driver asking virtio pci device.
How virtio pci device implements is internal details not specific to the spec.

> Calculating the distance is still going to request
> the right calculator entity, which is host PCI RC/bridge/switch.
Ok. so please have such hypercall in the PCI RC/bridge/switch domain. Not in the PCIe EP domain.

> 
> > Because it is unable to follow the same flow between virtual and physical
> world.
> >
> >>>
> >>>
> >>>>>
> >>>>>> So
> >>>>>> when guest needs to calculate distance of any two PCI devices, it
> >>>>>> can reuse this command. But indeed, we only have this one use
> >>>>>> case at this moment, I guess considering virtio GPU as provider
> >>>>>> and remove provider PCI RID of this struct is fine.
> >>>>> Right. A driver can issue the command to device without mentioning
> >>>>> its own
> >>>> device identifier.
> >>>> Since calling function pci_p2pdma_distance() still needs to pass
> >>>> its own RID, I think at least QEMU has to tell host kernel the RID of virito
> GPU.
> >>> Probably yes, would be interesting to see this UAPI on the host
> >>> kernel too,
> >> (unrelated to this patch).
> >>>
> >>>>>
> >>>>>>
> >>>>>> But there's another problem: how does host kernel get virtio GPU
> >>>>>> RID and calculate distance between virtio GPU and dGPU? If I
> >>>>>> remove provider information of this struct, then QEMU still needs
> >>>>>> to pass virtio GPU RID to host kernel side.
> >>>>> The device implementation side can get to know about this as it
> >>>>> has plugged
> >>>> in and built the topology.
> >>>>> The RID we are discussing are the vRID seen by the guest driver.
> >>>>> I probably missed to clarify this.
> >>>> But only getting the distance on the virtual PCI inside the guest
> >>>> is not enough to figure out if P2P is possible or not. We need to
> >>>> get the real distance from host side. Please take a look at the discussion
> here:
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
> >>>>
> or%2F&data=05%7C02%7Cparav%40nvidia.com%7C9538a40c4599428537590
> 8dd6
> >>>>
> 12df4ca%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C6387735722
> 4449
> >>>>
> 6025%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwL
> jAuMD
> >>>>
> AwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C
> %7C
> >>>>
> &sdata=10ypT8BuukdQ1cGF6OY77wPdOo4H3owSPHoATgCX1lw%3D&reserve
> d=0
> >>>> e.kernel.org%2Fall%2Fbfe6a8b7-adcf-40e2-b7a2-
> >> 4e64dc96862d%40amd.com%2
> >>>>
> >>
> F&data=05%7C02%7Cparav%40nvidia.com%7Ccf5112991dc145cd786508dd60
> >> 6e772
> >>>>
> >>
> 9%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638772749796637
> >> 911%7CU
> >>>>
> >>
> nknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAw
> >> MCIsIlAi
> >>>>
> >>
> OiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=
> >> 7wrTh
> >>>> 4PogFwvK6YzRgBQi2GeOus0CarX0%2Bu5DerCQWs%3D&reserved=0
> >>>
> >>> Finding out the distance between two PCI devices in physical world
> >>> is role of
> >> the PCI bridge/switch topology builder.
> >>> So why should it change for the guest VM?
> >>
> >> Because as discussed in above link, before letting virtio GPU and
> >> dGPU do P2P, we have to check if these two devices can communicate
> >> with each other by calculating the distance in physical world.
> > Ok, so that distance should be provided by the same way in physical and
> virtual world.
> > The hypercall should be in PCI RC subsystem, not in virtio arbitrary device.
> >
> >> It's not just the role of PCI bridge/switch topology builder.
> > Well, I think it is. Because this is how the physical world able to provide.
> >
> >> Checking this
> >> distance is a necessary step before P2P like amdgpu does:
> >> amdgpu_dma_buf_attach
> >> ->pci_p2pdma_distance
> >>
> >>> Why virtualized topology cannot be build in a way such that it
> >>> matches to
> >> the physical world to get the accurate distance?
> >>> At least to me, doing this in the pci RP domain (and not PCI EP
> >>> domain)
> >> seems the right way to go forward.
> >>
> >> Because the BDF of virtio GPU and passthrough dGPU on virtual PCI
> >> bridge is different from the physical world, which makes guest kernel
> >> cannot get the accurate distance.
> > I suggested one option is QEMU to build right topology as that of the
> physical world so that physical and virtual systems can run same code.
> 
> Changing the way QEMU emulate PCI topology doesn't make any sense. 
Why not? If the QEMU is enabling P2P communication, it needs to enable guest to discover the way PCI driver (not virtio) discovers in physical world.
When guest access the vCPUs the hypercall is in the vCPU domain. Vcpu's information does not come from virtio device.
Similarly, PCI topology info in my view should come from the PCI RC/Bridge/switch emulation side.
I probably repeated myself here, by providing more examples of drawing the parallels between physical and virtual world.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-12 10:11                         ` Parav Pandit
@ 2025-03-13  3:31                           ` Zhang, Julia
  2025-03-14  2:15                             ` Parav Pandit
  0 siblings, 1 reply; 21+ messages in thread
From: Zhang, Julia @ 2025-03-13  3:31 UTC (permalink / raw)
  To: Parav Pandit, Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev,
	Zhang, Julia



On 2025/3/12 18:11, Parav Pandit wrote:
> 
>> From: Zhang, Julia <Julia.Zhang@amd.com>
>> Sent: Wednesday, March 12, 2025 11:50 AM
>>
>> On 2025/3/11 23:14, Parav Pandit wrote:
>>>
>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>> Sent: Tuesday, March 11, 2025 12:59 PM
>>>>
>>>> On 2025/3/11 12:00, Parav Pandit wrote:
>>>>>
>>>>>
>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>> Sent: Tuesday, March 11, 2025 8:58 AM
>>>>>>
>>>>>> On 2025/3/10 18:18, Parav Pandit wrote:
>>>>>>>
>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>> Sent: Monday, March 10, 2025 1:07 PM
>>>>>>>>
>>>>>>>> On 2025/3/10 11:48, Parav Pandit wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>>> Sent: Thursday, March 6, 2025 12:47 PM
>>>>>>>>>>
>>>>>>>>>> On 2025/3/6 12:29, Parav Pandit wrote:
>>>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
>>>>>>>>>>>>
>>>>>>>>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
>>>>>>>>>>>>>> To: virtio-comment@lists.linux.dev
>>>>>>>>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
>>>>>>>>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan
>>>> <lingshan.zhu@amd.com>;
>>>>>>>>>> Huang
>>>>>>>>>>>> Rui
>>>>>>>>>>>>>> <ray.huang@amd.com>; Julia Zhang <julia.zhang@amd.com>;
>>>>>> Matias
>>>>>>>>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>>>>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for
>>>>>>>>>>>>>> some
>>>>>>>>>> scenes.
>>>>>>>>>>>>>> For example, dGPU prime feature will let virtio-iGPU access
>>>>>>>>>>>>>> to passthrough dGPU buffer.
>>>>>>>>>>> Virtio specification does not describe dGPU and iGPU.
>>>>>>>>>>> So either describe it in the description or please drop that
>>>>>>>>>>> and explain the
>>>>>>>>>> motivation without referring to those terms.
>>>>>>>>>>
>>>>>>>>>> OK I see.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs
>>>>>>>>>>>>>> to check the compatibility which is represented by
>>>> p2pdma_distance.
>>>>>>>>>>>>>> This defines a command
>> VIRTIO_GPU_CMD_P2PDMA_DISTANCE
>>>> to
>>>>>>>> allow
>>>>>>>>>> guest
>>>>>>>>>>>> send
>>>>>>>>>>>>>> virtual pci notation of two devices to host and get host
>>>>>>>>>>>>>> physical p2pdma_distance of these two PCI devices.
>>>>>>>>>>>>>>
>>>>>>>>>>> [..]
>>>>>>>>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
>>>>>>>>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen
>>>>>>>>>>>>>> <mvaralar@redhat.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are trying to implement dGPU prime feature in guest
>>>>>>>>>>>>>> which will let
>>>>>>>>>>>>>> virtio- iGPU read rendered data of passthrough dGPU. Before
>>>>>>>>>>>>>> that, virtio gpu driver needs to get p2pdma_distance to
>>>>>>>>>>>>>> check if P2P DMA transaction is possible or not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To implement getting p2pdma_distance, QEMU needs to
>> handle
>>>>>>>>>>>>>> the command from guest with virtual pci notations of two
>>>>>>>>>>>>>> PCI devices and send it to host kernel and return host
>>>>>>>>>>>>>> physical distance back to
>>>>>>>>>> guest.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So this defines the new command follow the suggestion in
>> https:
>>>>>>>>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
>>>>>>>> julia.zhang@amd.com
>>>>>>>>>>>>>> /
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> changes from v1:
>>>>>>>>>>>>>>         - add issue link to commit msg
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Julia
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>         device-types/gpu/description.tex | 29
>>>>>>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>>>>>         1 file changed, 29 insertions(+)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
>>>>>>>>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
>>>>>>>>>>>>>> --- a/device-types/gpu/description.tex
>>>>>>>>>>>>>> +++ b/device-types/gpu/description.tex
>>>>>>>>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation:
>>>> Request
>>>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>>>>>                 VIRTIO_GPU_CMD_SUBMIT_3D,
>>>>>>>>>>>>>>                 VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>>>>>>>>>>>>>>                 VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
>>>>>>>>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                 /* cursor commands */
>>>>>>>>>>>>>>                 VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300, @@
>>>>>>>>>>>>>> -236,6
>>>>>>>>>> +237,7
>>>>>>>>>>>> @@
>>>>>>>>>>>>>> \subsubsection{Device Operation: Request
>>>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>>>>>                 VIRTIO_GPU_RESP_OK_EDID,
>>>>>>>>>>>>>>                 VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>>>>>>>>>>>>>>                 VIRTIO_GPU_RESP_OK_MAP_INFO,
>>>>>>>>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                 /* error responses */
>>>>>>>>>>>>>>                 VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@
>>>>>>>>>>>>>> -791,6
>>>>>>>>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
>>>>>>>>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };
>>>>>>>>>>>>>> \end{lstlisting}
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates the
>>>>>>>>>> cumulative
>>>>>>>>>>>>>> +distance
>>>>>>>>>>>>>> +  between two devices before Peer-to-Peer(P2P) transaction.
>>>>>>>>>>>>>> +Request data
>>>>>>>>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}.
>>>>>>>>>>>>>> +Response type is
>>>>>>>>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response data
>> is
>>>>>>>>>>>>>> +  \field{virtio_gpu_resp_distance}.
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +\begin{lstlisting}
>>>>>>>>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
>>>>>>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>>>>>>>>>> +        le32 provider_bus;
>>>>>>>>>>>>>> +        le32 provider_slot;
>>>>>>>>>>>>>> +        le32 provider_func;
>>>>>>>>>>>>>> +        le32 client_bus;
>>>>>>>>>>>>>> +        le32 client_slot;
>>>>>>>>>>>>>> +        le32 client_func;
>>>>>>>>>>>>>> +};
>>>>>>>>>>> The requester device's PCI RID is anyway known to the responder.
>>>>>>>>>>> So no
>>>>>>>>>> need to pass it here.
>>>>>>>>>>> It is unclear if its provider or the client.
>>>>>>>>>>
>>>>>>>>>> I don't think requester device's PCI RID is known to the
>>>>>>>>>> responder, since the existing function pci_p2pdma_distance()
>>>>>>>>>> also require provider and client information.
>>>>>>>>>>
>>>>>>>>> The existing function takes the provider pci device as input at
>>>>>>>>> C code
>>>> level.
>>>>>>>>> And proposed extension (without provider BDF) will also operate
>>>>>>>>> on the
>>>>>>>> provider device on which the request/response transaction is
>>>> happening.
>>>>>>>>
>>>>>>>> Well at first, I just want virtio GPU to handle the command to
>>>>>>>> get p2pdma_distance of two PCI devices(maybe none of them is
>>>>>>>> virtio
>>>> GPU).
>>>>>>> Using virtio gpu to find the distance between two arbitrary
>>>>>>> devices which
>>>>>> are not virtio is certainly outside the scope and looks wrong.
>>>>>>> It should be done at some PCI topology level who exposed this
>> devices.
>>>>>> For our use case, virtio GPU is the provider, we are using virtio
>>>>>> GPU to calculate distance between virtio GPU and dGPU, which are
>>>>>> not arbitrary devices.
>>>>> The fact that in physical world this distance is calculated without
>>>>> involving
>>>> the individual provider and client EP PCI devices, in virtual world
>>>> doing this via virtual PCI EP devices does not look correct abstraction to
>> me.
>>>>
>>>> In physical world, this distance is calculated by calling existing
>>>> function pci_p2pdma_distance(), which will pass the the provider and
>>>> client. I understand that you mean virtio GPU should be provider by
>>>> default, but my point is that when I call pci_p2pdma_distance() to
>>>> calculate distance between virtio GPU and dGPU, I still need to pass the
>> RID of them to host kernel side.
>>>> And because host kernel doesn't know which one GPU is the virtio GPU,
>>>> this struct still needs to include provider information.
>>>>
>>>>> I tend to imagine that one needs to build a right virtual topology
>>>>> in QEMU
>>>> or somewhere else who is providing these PCI devices such that guest
>>>> kernel can calculate it accurately without depending on the EP PCI devices.
>>>>
>>>> But the virtio GPU BDF in guest is different from host side, guest
>>>> kernel cannot calculate the correct physical distance from guest side.
>>> Sure virtio GPU BDF is likely vRID.
>>> Guest should request the right distance calculator entity, possibly the one
>> that is providing the connectivity (and distance), which likely is pci
>> RC/bridge/switch etc.
>>
>> Yeah, the problem here is pci RC/bridge/switch of guest is emulated, which is
>> not going to provide the exact same connectivity as host does.
>>
> It is the role of the emulated entity to provide topology such that guest PCI driver can discover that (and not arbitrary virtio gpu driver).
> 
>>>
>>> Imagine a virtio PCI driver asking to to the virtio PCI device, how far is it
>> from the vCPU for NUMA awareness.
>>> It seems wrong to me.
>>
>> It's not asking virtio PCI device to calculate the distance, it's asking virtio PCI
>> device to pass the command.
> Pass the command is same thing as virtio driver asking virtio pci device.
> How virtio pci device implements is internal details not specific to the spec.
> 
>> Calculating the distance is still going to request
>> the right calculator entity, which is host PCI RC/bridge/switch.
> Ok. so please have such hypercall in the PCI RC/bridge/switch domain. Not in the PCIe EP domain.
> 
>>
>>> Because it is unable to follow the same flow between virtual and physical
>> world.
>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>>> So
>>>>>>>> when guest needs to calculate distance of any two PCI devices, it
>>>>>>>> can reuse this command. But indeed, we only have this one use
>>>>>>>> case at this moment, I guess considering virtio GPU as provider
>>>>>>>> and remove provider PCI RID of this struct is fine.
>>>>>>> Right. A driver can issue the command to device without mentioning
>>>>>>> its own
>>>>>> device identifier.
>>>>>> Since calling function pci_p2pdma_distance() still needs to pass
>>>>>> its own RID, I think at least QEMU has to tell host kernel the RID of virito
>> GPU.
>>>>> Probably yes, would be interesting to see this UAPI on the host
>>>>> kernel too,
>>>> (unrelated to this patch).
>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> But there's another problem: how does host kernel get virtio GPU
>>>>>>>> RID and calculate distance between virtio GPU and dGPU? If I
>>>>>>>> remove provider information of this struct, then QEMU still needs
>>>>>>>> to pass virtio GPU RID to host kernel side.
>>>>>>> The device implementation side can get to know about this as it
>>>>>>> has plugged
>>>>>> in and built the topology.
>>>>>>> The RID we are discussing are the vRID seen by the guest driver.
>>>>>>> I probably missed to clarify this.
>>>>>> But only getting the distance on the virtual PCI inside the guest
>>>>>> is not enough to figure out if P2P is possible or not. We need to
>>>>>> get the real distance from host side. Please take a look at the discussion
>> here:
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fl
>>>>>>
>> or%2F&data=05%7C02%7Cparav%40nvidia.com%7C9538a40c4599428537590
>> 8dd6
>>>>>>
>> 12df4ca%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C6387735722
>> 4449
>>>>>>
>> 6025%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwL
>> jAuMD
>>>>>>
>> AwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C
>> %7C
>>>>>>
>> &sdata=10ypT8BuukdQ1cGF6OY77wPdOo4H3owSPHoATgCX1lw%3D&reserve
>> d=0
>>>>>> e.kernel.org%2Fall%2Fbfe6a8b7-adcf-40e2-b7a2-
>>>> 4e64dc96862d%40amd.com%2
>>>>>>
>>>>
>> F&data=05%7C02%7Cparav%40nvidia.com%7Ccf5112991dc145cd786508dd60
>>>> 6e772
>>>>>>
>>>>
>> 9%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638772749796637
>>>> 911%7CU
>>>>>>
>>>>
>> nknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAw
>>>> MCIsIlAi
>>>>>>
>>>>
>> OiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=
>>>> 7wrTh
>>>>>> 4PogFwvK6YzRgBQi2GeOus0CarX0%2Bu5DerCQWs%3D&reserved=0
>>>>>
>>>>> Finding out the distance between two PCI devices in physical world
>>>>> is role of
>>>> the PCI bridge/switch topology builder.
>>>>> So why should it change for the guest VM?
>>>>
>>>> Because as discussed in above link, before letting virtio GPU and
>>>> dGPU do P2P, we have to check if these two devices can communicate
>>>> with each other by calculating the distance in physical world.
>>> Ok, so that distance should be provided by the same way in physical and
>> virtual world.
>>> The hypercall should be in PCI RC subsystem, not in virtio arbitrary device.
>>>
>>>> It's not just the role of PCI bridge/switch topology builder.
>>> Well, I think it is. Because this is how the physical world able to provide.
>>>
>>>> Checking this
>>>> distance is a necessary step before P2P like amdgpu does:
>>>> amdgpu_dma_buf_attach
>>>> ->pci_p2pdma_distance
>>>>
>>>>> Why virtualized topology cannot be build in a way such that it
>>>>> matches to
>>>> the physical world to get the accurate distance?
>>>>> At least to me, doing this in the pci RP domain (and not PCI EP
>>>>> domain)
>>>> seems the right way to go forward.
>>>>
>>>> Because the BDF of virtio GPU and passthrough dGPU on virtual PCI
>>>> bridge is different from the physical world, which makes guest kernel
>>>> cannot get the accurate distance.
>>> I suggested one option is QEMU to build right topology as that of the
>> physical world so that physical and virtual systems can run same code.
>>
>> Changing the way QEMU emulate PCI topology doesn't make any sense.
> Why not? If the QEMU is enabling P2P communication, it needs to enable guest to discover the way PCI driver (not virtio) discovers in physical world.
> When guest access the vCPUs the hypercall is in the vCPU domain. Vcpu's information does not come from virtio device.
> Similarly, PCI topology info in my view should come from the PCI RC/Bridge/switch emulation side.
> I probably repeated myself here, by providing more examples of drawing the parallels between physical and virtual world.

1. When amdgpu driver need to calculate p2pdma_distance, the PCI 
information come from amdgpu device instead of PCI RC/Bridge.

2. In guest side, PCI topology info is from QEMU emulation, which is 
different from physical world. Do you mean QEMU should emulate exact 
same PCI topology as physical world so that we can get this distance 
from guest by just calling pci_p2pdma_distance()? Can you explain more 
details about how to let guest discover the way PCI driver discovers in 
physical world?

Hi MST,
Do you have any suggestions about how to get physical p2pdma_distance 
between virtio GPU and passthrough dGPU?

Julia

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-13  3:31                           ` Zhang, Julia
@ 2025-03-14  2:15                             ` Parav Pandit
  2025-03-14  3:02                               ` Zhang, Julia
  0 siblings, 1 reply; 21+ messages in thread
From: Parav Pandit @ 2025-03-14  2:15 UTC (permalink / raw)
  To: Zhang, Julia, Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev



> From: Zhang, Julia <Julia.Zhang@amd.com>
> Sent: Thursday, March 13, 2025 9:02 AM
> 
> On 2025/3/12 18:11, Parav Pandit wrote:
> >
> >> From: Zhang, Julia <Julia.Zhang@amd.com>
> >> Sent: Wednesday, March 12, 2025 11:50 AM
> >>
> >> On 2025/3/11 23:14, Parav Pandit wrote:
> >>>
> >>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>> Sent: Tuesday, March 11, 2025 12:59 PM
> >>>>
> >>>> On 2025/3/11 12:00, Parav Pandit wrote:
> >>>>>
> >>>>>
> >>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>> Sent: Tuesday, March 11, 2025 8:58 AM
> >>>>>>
> >>>>>> On 2025/3/10 18:18, Parav Pandit wrote:
> >>>>>>>
> >>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>>>> Sent: Monday, March 10, 2025 1:07 PM
> >>>>>>>>
> >>>>>>>> On 2025/3/10 11:48, Parav Pandit wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>>>>>> Sent: Thursday, March 6, 2025 12:47 PM
> >>>>>>>>>>
> >>>>>>>>>> On 2025/3/6 12:29, Parav Pandit wrote:
> >>>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
> >>>>>>>>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
> >>>>>>>>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
> >>>>>>>>>>>>>> To: virtio-comment@lists.linux.dev
> >>>>>>>>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
> >>>>>>>>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan
> >>>> <lingshan.zhu@amd.com>;
> >>>>>>>>>> Huang
> >>>>>>>>>>>> Rui
> >>>>>>>>>>>>>> <ray.huang@amd.com>; Julia Zhang
> <julia.zhang@amd.com>;
> >>>>>> Matias
> >>>>>>>>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
> >>>>>>>>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for
> >>>>>>>>>>>>>> some
> >>>>>>>>>> scenes.
> >>>>>>>>>>>>>> For example, dGPU prime feature will let virtio-iGPU
> >>>>>>>>>>>>>> access to passthrough dGPU buffer.
> >>>>>>>>>>> Virtio specification does not describe dGPU and iGPU.
> >>>>>>>>>>> So either describe it in the description or please drop that
> >>>>>>>>>>> and explain the
> >>>>>>>>>> motivation without referring to those terms.
> >>>>>>>>>>
> >>>>>>>>>> OK I see.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs
> >>>>>>>>>>>>>> to check the compatibility which is represented by
> >>>> p2pdma_distance.
> >>>>>>>>>>>>>> This defines a command
> >> VIRTIO_GPU_CMD_P2PDMA_DISTANCE
> >>>> to
> >>>>>>>> allow
> >>>>>>>>>> guest
> >>>>>>>>>>>> send
> >>>>>>>>>>>>>> virtual pci notation of two devices to host and get host
> >>>>>>>>>>>>>> physical p2pdma_distance of these two PCI devices.
> >>>>>>>>>>>>>>
> >>>>>>>>>>> [..]
> >>>>>>>>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
> >>>>>>>>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
> >>>>>>>>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen
> >>>>>>>>>>>>>> <mvaralar@redhat.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We are trying to implement dGPU prime feature in guest
> >>>>>>>>>>>>>> which will let
> >>>>>>>>>>>>>> virtio- iGPU read rendered data of passthrough dGPU.
> >>>>>>>>>>>>>> Before that, virtio gpu driver needs to get
> >>>>>>>>>>>>>> p2pdma_distance to check if P2P DMA transaction is
> possible or not.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> To implement getting p2pdma_distance, QEMU needs to
> >> handle
> >>>>>>>>>>>>>> the command from guest with virtual pci notations of two
> >>>>>>>>>>>>>> PCI devices and send it to host kernel and return host
> >>>>>>>>>>>>>> physical distance back to
> >>>>>>>>>> guest.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So this defines the new command follow the suggestion in
> >> https:
> >>>>>>>>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
> >>>>>>>> julia.zhang@amd.com
> >>>>>>>>>>>>>> /
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> changes from v1:
> >>>>>>>>>>>>>>         - add issue link to commit msg
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Julia
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>         device-types/gpu/description.tex | 29
> >>>>>>>>>>>> +++++++++++++++++++++++++++++
> >>>>>>>>>>>>>>         1 file changed, 29 insertions(+)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
> >>>>>>>>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
> >>>>>>>>>>>>>> --- a/device-types/gpu/description.tex
> >>>>>>>>>>>>>> +++ b/device-types/gpu/description.tex
> >>>>>>>>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation:
> >>>> Request
> >>>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
> >>>>>>>>>>>>>>                 VIRTIO_GPU_CMD_SUBMIT_3D,
> >>>>>>>>>>>>>>                 VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
> >>>>>>>>>>>>>>                 VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
> >>>>>>>>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>                 /* cursor commands */
> >>>>>>>>>>>>>>                 VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300,
> @@
> >>>>>>>>>>>>>> -236,6
> >>>>>>>>>> +237,7
> >>>>>>>>>>>> @@
> >>>>>>>>>>>>>> \subsubsection{Device Operation: Request
> >>>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
> >>>>>>>>>>>>>>                 VIRTIO_GPU_RESP_OK_EDID,
> >>>>>>>>>>>>>>                 VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
> >>>>>>>>>>>>>>                 VIRTIO_GPU_RESP_OK_MAP_INFO,
> >>>>>>>>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>                 /* error responses */
> >>>>>>>>>>>>>>                 VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@
> >>>>>>>>>>>>>> -791,6
> >>>>>>>>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
> >>>>>>>>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };
> >>>>>>>>>>>>>> \end{lstlisting}
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates
> the
> >>>>>>>>>> cumulative
> >>>>>>>>>>>>>> +distance
> >>>>>>>>>>>>>> +  between two devices before Peer-to-Peer(P2P)
> transaction.
> >>>>>>>>>>>>>> +Request data
> >>>>>>>>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}.
> >>>>>>>>>>>>>> +Response type is
> >>>>>>>>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response
> data
> >> is
> >>>>>>>>>>>>>> +  \field{virtio_gpu_resp_distance}.
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +\begin{lstlisting}
> >>>>>>>>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
> >>>>>>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
> >>>>>>>>>>>>>> +        le32 provider_bus;
> >>>>>>>>>>>>>> +        le32 provider_slot;
> >>>>>>>>>>>>>> +        le32 provider_func;
> >>>>>>>>>>>>>> +        le32 client_bus;
> >>>>>>>>>>>>>> +        le32 client_slot;
> >>>>>>>>>>>>>> +        le32 client_func; };
> >>>>>>>>>>> The requester device's PCI RID is anyway known to the
> responder.
> >>>>>>>>>>> So no
> >>>>>>>>>> need to pass it here.
> >>>>>>>>>>> It is unclear if its provider or the client.
> >>>>>>>>>>
> >>>>>>>>>> I don't think requester device's PCI RID is known to the
> >>>>>>>>>> responder, since the existing function pci_p2pdma_distance()
> >>>>>>>>>> also require provider and client information.
> >>>>>>>>>>
> >>>>>>>>> The existing function takes the provider pci device as input
> >>>>>>>>> at C code
> >>>> level.
> >>>>>>>>> And proposed extension (without provider BDF) will also
> >>>>>>>>> operate on the
> >>>>>>>> provider device on which the request/response transaction is
> >>>> happening.
> >>>>>>>>
> >>>>>>>> Well at first, I just want virtio GPU to handle the command to
> >>>>>>>> get p2pdma_distance of two PCI devices(maybe none of them is
> >>>>>>>> virtio
> >>>> GPU).
> >>>>>>> Using virtio gpu to find the distance between two arbitrary
> >>>>>>> devices which
> >>>>>> are not virtio is certainly outside the scope and looks wrong.
> >>>>>>> It should be done at some PCI topology level who exposed this
> >> devices.
> >>>>>> For our use case, virtio GPU is the provider, we are using virtio
> >>>>>> GPU to calculate distance between virtio GPU and dGPU, which are
> >>>>>> not arbitrary devices.
> >>>>> The fact that in physical world this distance is calculated
> >>>>> without involving
> >>>> the individual provider and client EP PCI devices, in virtual world
> >>>> doing this via virtual PCI EP devices does not look correct
> >>>> abstraction to
> >> me.
> >>>>
> >>>> In physical world, this distance is calculated by calling existing
> >>>> function pci_p2pdma_distance(), which will pass the the provider
> >>>> and client. I understand that you mean virtio GPU should be
> >>>> provider by default, but my point is that when I call
> >>>> pci_p2pdma_distance() to calculate distance between virtio GPU and
> >>>> dGPU, I still need to pass the
> >> RID of them to host kernel side.
> >>>> And because host kernel doesn't know which one GPU is the virtio
> >>>> GPU, this struct still needs to include provider information.
> >>>>
> >>>>> I tend to imagine that one needs to build a right virtual topology
> >>>>> in QEMU
> >>>> or somewhere else who is providing these PCI devices such that
> >>>> guest kernel can calculate it accurately without depending on the EP PCI
> devices.
> >>>>
> >>>> But the virtio GPU BDF in guest is different from host side, guest
> >>>> kernel cannot calculate the correct physical distance from guest side.
> >>> Sure virtio GPU BDF is likely vRID.
> >>> Guest should request the right distance calculator entity, possibly
> >>> the one
> >> that is providing the connectivity (and distance), which likely is
> >> pci RC/bridge/switch etc.
> >>
> >> Yeah, the problem here is pci RC/bridge/switch of guest is emulated,
> >> which is not going to provide the exact same connectivity as host does.
> >>
> > It is the role of the emulated entity to provide topology such that guest PCI
> driver can discover that (and not arbitrary virtio gpu driver).
> >
> >>>
> >>> Imagine a virtio PCI driver asking to to the virtio PCI device, how
> >>> far is it
> >> from the vCPU for NUMA awareness.
> >>> It seems wrong to me.
> >>
> >> It's not asking virtio PCI device to calculate the distance, it's
> >> asking virtio PCI device to pass the command.
> > Pass the command is same thing as virtio driver asking virtio pci device.
> > How virtio pci device implements is internal details not specific to the spec.
> >
> >> Calculating the distance is still going to request the right
> >> calculator entity, which is host PCI RC/bridge/switch.
> > Ok. so please have such hypercall in the PCI RC/bridge/switch domain. Not
> in the PCIe EP domain.
> >
> >>
> >>> Because it is unable to follow the same flow between virtual and
> >>> physical
> >> world.
> >>>
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>> So
> >>>>>>>> when guest needs to calculate distance of any two PCI devices,
> >>>>>>>> it can reuse this command. But indeed, we only have this one
> >>>>>>>> use case at this moment, I guess considering virtio GPU as
> >>>>>>>> provider and remove provider PCI RID of this struct is fine.
> >>>>>>> Right. A driver can issue the command to device without
> >>>>>>> mentioning its own
> >>>>>> device identifier.
> >>>>>> Since calling function pci_p2pdma_distance() still needs to pass
> >>>>>> its own RID, I think at least QEMU has to tell host kernel the
> >>>>>> RID of virito
> >> GPU.
> >>>>> Probably yes, would be interesting to see this UAPI on the host
> >>>>> kernel too,
> >>>> (unrelated to this patch).
> >>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> But there's another problem: how does host kernel get virtio
> >>>>>>>> GPU RID and calculate distance between virtio GPU and dGPU? If
> >>>>>>>> I remove provider information of this struct, then QEMU still
> >>>>>>>> needs to pass virtio GPU RID to host kernel side.
> >>>>>>> The device implementation side can get to know about this as it
> >>>>>>> has plugged
> >>>>>> in and built the topology.
> >>>>>>> The RID we are discussing are the vRID seen by the guest driver.
> >>>>>>> I probably missed to clarify this.
> >>>>>> But only getting the distance on the virtual PCI inside the guest
> >>>>>> is not enough to figure out if P2P is possible or not. We need to
> >>>>>> get the real distance from host side. Please take a look at the
> >>>>>> discussion
> >> here:
> >>>>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2
> >>>>>>
> Fl%2F&data=05%7C02%7Cparav%40nvidia.com%7C44fb34c8931c4aaedd330
> 8d
> >>>>>>
> d61df99f8%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C6387743
> 352
> >>>>>>
> 99455726%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYi
> OiIw
> >>>>>>
> LjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%
> >>>>>>
> 7C%7C%7C&sdata=VOWFlscDn%2FxG4%2FgT0Ge1M6IuZgisTYCqzSAUJo9AJU
> c%3D
> >>>>>> &reserved=0
> >>>>>>
> >>
> or%2F&data=05%7C02%7Cparav%40nvidia.com%7C9538a40c459942853759
> 0
> >> 8dd6
> >>>>>>
> >>
> 12df4ca%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C63877357
> 22
> >> 4449
> >>>>>>
> >>
> 6025%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwL
> >> jAuMD
> >>>>>>
> >>
> AwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C
> >> %7C
> >>>>>>
> >>
> &sdata=10ypT8BuukdQ1cGF6OY77wPdOo4H3owSPHoATgCX1lw%3D&reserve
> >> d=0
> >>>>>> e.kernel.org%2Fall%2Fbfe6a8b7-adcf-40e2-b7a2-
> >>>> 4e64dc96862d%40amd.com%2
> >>>>>>
> >>>>
> >>
> F&data=05%7C02%7Cparav%40nvidia.com%7Ccf5112991dc145cd786508dd6
> 0
> >>>> 6e772
> >>>>>>
> >>>>
> >>
> 9%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C63877274979663
> 7
> >>>> 911%7CU
> >>>>>>
> >>>>
> >>
> nknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAw
> >>>> MCIsIlAi
> >>>>>>
> >>>>
> >>
> OiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=
> >>>> 7wrTh
> >>>>>> 4PogFwvK6YzRgBQi2GeOus0CarX0%2Bu5DerCQWs%3D&reserved=0
> >>>>>
> >>>>> Finding out the distance between two PCI devices in physical world
> >>>>> is role of
> >>>> the PCI bridge/switch topology builder.
> >>>>> So why should it change for the guest VM?
> >>>>
> >>>> Because as discussed in above link, before letting virtio GPU and
> >>>> dGPU do P2P, we have to check if these two devices can communicate
> >>>> with each other by calculating the distance in physical world.
> >>> Ok, so that distance should be provided by the same way in physical
> >>> and
> >> virtual world.
> >>> The hypercall should be in PCI RC subsystem, not in virtio arbitrary device.
> >>>
> >>>> It's not just the role of PCI bridge/switch topology builder.
> >>> Well, I think it is. Because this is how the physical world able to provide.
> >>>
> >>>> Checking this
> >>>> distance is a necessary step before P2P like amdgpu does:
> >>>> amdgpu_dma_buf_attach
> >>>> ->pci_p2pdma_distance
> >>>>
> >>>>> Why virtualized topology cannot be build in a way such that it
> >>>>> matches to
> >>>> the physical world to get the accurate distance?
> >>>>> At least to me, doing this in the pci RP domain (and not PCI EP
> >>>>> domain)
> >>>> seems the right way to go forward.
> >>>>
> >>>> Because the BDF of virtio GPU and passthrough dGPU on virtual PCI
> >>>> bridge is different from the physical world, which makes guest
> >>>> kernel cannot get the accurate distance.
> >>> I suggested one option is QEMU to build right topology as that of
> >>> the
> >> physical world so that physical and virtual systems can run same code.
> >>
> >> Changing the way QEMU emulate PCI topology doesn't make any sense.
> > Why not? If the QEMU is enabling P2P communication, it needs to enable
> guest to discover the way PCI driver (not virtio) discovers in physical world.
> > When guest access the vCPUs the hypercall is in the vCPU domain. Vcpu's
> information does not come from virtio device.
> > Similarly, PCI topology info in my view should come from the PCI
> RC/Bridge/switch emulation side.
> > I probably repeated myself here, by providing more examples of drawing the
> parallels between physical and virtual world.
> 
> 1. When amdgpu driver need to calculate p2pdma_distance, the PCI
> information come from amdgpu device instead of PCI RC/Bridge.
>
Do you mean amdgpu driver invokes a driver internal API to talk to amd device and it does not invoke pci_p2pdma_distance()?

I have no comments from such driver as its outside the virtio scope.
 
> 2. In guest side, PCI topology info is from QEMU emulation, which is different
> from physical world. Do you mean QEMU should emulate exact same PCI
> topology as physical world so that we can get this distance from guest by just
> calling pci_p2pdma_distance()? 
Yep.

> Can you explain more details about how to let
> guest discover the way PCI driver discovers in physical world?
>
This is QEMU or VMM level implementation question on how to build the synthetic topology.
In libvirt and other places, there is a way to describe the PCI level BDF etc details.
You might want to check with the right owners on how to achieve or extend it.
 
> Hi MST,
> Do you have any suggestions about how to get physical p2pdma_distance
> between virtio GPU and passthrough dGPU?
> 
> Julia

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-14  2:15                             ` Parav Pandit
@ 2025-03-14  3:02                               ` Zhang, Julia
  2025-03-19 10:45                                 ` Zhang, Julia
  0 siblings, 1 reply; 21+ messages in thread
From: Zhang, Julia @ 2025-03-14  3:02 UTC (permalink / raw)
  To: Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev,
	Parav Pandit, Zhang, Julia



On 2025/3/14 10:15, Parav Pandit wrote:
> 
> 
>> From: Zhang, Julia <Julia.Zhang@amd.com>
>> Sent: Thursday, March 13, 2025 9:02 AM
>>
>> On 2025/3/12 18:11, Parav Pandit wrote:
>>>
>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>> Sent: Wednesday, March 12, 2025 11:50 AM
>>>>
>>>> On 2025/3/11 23:14, Parav Pandit wrote:
>>>>>
>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>> Sent: Tuesday, March 11, 2025 12:59 PM
>>>>>>
>>>>>> On 2025/3/11 12:00, Parav Pandit wrote:
>>>>>>>
>>>>>>>
>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>> Sent: Tuesday, March 11, 2025 8:58 AM
>>>>>>>>
>>>>>>>> On 2025/3/10 18:18, Parav Pandit wrote:
>>>>>>>>>
>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>>> Sent: Monday, March 10, 2025 1:07 PM
>>>>>>>>>>
>>>>>>>>>> On 2025/3/10 11:48, Parav Pandit wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>>>>> Sent: Thursday, March 6, 2025 12:47 PM
>>>>>>>>>>>>
>>>>>>>>>>>> On 2025/3/6 12:29, Parav Pandit wrote:
>>>>>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>>>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
>>>>>>>>>>>>>>>> To: virtio-comment@lists.linux.dev
>>>>>>>>>>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
>>>>>>>>>>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan
>>>>>> <lingshan.zhu@amd.com>;
>>>>>>>>>>>> Huang
>>>>>>>>>>>>>> Rui
>>>>>>>>>>>>>>>> <ray.huang@amd.com>; Julia Zhang
>> <julia.zhang@amd.com>;
>>>>>>>> Matias
>>>>>>>>>>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>>>>>>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for
>>>>>>>>>>>>>>>> some
>>>>>>>>>>>> scenes.
>>>>>>>>>>>>>>>> For example, dGPU prime feature will let virtio-iGPU
>>>>>>>>>>>>>>>> access to passthrough dGPU buffer.
>>>>>>>>>>>>> Virtio specification does not describe dGPU and iGPU.
>>>>>>>>>>>>> So either describe it in the description or please drop that
>>>>>>>>>>>>> and explain the
>>>>>>>>>>>> motivation without referring to those terms.
>>>>>>>>>>>>
>>>>>>>>>>>> OK I see.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs
>>>>>>>>>>>>>>>> to check the compatibility which is represented by
>>>>>> p2pdma_distance.
>>>>>>>>>>>>>>>> This defines a command
>>>> VIRTIO_GPU_CMD_P2PDMA_DISTANCE
>>>>>> to
>>>>>>>>>> allow
>>>>>>>>>>>> guest
>>>>>>>>>>>>>> send
>>>>>>>>>>>>>>>> virtual pci notation of two devices to host and get host
>>>>>>>>>>>>>>>> physical p2pdma_distance of these two PCI devices.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>> [..]
>>>>>>>>>>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
>>>>>>>>>>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen
>>>>>>>>>>>>>>>> <mvaralar@redhat.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are trying to implement dGPU prime feature in guest
>>>>>>>>>>>>>>>> which will let
>>>>>>>>>>>>>>>> virtio- iGPU read rendered data of passthrough dGPU.
>>>>>>>>>>>>>>>> Before that, virtio gpu driver needs to get
>>>>>>>>>>>>>>>> p2pdma_distance to check if P2P DMA transaction is
>> possible or not.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To implement getting p2pdma_distance, QEMU needs to
>>>> handle
>>>>>>>>>>>>>>>> the command from guest with virtual pci notations of two
>>>>>>>>>>>>>>>> PCI devices and send it to host kernel and return host
>>>>>>>>>>>>>>>> physical distance back to
>>>>>>>>>>>> guest.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So this defines the new command follow the suggestion in
>>>> https:
>>>>>>>>>>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
>>>>>>>>>> julia.zhang@amd.com
>>>>>>>>>>>>>>>> /
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> changes from v1:
>>>>>>>>>>>>>>>>          - add issue link to commit msg
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Julia
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>          device-types/gpu/description.tex | 29
>>>>>>>>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>          1 file changed, 29 insertions(+)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
>>>>>>>>>>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
>>>>>>>>>>>>>>>> --- a/device-types/gpu/description.tex
>>>>>>>>>>>>>>>> +++ b/device-types/gpu/description.tex
>>>>>>>>>>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation:
>>>>>> Request
>>>>>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>>>>>>>                  VIRTIO_GPU_CMD_SUBMIT_3D,
>>>>>>>>>>>>>>>>                  VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>>>>>>>>>>>>>>>>                  VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
>>>>>>>>>>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                  /* cursor commands */
>>>>>>>>>>>>>>>>                  VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300,
>> @@
>>>>>>>>>>>>>>>> -236,6
>>>>>>>>>>>> +237,7
>>>>>>>>>>>>>> @@
>>>>>>>>>>>>>>>> \subsubsection{Device Operation: Request
>>>>>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>>>>>>>                  VIRTIO_GPU_RESP_OK_EDID,
>>>>>>>>>>>>>>>>                  VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>>>>>>>>>>>>>>>>                  VIRTIO_GPU_RESP_OK_MAP_INFO,
>>>>>>>>>>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>                  /* error responses */
>>>>>>>>>>>>>>>>                  VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@
>>>>>>>>>>>>>>>> -791,6
>>>>>>>>>>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
>>>>>>>>>>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };
>>>>>>>>>>>>>>>> \end{lstlisting}
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates
>> the
>>>>>>>>>>>> cumulative
>>>>>>>>>>>>>>>> +distance
>>>>>>>>>>>>>>>> +  between two devices before Peer-to-Peer(P2P)
>> transaction.
>>>>>>>>>>>>>>>> +Request data
>>>>>>>>>>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}.
>>>>>>>>>>>>>>>> +Response type is
>>>>>>>>>>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response
>> data
>>>> is
>>>>>>>>>>>>>>>> +  \field{virtio_gpu_resp_distance}.
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +\begin{lstlisting}
>>>>>>>>>>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
>>>>>>>>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>>>>>>>>>>>> +        le32 provider_bus;
>>>>>>>>>>>>>>>> +        le32 provider_slot;
>>>>>>>>>>>>>>>> +        le32 provider_func;
>>>>>>>>>>>>>>>> +        le32 client_bus;
>>>>>>>>>>>>>>>> +        le32 client_slot;
>>>>>>>>>>>>>>>> +        le32 client_func; };
>>>>>>>>>>>>> The requester device's PCI RID is anyway known to the
>> responder.
>>>>>>>>>>>>> So no
>>>>>>>>>>>> need to pass it here.
>>>>>>>>>>>>> It is unclear if its provider or the client.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think requester device's PCI RID is known to the
>>>>>>>>>>>> responder, since the existing function pci_p2pdma_distance()
>>>>>>>>>>>> also require provider and client information.
>>>>>>>>>>>>
>>>>>>>>>>> The existing function takes the provider pci device as input
>>>>>>>>>>> at C code
>>>>>> level.
>>>>>>>>>>> And proposed extension (without provider BDF) will also
>>>>>>>>>>> operate on the
>>>>>>>>>> provider device on which the request/response transaction is
>>>>>> happening.
>>>>>>>>>>
>>>>>>>>>> Well at first, I just want virtio GPU to handle the command to
>>>>>>>>>> get p2pdma_distance of two PCI devices(maybe none of them is
>>>>>>>>>> virtio
>>>>>> GPU).
>>>>>>>>> Using virtio gpu to find the distance between two arbitrary
>>>>>>>>> devices which
>>>>>>>> are not virtio is certainly outside the scope and looks wrong.
>>>>>>>>> It should be done at some PCI topology level who exposed this
>>>> devices.
>>>>>>>> For our use case, virtio GPU is the provider, we are using virtio
>>>>>>>> GPU to calculate distance between virtio GPU and dGPU, which are
>>>>>>>> not arbitrary devices.
>>>>>>> The fact that in physical world this distance is calculated
>>>>>>> without involving
>>>>>> the individual provider and client EP PCI devices, in virtual world
>>>>>> doing this via virtual PCI EP devices does not look correct
>>>>>> abstraction to
>>>> me.
>>>>>>
>>>>>> In physical world, this distance is calculated by calling existing
>>>>>> function pci_p2pdma_distance(), which will pass the the provider
>>>>>> and client. I understand that you mean virtio GPU should be
>>>>>> provider by default, but my point is that when I call
>>>>>> pci_p2pdma_distance() to calculate distance between virtio GPU and
>>>>>> dGPU, I still need to pass the
>>>> RID of them to host kernel side.
>>>>>> And because host kernel doesn't know which one GPU is the virtio
>>>>>> GPU, this struct still needs to include provider information.
>>>>>>
>>>>>>> I tend to imagine that one needs to build a right virtual topology
>>>>>>> in QEMU
>>>>>> or somewhere else who is providing these PCI devices such that
>>>>>> guest kernel can calculate it accurately without depending on the EP PCI
>> devices.
>>>>>>
>>>>>> But the virtio GPU BDF in guest is different from host side, guest
>>>>>> kernel cannot calculate the correct physical distance from guest side.
>>>>> Sure virtio GPU BDF is likely vRID.
>>>>> Guest should request the right distance calculator entity, possibly
>>>>> the one
>>>> that is providing the connectivity (and distance), which likely is
>>>> pci RC/bridge/switch etc.
>>>>
>>>> Yeah, the problem here is pci RC/bridge/switch of guest is emulated,
>>>> which is not going to provide the exact same connectivity as host does.
>>>>
>>> It is the role of the emulated entity to provide topology such that guest PCI
>> driver can discover that (and not arbitrary virtio gpu driver).
>>>
>>>>>
>>>>> Imagine a virtio PCI driver asking to to the virtio PCI device, how
>>>>> far is it
>>>> from the vCPU for NUMA awareness.
>>>>> It seems wrong to me.
>>>>
>>>> It's not asking virtio PCI device to calculate the distance, it's
>>>> asking virtio PCI device to pass the command.
>>> Pass the command is same thing as virtio driver asking virtio pci device.
>>> How virtio pci device implements is internal details not specific to the spec.
>>>
>>>> Calculating the distance is still going to request the right
>>>> calculator entity, which is host PCI RC/bridge/switch.
>>> Ok. so please have such hypercall in the PCI RC/bridge/switch domain. Not
>> in the PCIe EP domain.
>>>
>>>>
>>>>> Because it is unable to follow the same flow between virtual and
>>>>> physical
>>>> world.
>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>> So
>>>>>>>>>> when guest needs to calculate distance of any two PCI devices,
>>>>>>>>>> it can reuse this command. But indeed, we only have this one
>>>>>>>>>> use case at this moment, I guess considering virtio GPU as
>>>>>>>>>> provider and remove provider PCI RID of this struct is fine.
>>>>>>>>> Right. A driver can issue the command to device without
>>>>>>>>> mentioning its own
>>>>>>>> device identifier.
>>>>>>>> Since calling function pci_p2pdma_distance() still needs to pass
>>>>>>>> its own RID, I think at least QEMU has to tell host kernel the
>>>>>>>> RID of virito
>>>> GPU.
>>>>>>> Probably yes, would be interesting to see this UAPI on the host
>>>>>>> kernel too,
>>>>>> (unrelated to this patch).
>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But there's another problem: how does host kernel get virtio
>>>>>>>>>> GPU RID and calculate distance between virtio GPU and dGPU? If
>>>>>>>>>> I remove provider information of this struct, then QEMU still
>>>>>>>>>> needs to pass virtio GPU RID to host kernel side.
>>>>>>>>> The device implementation side can get to know about this as it
>>>>>>>>> has plugged
>>>>>>>> in and built the topology.
>>>>>>>>> The RID we are discussing are the vRID seen by the guest driver.
>>>>>>>>> I probably missed to clarify this.
>>>>>>>> But only getting the distance on the virtual PCI inside the guest
>>>>>>>> is not enough to figure out if P2P is possible or not. We need to
>>>>>>>> get the real distance from host side. Please take a look at the
>>>>>>>> discussion
>>>> here:
>>>>>>>>
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2
>>>>>>>>
>> Fl%2F&data=05%7C02%7Cparav%40nvidia.com%7C44fb34c8931c4aaedd330
>> 8d
>>>>>>>>
>> d61df99f8%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C6387743
>> 352
>>>>>>>>
>> 99455726%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYi
>> OiIw
>>>>>>>>
>> LjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%
>>>>>>>>
>> 7C%7C%7C&sdata=VOWFlscDn%2FxG4%2FgT0Ge1M6IuZgisTYCqzSAUJo9AJU
>> c%3D
>>>>>>>> &reserved=0
>>>>>>>>
>>>>
>> or%2F&data=05%7C02%7Cparav%40nvidia.com%7C9538a40c459942853759
>> 0
>>>> 8dd6
>>>>>>>>
>>>>
>> 12df4ca%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C63877357
>> 22
>>>> 4449
>>>>>>>>
>>>>
>> 6025%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwL
>>>> jAuMD
>>>>>>>>
>>>>
>> AwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C
>>>> %7C
>>>>>>>>
>>>>
>> &sdata=10ypT8BuukdQ1cGF6OY77wPdOo4H3owSPHoATgCX1lw%3D&reserve
>>>> d=0
>>>>>>>> e.kernel.org%2Fall%2Fbfe6a8b7-adcf-40e2-b7a2-
>>>>>> 4e64dc96862d%40amd.com%2
>>>>>>>>
>>>>>>
>>>>
>> F&data=05%7C02%7Cparav%40nvidia.com%7Ccf5112991dc145cd786508dd6
>> 0
>>>>>> 6e772
>>>>>>>>
>>>>>>
>>>>
>> 9%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C63877274979663
>> 7
>>>>>> 911%7CU
>>>>>>>>
>>>>>>
>>>>
>> nknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAw
>>>>>> MCIsIlAi
>>>>>>>>
>>>>>>
>>>>
>> OiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=
>>>>>> 7wrTh
>>>>>>>> 4PogFwvK6YzRgBQi2GeOus0CarX0%2Bu5DerCQWs%3D&reserved=0
>>>>>>>
>>>>>>> Finding out the distance between two PCI devices in physical world
>>>>>>> is role of
>>>>>> the PCI bridge/switch topology builder.
>>>>>>> So why should it change for the guest VM?
>>>>>>
>>>>>> Because as discussed in above link, before letting virtio GPU and
>>>>>> dGPU do P2P, we have to check if these two devices can communicate
>>>>>> with each other by calculating the distance in physical world.
>>>>> Ok, so that distance should be provided by the same way in physical
>>>>> and
>>>> virtual world.
>>>>> The hypercall should be in PCI RC subsystem, not in virtio arbitrary device.
>>>>>
>>>>>> It's not just the role of PCI bridge/switch topology builder.
>>>>> Well, I think it is. Because this is how the physical world able to provide.
>>>>>
>>>>>> Checking this
>>>>>> distance is a necessary step before P2P like amdgpu does:
>>>>>> amdgpu_dma_buf_attach
>>>>>> ->pci_p2pdma_distance
>>>>>>
>>>>>>> Why virtualized topology cannot be build in a way such that it
>>>>>>> matches to
>>>>>> the physical world to get the accurate distance?
>>>>>>> At least to me, doing this in the pci RP domain (and not PCI EP
>>>>>>> domain)
>>>>>> seems the right way to go forward.
>>>>>>
>>>>>> Because the BDF of virtio GPU and passthrough dGPU on virtual PCI
>>>>>> bridge is different from the physical world, which makes guest
>>>>>> kernel cannot get the accurate distance.
>>>>> I suggested one option is QEMU to build right topology as that of
>>>>> the
>>>> physical world so that physical and virtual systems can run same code.
>>>>
>>>> Changing the way QEMU emulate PCI topology doesn't make any sense.
>>> Why not? If the QEMU is enabling P2P communication, it needs to enable
>> guest to discover the way PCI driver (not virtio) discovers in physical world.
>>> When guest access the vCPUs the hypercall is in the vCPU domain. Vcpu's
>> information does not come from virtio device.
>>> Similarly, PCI topology info in my view should come from the PCI
>> RC/Bridge/switch emulation side.
>>> I probably repeated myself here, by providing more examples of drawing the
>> parallels between physical and virtual world.
>>
>> 1. When amdgpu driver need to calculate p2pdma_distance, the PCI
>> information come from amdgpu device instead of PCI RC/Bridge.
>>
> Do you mean amdgpu driver invokes a driver internal API to talk to amd device and it does not invoke pci_p2pdma_distance()?
> 
> I have no comments from such driver as its outside the virtio scope.
>   
>> 2. In guest side, PCI topology info is from QEMU emulation, which is different
>> from physical world. Do you mean QEMU should emulate exact same PCI
>> topology as physical world so that we can get this distance from guest by just
>> calling pci_p2pdma_distance()?
> Yep.
> 
>> Can you explain more details about how to let
>> guest discover the way PCI driver discovers in physical world?
>>
> This is QEMU or VMM level implementation question on how to build the synthetic topology.
> In libvirt and other places, there is a way to describe the PCI level BDF etc details.
> You might want to check with the right owners on how to achieve or extend it.

Hi MST,
As discussed above, Parav suggested to rebuild PCI topology in QEMU 
level to make virtual PCI topology 100% same as physical PCI topology so 
that we can get accurate p2pdma_distance from guest directly. Since you 
are one of the maintainers of QEMU, can you give us some suggestions or 
explain more details about how to achieve it?

Thanks,
Julia
>   
>> Hi MST,
>> Do you have any suggestions about how to get physical p2pdma_distance
>> between virtio GPU and passthrough dGPU?
>>
>> Julia

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-14  3:02                               ` Zhang, Julia
@ 2025-03-19 10:45                                 ` Zhang, Julia
  2025-03-19 12:23                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 21+ messages in thread
From: Zhang, Julia @ 2025-03-19 10:45 UTC (permalink / raw)
  To: Michael S . Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev,
	Parav Pandit, Zhang, Julia



On 2025/3/14 11:02, Zhang, Julia wrote:
> 
> 
> On 2025/3/14 10:15, Parav Pandit wrote:
>>
>>
>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>> Sent: Thursday, March 13, 2025 9:02 AM
>>>
>>> On 2025/3/12 18:11, Parav Pandit wrote:
>>>>
>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>> Sent: Wednesday, March 12, 2025 11:50 AM
>>>>>
>>>>> On 2025/3/11 23:14, Parav Pandit wrote:
>>>>>>
>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>> Sent: Tuesday, March 11, 2025 12:59 PM
>>>>>>>
>>>>>>> On 2025/3/11 12:00, Parav Pandit wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>> Sent: Tuesday, March 11, 2025 8:58 AM
>>>>>>>>>
>>>>>>>>> On 2025/3/10 18:18, Parav Pandit wrote:
>>>>>>>>>>
>>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>>>> Sent: Monday, March 10, 2025 1:07 PM
>>>>>>>>>>>
>>>>>>>>>>> On 2025/3/10 11:48, Parav Pandit wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>>>>>> Sent: Thursday, March 6, 2025 12:47 PM
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2025/3/6 12:29, Parav Pandit wrote:
>>>>>>>>>>>>>>> From: Zhang, Julia <Julia.Zhang@amd.com>
>>>>>>>>>>>>>>> Sent: Wednesday, March 5, 2025 4:12 PM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2025/2/26 18:51, Parav Pandit wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> From: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>>>>>>>>> Sent: Wednesday, February 26, 2025 2:58 PM
>>>>>>>>>>>>>>>>> To: virtio-comment@lists.linux.dev
>>>>>>>>>>>>>>>>> Cc: Michael S . Tsirkin <mst@redhat.com>; Chen Jiqian
>>>>>>>>>>>>>>>>> <Jiqian.Chen@amd.com>; Zhu Lingshan
>>>>>>> <lingshan.zhu@amd.com>;
>>>>>>>>>>>>> Huang
>>>>>>>>>>>>>>> Rui
>>>>>>>>>>>>>>>>> <ray.huang@amd.com>; Julia Zhang
>>> <julia.zhang@amd.com>;
>>>>>>>>> Matias
>>>>>>>>>>>>>>>>> Ezequiel Vara Larsen <mvaralar@redhat.com>
>>>>>>>>>>>>>>>>> Subject: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PCI peer-to-peer DMA transaction may be used in guest for
>>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>> scenes.
>>>>>>>>>>>>>>>>> For example, dGPU prime feature will let virtio-iGPU
>>>>>>>>>>>>>>>>> access to passthrough dGPU buffer.
>>>>>>>>>>>>>> Virtio specification does not describe dGPU and iGPU.
>>>>>>>>>>>>>> So either describe it in the description or please drop that
>>>>>>>>>>>>>> and explain the
>>>>>>>>>>>>> motivation without referring to those terms.
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK I see.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To support P2P DMA transaction in guest, virtio-gpu needs
>>>>>>>>>>>>>>>>> to check the compatibility which is represented by
>>>>>>> p2pdma_distance.
>>>>>>>>>>>>>>>>> This defines a command
>>>>> VIRTIO_GPU_CMD_P2PDMA_DISTANCE
>>>>>>> to
>>>>>>>>>>> allow
>>>>>>>>>>>>> guest
>>>>>>>>>>>>>>> send
>>>>>>>>>>>>>>>>> virtual pci notation of two devices to host and get host
>>>>>>>>>>>>>>>>> physical p2pdma_distance of these two PCI devices.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [..]
>>>>>>>>>>>>>>>>> Ifsm1M5xIb37T9Tg%3D&reserved=0
>>>>>>>>>>>>>>>>> Signed-off-by: Julia Zhang <julia.zhang@amd.com>
>>>>>>>>>>>>>>>>> Reviewed-by: Matias Ezequiel Vara Larsen
>>>>>>>>>>>>>>>>> <mvaralar@redhat.com>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We are trying to implement dGPU prime feature in guest
>>>>>>>>>>>>>>>>> which will let
>>>>>>>>>>>>>>>>> virtio- iGPU read rendered data of passthrough dGPU.
>>>>>>>>>>>>>>>>> Before that, virtio gpu driver needs to get
>>>>>>>>>>>>>>>>> p2pdma_distance to check if P2P DMA transaction is
>>> possible or not.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To implement getting p2pdma_distance, QEMU needs to
>>>>> handle
>>>>>>>>>>>>>>>>> the command from guest with virtual pci notations of two
>>>>>>>>>>>>>>>>> PCI devices and send it to host kernel and return host
>>>>>>>>>>>>>>>>> physical distance back to
>>>>>>>>>>>>> guest.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So this defines the new command follow the suggestion in
>>>>> https:
>>>>>>>>>>>>>>>>> //lore.kernel.org/all/20241207105537.542441-4-
>>>>>>>>>>> julia.zhang@amd.com
>>>>>>>>>>>>>>>>> /
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> changes from v1:
>>>>>>>>>>>>>>>>>           - add issue link to commit msg
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Julia
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>           device-types/gpu/description.tex | 29
>>>>>>>>>>>>>>> +++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>           1 file changed, 29 insertions(+)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/device-types/gpu/description.tex b/device-
>>>>>>>>>>>>>>>>> types/gpu/description.tex index 4435248..9d0f30b 100644
>>>>>>>>>>>>>>>>> --- a/device-types/gpu/description.tex
>>>>>>>>>>>>>>>>> +++ b/device-types/gpu/description.tex
>>>>>>>>>>>>>>>>> @@ -223,6 +223,7 @@ \subsubsection{Device Operation:
>>>>>>> Request
>>>>>>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>>>>>>>>                   VIRTIO_GPU_CMD_SUBMIT_3D,
>>>>>>>>>>>>>>>>>                   VIRTIO_GPU_CMD_RESOURCE_MAP_BLOB,
>>>>>>>>>>>>>>>>>                   VIRTIO_GPU_CMD_RESOURCE_UNMAP_BLOB,
>>>>>>>>>>>>>>>>> +        VIRTIO_GPU_CMD_P2PDMA_DISTANCE,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>                   /* cursor commands */
>>>>>>>>>>>>>>>>>                   VIRTIO_GPU_CMD_UPDATE_CURSOR = 0x0300,
>>> @@
>>>>>>>>>>>>>>>>> -236,6
>>>>>>>>>>>>> +237,7
>>>>>>>>>>>>>>> @@
>>>>>>>>>>>>>>>>> \subsubsection{Device Operation: Request
>>>>>>>>>>>>>>>>> header}\label{sec:Device Types / GPU De
>>>>>>>>>>>>>>>>>                   VIRTIO_GPU_RESP_OK_EDID,
>>>>>>>>>>>>>>>>>                   VIRTIO_GPU_RESP_OK_RESOURCE_UUID,
>>>>>>>>>>>>>>>>>                   VIRTIO_GPU_RESP_OK_MAP_INFO,
>>>>>>>>>>>>>>>>> +        VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>                   /* error responses */
>>>>>>>>>>>>>>>>>                   VIRTIO_GPU_RESP_ERR_UNSPEC = 0x1200, @@
>>>>>>>>>>>>>>>>> -791,6
>>>>>>>>>>>>>>>>> +793,33 @@ \subsubsection{Device Operation: controlq
>>>>>>>>>>>>>>>>> (3d)}\label{sec:Device Types / GPU Dev  };
>>>>>>>>>>>>>>>>> \end{lstlisting}
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +\item[VIRTIO_GPU_CMD_P2PDMA_DISTANCE] calculates
>>> the
>>>>>>>>>>>>> cumulative
>>>>>>>>>>>>>>>>> +distance
>>>>>>>>>>>>>>>>> +  between two devices before Peer-to-Peer(P2P)
>>> transaction.
>>>>>>>>>>>>>>>>> +Request data
>>>>>>>>>>>>>>>>> +  is \field{struct virtio_gpu_device_p2pdma_distance}.
>>>>>>>>>>>>>>>>> +Response type is
>>>>>>>>>>>>>>>>> +  VIRTIO_GPU_RESP_OK_P2PDMA_DISTANCE. Response
>>> data
>>>>> is
>>>>>>>>>>>>>>>>> +  \field{virtio_gpu_resp_distance}.
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> +\begin{lstlisting}
>>>>>>>>>>>>>>>>> +struct virtio_gpu_device_p2pdma_distance {
>>>>>>>>>>>>>>>>> +        struct virtio_gpu_ctrl_hdr hdr;
>>>>>>>>>>>>>>>>> +        le32 provider_bus;
>>>>>>>>>>>>>>>>> +        le32 provider_slot;
>>>>>>>>>>>>>>>>> +        le32 provider_func;
>>>>>>>>>>>>>>>>> +        le32 client_bus;
>>>>>>>>>>>>>>>>> +        le32 client_slot;
>>>>>>>>>>>>>>>>> +        le32 client_func; };
>>>>>>>>>>>>>> The requester device's PCI RID is anyway known to the
>>> responder.
>>>>>>>>>>>>>> So no
>>>>>>>>>>>>> need to pass it here.
>>>>>>>>>>>>>> It is unclear if its provider or the client.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think requester device's PCI RID is known to the
>>>>>>>>>>>>> responder, since the existing function pci_p2pdma_distance()
>>>>>>>>>>>>> also require provider and client information.
>>>>>>>>>>>>>
>>>>>>>>>>>> The existing function takes the provider pci device as input
>>>>>>>>>>>> at C code
>>>>>>> level.
>>>>>>>>>>>> And proposed extension (without provider BDF) will also
>>>>>>>>>>>> operate on the
>>>>>>>>>>> provider device on which the request/response transaction is
>>>>>>> happening.
>>>>>>>>>>>
>>>>>>>>>>> Well at first, I just want virtio GPU to handle the command to
>>>>>>>>>>> get p2pdma_distance of two PCI devices(maybe none of them is
>>>>>>>>>>> virtio
>>>>>>> GPU).
>>>>>>>>>> Using virtio gpu to find the distance between two arbitrary
>>>>>>>>>> devices which
>>>>>>>>> are not virtio is certainly outside the scope and looks wrong.
>>>>>>>>>> It should be done at some PCI topology level who exposed this
>>>>> devices.
>>>>>>>>> For our use case, virtio GPU is the provider, we are using virtio
>>>>>>>>> GPU to calculate distance between virtio GPU and dGPU, which are
>>>>>>>>> not arbitrary devices.
>>>>>>>> The fact that in physical world this distance is calculated
>>>>>>>> without involving
>>>>>>> the individual provider and client EP PCI devices, in virtual world
>>>>>>> doing this via virtual PCI EP devices does not look correct
>>>>>>> abstraction to
>>>>> me.
>>>>>>>
>>>>>>> In physical world, this distance is calculated by calling existing
>>>>>>> function pci_p2pdma_distance(), which will pass the the provider
>>>>>>> and client. I understand that you mean virtio GPU should be
>>>>>>> provider by default, but my point is that when I call
>>>>>>> pci_p2pdma_distance() to calculate distance between virtio GPU and
>>>>>>> dGPU, I still need to pass the
>>>>> RID of them to host kernel side.
>>>>>>> And because host kernel doesn't know which one GPU is the virtio
>>>>>>> GPU, this struct still needs to include provider information.
>>>>>>>
>>>>>>>> I tend to imagine that one needs to build a right virtual topology
>>>>>>>> in QEMU
>>>>>>> or somewhere else who is providing these PCI devices such that
>>>>>>> guest kernel can calculate it accurately without depending on the EP PCI
>>> devices.
>>>>>>>
>>>>>>> But the virtio GPU BDF in guest is different from host side, guest
>>>>>>> kernel cannot calculate the correct physical distance from guest side.
>>>>>> Sure virtio GPU BDF is likely vRID.
>>>>>> Guest should request the right distance calculator entity, possibly
>>>>>> the one
>>>>> that is providing the connectivity (and distance), which likely is
>>>>> pci RC/bridge/switch etc.
>>>>>
>>>>> Yeah, the problem here is pci RC/bridge/switch of guest is emulated,
>>>>> which is not going to provide the exact same connectivity as host does.
>>>>>
>>>> It is the role of the emulated entity to provide topology such that guest PCI
>>> driver can discover that (and not arbitrary virtio gpu driver).
>>>>
>>>>>>
>>>>>> Imagine a virtio PCI driver asking to to the virtio PCI device, how
>>>>>> far is it
>>>>> from the vCPU for NUMA awareness.
>>>>>> It seems wrong to me.
>>>>>
>>>>> It's not asking virtio PCI device to calculate the distance, it's
>>>>> asking virtio PCI device to pass the command.
>>>> Pass the command is same thing as virtio driver asking virtio pci device.
>>>> How virtio pci device implements is internal details not specific to the spec.
>>>>
>>>>> Calculating the distance is still going to request the right
>>>>> calculator entity, which is host PCI RC/bridge/switch.
>>>> Ok. so please have such hypercall in the PCI RC/bridge/switch domain. Not
>>> in the PCIe EP domain.
>>>>
>>>>>
>>>>>> Because it is unable to follow the same flow between virtual and
>>>>>> physical
>>>>> world.
>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> So
>>>>>>>>>>> when guest needs to calculate distance of any two PCI devices,
>>>>>>>>>>> it can reuse this command. But indeed, we only have this one
>>>>>>>>>>> use case at this moment, I guess considering virtio GPU as
>>>>>>>>>>> provider and remove provider PCI RID of this struct is fine.
>>>>>>>>>> Right. A driver can issue the command to device without
>>>>>>>>>> mentioning its own
>>>>>>>>> device identifier.
>>>>>>>>> Since calling function pci_p2pdma_distance() still needs to pass
>>>>>>>>> its own RID, I think at least QEMU has to tell host kernel the
>>>>>>>>> RID of virito
>>>>> GPU.
>>>>>>>> Probably yes, would be interesting to see this UAPI on the host
>>>>>>>> kernel too,
>>>>>>> (unrelated to this patch).
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But there's another problem: how does host kernel get virtio
>>>>>>>>>>> GPU RID and calculate distance between virtio GPU and dGPU? If
>>>>>>>>>>> I remove provider information of this struct, then QEMU still
>>>>>>>>>>> needs to pass virtio GPU RID to host kernel side.
>>>>>>>>>> The device implementation side can get to know about this as it
>>>>>>>>>> has plugged
>>>>>>>>> in and built the topology.
>>>>>>>>>> The RID we are discussing are the vRID seen by the guest driver.
>>>>>>>>>> I probably missed to clarify this.
>>>>>>>>> But only getting the distance on the virtual PCI inside the guest
>>>>>>>>> is not enough to figure out if P2P is possible or not. We need to
>>>>>>>>> get the real distance from host side. Please take a look at the
>>>>>>>>> discussion
>>>>> here:
>>>>>>>>>
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2
>>>>>>>>>
>>> Fl%2F&data=05%7C02%7Cparav%40nvidia.com%7C44fb34c8931c4aaedd330
>>> 8d
>>>>>>>>>
>>> d61df99f8%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C6387743
>>> 352
>>>>>>>>>
>>> 99455726%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYi
>>> OiIw
>>>>>>>>>
>>> LjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%
>>>>>>>>>
>>> 7C%7C%7C&sdata=VOWFlscDn%2FxG4%2FgT0Ge1M6IuZgisTYCqzSAUJo9AJU
>>> c%3D
>>>>>>>>> &reserved=0
>>>>>>>>>
>>>>>
>>> or%2F&data=05%7C02%7Cparav%40nvidia.com%7C9538a40c459942853759
>>> 0
>>>>> 8dd6
>>>>>>>>>
>>>>>
>>> 12df4ca%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C63877357
>>> 22
>>>>> 4449
>>>>>>>>>
>>>>>
>>> 6025%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwL
>>>>> jAuMD
>>>>>>>>>
>>>>>
>>> AwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C
>>>>> %7C
>>>>>>>>>
>>>>>
>>> &sdata=10ypT8BuukdQ1cGF6OY77wPdOo4H3owSPHoATgCX1lw%3D&reserve
>>>>> d=0
>>>>>>>>> e.kernel.org%2Fall%2Fbfe6a8b7-adcf-40e2-b7a2-
>>>>>>> 4e64dc96862d%40amd.com%2
>>>>>>>>>
>>>>>>>
>>>>>
>>> F&data=05%7C02%7Cparav%40nvidia.com%7Ccf5112991dc145cd786508dd6
>>> 0
>>>>>>> 6e772
>>>>>>>>>
>>>>>>>
>>>>>
>>> 9%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C63877274979663
>>> 7
>>>>>>> 911%7CU
>>>>>>>>>
>>>>>>>
>>>>>
>>> nknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAw
>>>>>>> MCIsIlAi
>>>>>>>>>
>>>>>>>
>>>>>
>>> OiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=
>>>>>>> 7wrTh
>>>>>>>>> 4PogFwvK6YzRgBQi2GeOus0CarX0%2Bu5DerCQWs%3D&reserved=0
>>>>>>>>
>>>>>>>> Finding out the distance between two PCI devices in physical world
>>>>>>>> is role of
>>>>>>> the PCI bridge/switch topology builder.
>>>>>>>> So why should it change for the guest VM?
>>>>>>>
>>>>>>> Because as discussed in above link, before letting virtio GPU and
>>>>>>> dGPU do P2P, we have to check if these two devices can communicate
>>>>>>> with each other by calculating the distance in physical world.
>>>>>> Ok, so that distance should be provided by the same way in physical
>>>>>> and
>>>>> virtual world.
>>>>>> The hypercall should be in PCI RC subsystem, not in virtio arbitrary device.
>>>>>>
>>>>>>> It's not just the role of PCI bridge/switch topology builder.
>>>>>> Well, I think it is. Because this is how the physical world able to provide.
>>>>>>
>>>>>>> Checking this
>>>>>>> distance is a necessary step before P2P like amdgpu does:
>>>>>>> amdgpu_dma_buf_attach
>>>>>>> ->pci_p2pdma_distance
>>>>>>>
>>>>>>>> Why virtualized topology cannot be build in a way such that it
>>>>>>>> matches to
>>>>>>> the physical world to get the accurate distance?
>>>>>>>> At least to me, doing this in the pci RP domain (and not PCI EP
>>>>>>>> domain)
>>>>>>> seems the right way to go forward.
>>>>>>>
>>>>>>> Because the BDF of virtio GPU and passthrough dGPU on virtual PCI
>>>>>>> bridge is different from the physical world, which makes guest
>>>>>>> kernel cannot get the accurate distance.
>>>>>> I suggested one option is QEMU to build right topology as that of
>>>>>> the
>>>>> physical world so that physical and virtual systems can run same code.
>>>>>
>>>>> Changing the way QEMU emulate PCI topology doesn't make any sense.
>>>> Why not? If the QEMU is enabling P2P communication, it needs to enable
>>> guest to discover the way PCI driver (not virtio) discovers in physical world.
>>>> When guest access the vCPUs the hypercall is in the vCPU domain. Vcpu's
>>> information does not come from virtio device.
>>>> Similarly, PCI topology info in my view should come from the PCI
>>> RC/Bridge/switch emulation side.
>>>> I probably repeated myself here, by providing more examples of drawing the
>>> parallels between physical and virtual world.
>>>
>>> 1. When amdgpu driver need to calculate p2pdma_distance, the PCI
>>> information come from amdgpu device instead of PCI RC/Bridge.
>>>
>> Do you mean amdgpu driver invokes a driver internal API to talk to amd device and it does not invoke pci_p2pdma_distance()?
>>
>> I have no comments from such driver as its outside the virtio scope.
>>    
>>> 2. In guest side, PCI topology info is from QEMU emulation, which is different
>>> from physical world. Do you mean QEMU should emulate exact same PCI
>>> topology as physical world so that we can get this distance from guest by just
>>> calling pci_p2pdma_distance()?
>> Yep.
>>
>>> Can you explain more details about how to let
>>> guest discover the way PCI driver discovers in physical world?
>>>
>> This is QEMU or VMM level implementation question on how to build the synthetic topology.
>> In libvirt and other places, there is a way to describe the PCI level BDF etc details.
>> You might want to check with the right owners on how to achieve or extend it.
> 
> Hi MST,
> As discussed above, Parav suggested to rebuild PCI topology in QEMU
> level to make virtual PCI topology 100% same as physical PCI topology so
> that we can get accurate p2pdma_distance from guest directly. Since you
> are one of the maintainers of QEMU, can you give us some suggestions or
> explain more details about how to achieve it?
> 
> Thanks,
> Julia

Hi MST,
Since Parav thinks I should rebuild PCI topology in QEMU to make it 100% 
same as physical PCI topology, which I think will be a big change for 
virtual PCI. Can you give us some suggestions please?

Thanks,
Julia
>>    
>>> Hi MST,
>>> Do you have any suggestions about how to get physical p2pdma_distance
>>> between virtio GPU and passthrough dGPU?
>>>
>>> Julia

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-19 10:45                                 ` Zhang, Julia
@ 2025-03-19 12:23                                   ` Michael S. Tsirkin
  2025-03-20 10:00                                     ` Zhang, Julia
  0 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2025-03-19 12:23 UTC (permalink / raw)
  To: Zhang, Julia
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev,
	Parav Pandit

On Wed, Mar 19, 2025 at 10:45:14AM +0000, Zhang, Julia wrote:
> >>> Can you explain more details about how to let
> >>> guest discover the way PCI driver discovers in physical world?
> >>>
> >> This is QEMU or VMM level implementation question on how to build the synthetic topology.
> >> In libvirt and other places, there is a way to describe the PCI level BDF etc details.
> >> You might want to check with the right owners on how to achieve or extend it.
> > 
> > Hi MST,
> > As discussed above, Parav suggested to rebuild PCI topology in QEMU
> > level to make virtual PCI topology 100% same as physical PCI topology so
> > that we can get accurate p2pdma_distance from guest directly. Since you
> > are one of the maintainers of QEMU, can you give us some suggestions or
> > explain more details about how to achieve it?
> > 
> > Thanks,
> > Julia
> 
> Hi MST,
> Since Parav thinks I should rebuild PCI topology in QEMU to make it 100% 
> same as physical PCI topology, which I think will be a big change for 
> virtual PCI. Can you give us some suggestions please?
> 
> Thanks,
> Julia

Hi Julia,
I feel the issue is not "100% matching" or the specific interface, but
that the motivation for the patches is not documented sufficiently
clearly.

It is true that many hypervisor vendors have a preference to utilizing native
interfaces, since this reduces the amount of guest code to maintain.
Remember, that guest code has to be implemented for every OS.
Further, duplicating e.g. ACPI information at the device level
means that baremetal and VM guest behaviour use different drivers,
which increases support and testing load.

I also note that the admin command infrastructure was designed with
use-cases where one device describes other devices.

Having said all that, it's not a religious argument, and might be reasonable
e.g. if the infrastructure is reusable to address a lot of other
problems.

Generally, one way to move things forward would be to show an example: 
- a baremetal instance which has the issue
- how is the distance discovered on baremetal
- how would a minimal emulation of baremetal look like
- why is this hard


Hope this helps,
-- 
MST


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/1] virtio-gpu: get p2pdma distance
  2025-03-19 12:23                                   ` Michael S. Tsirkin
@ 2025-03-20 10:00                                     ` Zhang, Julia
  0 siblings, 0 replies; 21+ messages in thread
From: Zhang, Julia @ 2025-03-20 10:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Chen, Jiqian, Zhu, Lingshan, Huang, Ray,
	Matias Ezequiel Vara Larsen, virtio-comment@lists.linux.dev,
	Parav Pandit, Zhang, Julia



On 2025/3/19 20:23, Michael S. Tsirkin wrote:
> On Wed, Mar 19, 2025 at 10:45:14AM +0000, Zhang, Julia wrote:
>>>>> Can you explain more details about how to let
>>>>> guest discover the way PCI driver discovers in physical world?
>>>>>
>>>> This is QEMU or VMM level implementation question on how to build the synthetic topology.
>>>> In libvirt and other places, there is a way to describe the PCI level BDF etc details.
>>>> You might want to check with the right owners on how to achieve or extend it.
>>>
>>> Hi MST,
>>> As discussed above, Parav suggested to rebuild PCI topology in QEMU
>>> level to make virtual PCI topology 100% same as physical PCI topology so
>>> that we can get accurate p2pdma_distance from guest directly. Since you
>>> are one of the maintainers of QEMU, can you give us some suggestions or
>>> explain more details about how to achieve it?
>>>
>>> Thanks,
>>> Julia
>>
>> Hi MST,
>> Since Parav thinks I should rebuild PCI topology in QEMU to make it 100%
>> same as physical PCI topology, which I think will be a big change for
>> virtual PCI. Can you give us some suggestions please?
>>
>> Thanks,
>> Julia

Hi MST,
Thanks for the reply.

> 
> Hi Julia,
> I feel the issue is not "100% matching" or the specific interface, but
> that the motivation for the patches is not documented sufficiently
> clearly.

Please let me explain the patches again.

We are trying to let virtio iGPU import passthrough dGPU buffer to 
implement dGPU prime feature.

Checking if p2p is possible or not is suggested by kernel maintainers . 
They think that all importers should check the p2p_dma_distance before 
p2p to make sure two physical PCI devices can communicate with each other.

But inside a VM we can't see the physical devices. In our use 
case(virtio iGPU + passthrough dGPU), I can only see passed through 
devices and virtual devices with virtual BDF and a bunch of virtual bridges.

So calling pci_p2pdma_distance() like baremetal does to get 
p2p_dma_distance inside the VM is meaningless. We have to find way to 
get the real physical p2p_dma_distance from host side.

My design is to send a command from guest to host to pass the virtual 
BDF of guest devices and get real physical distance from host side and 
return the distance to guest side.

> 
> It is true that many hypervisor vendors have a preference to utilizing native
> interfaces, since this reduces the amount of guest code to maintain.
> Remember, that guest code has to be implemented for every OS.
> Further, duplicating e.g. ACPI information at the device level
> means that baremetal and VM guest behaviour use different drivers,
> which increases support and testing load.
> 
> I also note that the admin command infrastructure was designed with
> use-cases where one device describes other devices. >
> Having said all that, it's not a religious argument, and might be reasonable
> e.g. if the infrastructure is reusable to address a lot of other
> problems.

I think command queue is also designed as a common infrastructure to 
handle commands from guest. And no matter how I send the command to get 
the distance, the main purpose is to get the distance from host right? 
Then I always need to add a struct in spec to pass BDF of two devices 
and then call pci_p2pdma_distance in host to get distance. I don't 
understand what kind of change should I do for this patch. Can you 
explain more details?

> 
> Generally, one way to move things forward would be to show an example:
> - a baremetal instance which has the issue
> - how is the distance discovered on baremetal
> - how would a minimal emulation of baremetal look like
> - why is this hard

- amdgpu call pci_p2pdma_distance before import:
   amdgpu_dma_buf_attach
       ->pci_p2pdma_distance

- linux/drivers/pci/p2pdma.c file explain how to calculate the distance:
/*
[...]
  * If the two devices are the same PCI function, return
  * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 0.
  *
  * If they are two functions of the same device, return
  * PCI_P2PDMA_MAP_BUS_ADDR and a distance of 2 (one hop up to the bridge,
  * then one hop back down to another function of the same device).
  *
  * In the case where two devices are connected to the same PCIe switch,
  * return a distance of 4. This corresponds to the following PCI tree:
  *
  *     -+  Root Port
  *      \+ Switch Upstream Port
  *       +-+ Switch Downstream Port 0
  *       + \- Device A
  *       \-+ Switch Downstream Port 1
  *         \- Device B
  *
  * The distance is 4 because we traverse from Device A to Downstream Port 0
  * to the common Switch Upstream Port, back down to Downstream Port 1 and
  * then to Device B. The mapping type returned depends on the ACS
  * redirection setting of the ports along the path.
[...]
  */

Thanks,
Julia
> 
> 
> Hope this helps,


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-03-20 10:00 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-26  9:27 [PATCH v2 1/1] virtio-gpu: get p2pdma distance Julia Zhang
2025-02-26 10:44 ` Matias Ezequiel Vara Larsen
2025-02-26 10:51 ` Parav Pandit
2025-03-05 10:41   ` Zhang, Julia
2025-03-06  4:29     ` Parav Pandit
2025-03-06  7:16       ` Zhang, Julia
2025-03-10  3:48         ` Parav Pandit
2025-03-10  7:37           ` Zhang, Julia
2025-03-10 10:18             ` Parav Pandit
2025-03-11  3:27               ` Zhang, Julia
2025-03-11  4:00                 ` Parav Pandit
2025-03-11  7:29                   ` Zhang, Julia
2025-03-11 15:14                     ` Parav Pandit
2025-03-12  6:20                       ` Zhang, Julia
2025-03-12 10:11                         ` Parav Pandit
2025-03-13  3:31                           ` Zhang, Julia
2025-03-14  2:15                             ` Parav Pandit
2025-03-14  3:02                               ` Zhang, Julia
2025-03-19 10:45                                 ` Zhang, Julia
2025-03-19 12:23                                   ` Michael S. Tsirkin
2025-03-20 10:00                                     ` Zhang, Julia

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.