From: Zhu Lingshan <lingshan.zhu@amd.com>
To: Parav Pandit <parav@nvidia.com>,
"mst@redhat.com" <mst@redhat.com>,
"cohuck@redhat.com" <cohuck@redhat.com>,
"jasowang@redhat.com" <jasowang@redhat.com>
Cc: "virtio-comment@lists.linux.dev" <virtio-comment@lists.linux.dev>,
"Eugenio Pérez" <eperezma@redhat.com>,
"David Stevens" <stevensd@chromium.org>
Subject: Re: [PATCH V7 v7] virtio: introduce SUSPEND bit in device status
Date: Thu, 15 Aug 2024 16:23:26 +0800 [thread overview]
Message-ID: <88a087ff-676f-4616-a573-a765ca77bef7@amd.com> (raw)
In-Reply-To: <IA0PR12MB87138A02C6F9F6282EAD8236DC862@IA0PR12MB8713.namprd12.prod.outlook.com>
On 8/13/2024 2:55 PM, Parav Pandit wrote:
>
>> From: Zhu Lingshan <lingshan.zhu@amd.com>
>> Sent: Tuesday, August 13, 2024 11:44 AM
>>
> I removed the unreachable intel email id as every single email is bouncing from it.
> Please consider dropping that email from v8 as it will cause all reviewers email to bounce.
>
>> On 8/13/2024 1:50 PM, Parav Pandit wrote:
>>>> From: Zhu Lingshan <lingshan.zhu@amd.com>
>>>> Sent: Tuesday, August 13, 2024 11:15 AM
>>>>
>>>> On 8/13/2024 12:42 PM, Parav Pandit wrote:
>>>>> Hi Lingshan, David,
>>>>>
>>>>>> From: Zhu Lingshan <lingshan.zhu@amd.com>
>>>>>> Sent: Thursday, August 1, 2024 5:05 PM
>>>>>>
>>>>>> This commit allows the driver to suspend the device by introducing
>>>>>> a new status bit SUSPEND in device_status.
>>>>>>
>>>>>> This commit also introduce a new feature bit VIRTIO_F_SUSPEND which
>>>>>> indicating whether the device support SUSPEND.
>>>>>>
>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
>>>>>> Signed-off-by: David Stevens <stevensd@chromium.org>
>>>>>> ---
>>>>>> content.tex | 75
>> ++++++++++++++++++++++++++++++++++++++++++++++-
>>>> ----
>>>>>> --
>>>>>> 1 file changed, 65 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> Changes from V6:
>>>>>> - the device should hold its config interrupt while SUSPEND, and
>>>>>> send config interrupt when the SUSPEND bit is cleared.
>>>>>> - while SUSPEND, the driver MUST NOT access Device Configuration
>>>>>> Space
>>>>>> - minor changes.
>>>>>>
>>>>>> Changes from V5:
>>>>>> - the device should present NEEDS_RESET if failed to suspend
>>>>>> - allow the driver access device status in the config space when
>>>>>> suspended if it is implemented in config space.
>>>>>> - language improvements
>>>>>>
>>>>>> Changes from V4:
>>>>>> - re-order the device status bits section
>>>>>> - kick vqs --> notify vqs
>>>>>>
>>>>>> Changes from V3:
>>>>>> - allow the driver clearing the SUSPEND bit to resume the device.
>>>>>> - disallow access to config space while suspended.
>>>>>>
>>>>>> diff --git a/content.tex b/content.tex index 0a62dce..2d1bee8
>>>>>> 100644
>>>>>> --- a/content.tex
>>>>>> +++ b/content.tex
>>>>>> @@ -36,19 +36,22 @@ \section{\field{Device Status}
>>>>>> Field}\label{sec:Basic Facilities of a Virtio Dev
>>>>>> this bit. For example, under Linux, drivers can be loadable modules.
>>>>>> \end{note}
>>>>>>
>>>>>> -\item[FAILED (128)] Indicates that something went wrong in the
>>>>>> guest,
>>>>>> - and it has given up on the device. This could be an internal
>>>>>> - error, or the driver didn't like the device for some reason, or
>>>>>> - even a fatal error during device operation.
>>>>>> +\item[DRIVER_OK (4)] Indicates that the driver is set up and ready
>>>>>> +to
>>>>>> + drive the device.
>>>>>>
>>>>>> \item[FEATURES_OK (8)] Indicates that the driver has acknowledged
>>>>>> all
>>>> the
>>>>>> features it understands, and feature negotiation is complete.
>>>>>>
>>>>>> -\item[DRIVER_OK (4)] Indicates that the driver is set up and ready
>>>>>> to
>>>>>> - drive the device.
>>>>>> +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated,
>> indicates
>>>>>> +that the
>>>>>> + device has been suspended by the driver.
>>>>>>
>>>>>> \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has
>>>> experienced
>>>>>> an error from which it can't recover.
>>>>>> +
>>>>>> +\item[FAILED (128)] Indicates that something went wrong in the
>>>>>> +guest,
>>>>>> + and it has given up on the device. This could be an internal
>>>>>> + error, or the driver didn't like the device for some reason, or
>>>>>> + even a fatal error during device operation.
>>>>>> \end{description}
>>>>>>
>>>>>> The \field{device status} field starts out as 0, and is
>>>>>> reinitialized to 0 by @@ -
>>>>>> 60,8 +63,9 @@ \section{\field{Device Status} Field}\label{sec:Basic
>>>>>> Facilities of a Virtio Dev initialization sequence specified in
>>>>>> \ref{sec:General Initialization And Device Operation / Device
>>>> Initialization}.
>>>>>> -The driver MUST NOT clear a
>>>>>> -\field{device status} bit. If the driver sets the FAILED bit,
>>>>>> +The driver MUST NOT clear a \field{device status} bit other than
>>>>>> +SUSPEND except when setting \field{device status} to 0 as a
>>>>>> +transport-specific way to initiate a reset. If the driver sets the
>>>>>> +FAILED bit,
>>>>>> the driver MUST later reset the device before attempting to re-
>> initialize.
>>>>>> The driver SHOULD NOT rely on completion of operations of a @@
>>>>>> -99,10
>>>>>> +103,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a
>>>>>> +Virtio Device
>>>>>> / Feature B \begin{description}
>>>>>> \item[0 to 23, and 50 to 127] Feature bits for the specific device
>>>>>> type
>>>>>>
>>>>>> -\item[24 to 41] Feature bits reserved for extensions to the queue
>>>>>> and
>>>>>> +\item[24 to 42] Feature bits reserved for extensions to the queue
>>>>>> +and
>>>>>> feature negotiation mechanisms
>>>>>>
>>>>>> -\item[42 to 49, and 128 and above] Feature bits reserved for
>>>>>> future extensions.
>>>>>> +\item[43 to 49, and 128 and above] Feature bits reserved for
>>>>>> +future
>>>>>> extensions.
>>>>>> \end{description}
>>>>>>
>>>>>> \begin{note}
>>>>>> @@ -629,6 +633,53 @@ \section{Device Cleanup}\label{sec:General
>>>>>> Initialization And Device Operation /
>>>>>>
>>>>>> Thus a driver MUST ensure a virtqueue isn't live (by device reset)
>>>>>> before removing exposed buffers.
>>>>>>
>>>>>> +\section{Device Suspend}\label{sec:General Initialization And
>>>>>> +Device Operation / Device Suspend}
>>>>>> +
>>>>>> +When VIRTIO_F_SUSPEND is negotiated, the driver can set the
>>>>>> +SUSPEND bit in \field{device status} to suspend a device, and can
>>>>>> +clear the SUSPEND bit to resume a suspended device.
>>>>>> +
>>>>>> +\drivernormative{\subsection}{Device Suspend}{General
>>>>>> +Initialization And Device Operation / Device Suspend}
>>>>>> +
>>>>>> +The driver MUST NOT set SUSPEND if FEATURES_OK is not set or
>>>>>> VIRTIO_F_SUSPEND is not negotiated.
>>>>>> +
>>>>>> +Once the driver sets SUSPEND to \field{device status} of the device:
>>>>>> +\begin{itemize}
>>>>>> +\item The driver MUST re-read \field{device status} to verify
>>>>>> +whether the
>>>>>> SUSPEND bit is set.
>>>>>> +\item The driver MUST NOT make any more buffers available to the
>>>> device.
>>>>>> +\item The driver MUST NOT access any virtqueues or send
>>>>>> +notifications for
>>>>>> any virtqueues.
>>>>>> +\item The driver MUST NOT access Device Configuration Space.
>>>>>> +\end{itemize}
>>>>>> +
>>>> Hi Parva
>>>>> Do we agree that
>>>>> a. suspending a device is non frequent operation (in order of N
>>>> operations/sec, where N is roughly in range of 10 or 100) per device?
>>>> Ideally it should not be often in normal operations, but remember we
>>>> can not restrict the behaviors of the driver, so we must be able to
>>>> handle the scenario in which SUSPENDING is often.
>>> Sure. the intent is slow rate, but one can do at unexpected times.
>>> Do you agree?
>> I think we don't have an intention of the frequency in the spec.
> Sure.
>
>> The spec only provides generic mechanisms and interfaces.
> Sure.
>
>> Don't assume it(or driver wants it to be) would be often or not, that depends
>> on the driver.
> As you rightly said : it cannot be assumed.
> The driver will read the device status right after it wrote it. This typically is < 50nsec of time.
> The suspend operation for a net device to store hundreds of queues, RSS table, flow filters, takes plenty of time (at least more than 50nsec :) ).
> Similarly for the GPU to store some MBs of memory takes more than 50nsec of time, for example to store in a file for a software-based GPU device.
> So a device cannot respond back suspend=true in next 50nsec time.
It's OK for the device to take longer time to respond, the driver simply re-reads device status.
>
> More below.
>
>>>>> b. A software-based device may not always want to force VM_EXIT on
>>>>> read
>>>> and write on the device_status register?
>>>> Trap and Emulation is the basic of virtualization, and how to
>>>> pass-through a device is out of this spec.
>>>>
>>> Sure, I didn’t suggest to put such things in the spec.
>>> My question is, whether to trap and emulate or not is a choice of the
>> software.
>>> Do you agree?
>> The device emulator does not know anything about whether trapped or not.
>> Trapping and Emulation is a hypervisor thing.
>>
>> If here "software" refers to the device emulator, then Yes, it is not the
>> emulator's decision. And the device should not be aware of VM_EXIT &
>> VM_ENTRY.
>>
> Right. So a software wants to implement device_status as pure MMIO writes. (and not VM_EXIT).
This is not always true, there can be VM_EXIT of pure emulated devices. The HW registers are sensitive resource
and any access to them need to be trapped and emulated.
> And prefer to returning SUSPEND=true at slow pace.
> This means, the device implementation cannot immediately return suspend=true right after it was written.
> A MMIO read will read it back, as suspend=true.
>
> An alternative would be, to forward CPU loads and CPU stores to different address.
> However, this does not work for the hw based devices.
>
> That means, PCI HW needs to return suspend=0, until the device is not suspended.
> In this example, the device cannot build special circuitry to answer suspend=true within 50nsec, or in other words building special circuitry to return suspend=false is too complex for the slow operation.
why? The device can just not to change the value of the SUSPEND bit before it has fully suspended.
>
> If this understanding of burden is clear,
>
> The proposal is, can you please extend the interface such that,
>
> 1. driver writes suspend command.
> 2. driver reads suspend_status, and receives not_completed=(false). This is the default value.
> 3. When the device completes suspend, it changes the polarity of suspend_status=true.
>
> This has two main benefits:
> [A] This will enable software-based devices to write data to slow files and does not have to force VM_EXITs.
>
> [B] It also enables hw based devices to not build special circuitry to answer within 50nsec, which can get very complicated for tens or hundreds of PCI PFs.
I think we have already discussed on this before in V5, and Jason has some insightful comments
https://lore.kernel.org/virtio-comment/20240612082055-mutt-send-email-mst@kernel.org/T/#mc817fc6ca12ff0bcbae62b43b6146a177ecf13a9
>
>> this is out of the spec anyway.
>>
>> Thanks
>> Zhu Lingshan
>>>
>>>> Thanks
>>>> Zhu Lingshan
>>>>>> +\devicenormative{\subsection}{Device Suspend}{General
>>>>>> +Initialization And Device Operation / Device Suspend}
>>>>>> +
>>>>>> +The device MUST ignore SUSPEND if FEATURES_OK is not set or
>>>>>> VIRTIO_F_SUSPEND is not negotiated.
>>>>>> +
>>>>>> +The device MUST ignore all access to its Configuration Space while
>>>>>> +suspended, except for \field{device status} if it is part of the
>>>>>> +Configuration
>>>>>> Space.
>>>>>> +
>>>>>> +A device MUST NOT send any notifications for any virtqeuues,
>>>>>> +access any virtqueues, or modify any fields in its Configuration
>>>>>> +Space while suspended.
>>>>>> +
>>>>>> +If changes occur in the Configuration Space while the SUSPEND bit
>>>>>> +is set, the device MUST NOT send any configuration change
>> notifications.
>>>>>> +Instead, the device MUST send the notification after the SUSPEND
>>>>>> +bit has
>>>>>> been cleared.
>>>>>> +
>>>>>> +When the driver sets SUSPEND, the device MUST either suspend
>>>>>> +itself or set
>>>>>> DEVICE_NEEDS_RESET if failed to suspend.
>>>>>> +
>>>>>> +If SUSPEND is set in \field{device status}, when the driver clears
>>>>>> +SUSPEND, the device MUST either resume normal operation or set
>>>>>> DEVICE_NEEDS_RESET.
>>>>>> +
>>>>>> +When the driver sets SUSPEND,
>>>>>> +the device SHOULD perform the following actions before presenting
>>>>>> +that
>>>>>> the SUSPEND bit is set to 1 in the \field{device status}:
>>>>>> +
>>>>>> +\begin{itemize}
>>>>>> +\item Stop processing more buffers of any virtqueues \item Wait
>>>>>> +until all buffers that are being processed have been used.
>>>>>> +\item Send used buffer notifications to the driver.
>>>>>> +\end{itemize}
>>>>>> +
>>>>>> \chapter{Virtio Transport Options}\label{sec:Virtio Transport
>>>>>> Options}
>>>>>>
>>>>>> Virtio can use various different buses, thus the standard is split
>>>>>> @@ -872,6
>>>>>> +923,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved
>>>>>> +Feature
>>>>>> Bits}
>>>>>> \ref{devicenormative:Basic Facilities of a Virtio Device /
>>>>>> Feature Bits} for
>>>>>> handling features reserved for future use.
>>>>>>
>>>>>> + \item[VIRTIO_F_SUSPEND(42)] This feature indicates that the driver
>> can
>>>>>> + trigger suspending the device via the SUSPEND flag
>>>>>> + See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
>>>>>> +
>>>>>> \end{description}
>>>>>>
>>>>>> \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature
>>>>>> Bits}
>>>>>> --
>>>>>> 2.45.2
>>>>>>
next prev parent reply other threads:[~2024-08-15 8:23 UTC|newest]
Thread overview: 69+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-08-01 11:35 [PATCH V7 v7] virtio: introduce SUSPEND bit in device status Zhu Lingshan
2024-08-13 4:42 ` Parav Pandit
2024-08-13 5:44 ` Zhu Lingshan
2024-08-13 5:50 ` Parav Pandit
2024-08-13 6:14 ` Zhu Lingshan
2024-08-13 6:55 ` Parav Pandit
2024-08-15 8:23 ` Zhu Lingshan [this message]
2024-08-15 9:34 ` Parav Pandit
2024-08-30 2:31 ` Zhu Lingshan
2024-08-30 3:02 ` Parav Pandit
2024-09-03 9:05 ` Zhu Lingshan
2024-09-03 9:45 ` Michael S. Tsirkin
2024-09-03 10:09 ` Parav Pandit
2024-09-03 10:35 ` Michael S. Tsirkin
2024-09-03 10:37 ` Michael S. Tsirkin
2024-09-04 3:07 ` Jason Wang
2024-09-04 4:02 ` Michael S. Tsirkin
2024-09-04 6:31 ` Jason Wang
2024-09-04 6:38 ` Zhu Lingshan
2024-09-04 6:46 ` Parav Pandit
2024-09-05 7:14 ` Zhu Lingshan
2024-09-05 7:16 ` Parav Pandit
2024-09-05 7:29 ` Zhu Lingshan
2024-09-05 7:35 ` Parav Pandit
2024-09-05 8:30 ` Zhu Lingshan
2024-09-05 8:41 ` David Stevens
2024-09-06 1:53 ` Parav Pandit
2024-09-05 7:17 ` Michael S. Tsirkin
2024-09-05 7:31 ` Zhu Lingshan
2024-09-05 7:34 ` Parav Pandit
2024-09-05 6:51 ` Michael S. Tsirkin
2024-09-05 7:12 ` Zhu Lingshan
2024-09-05 8:12 ` Michael S. Tsirkin
2024-09-05 9:09 ` Zhu Lingshan
2024-09-06 1:54 ` Parav Pandit
2024-09-05 23:51 ` Jason Wang
2024-09-11 3:52 ` Zhu Lingshan
2024-09-11 10:20 ` Michael S. Tsirkin
2024-09-12 2:05 ` Jason Wang
2024-09-12 5:44 ` Michael S. Tsirkin
2024-09-24 7:35 ` Jason Wang
2024-09-24 23:05 ` Michael S. Tsirkin
2024-09-25 3:47 ` Jason Wang
2024-09-25 11:17 ` Michael S. Tsirkin
2024-09-27 4:08 ` Jason Wang
2024-09-29 17:55 ` Michael S. Tsirkin
2024-10-17 6:56 ` Jason Wang
2024-09-03 10:28 ` Parav Pandit
2024-09-05 7:20 ` Zhu Lingshan
2024-08-15 10:45 ` Michael S. Tsirkin
2024-08-30 2:32 ` Zhu Lingshan
2024-08-15 10:52 ` Michael S. Tsirkin
2024-08-15 10:59 ` Parav Pandit
2024-08-15 15:07 ` Michael S. Tsirkin
2024-08-17 5:19 ` Parav Pandit
2024-08-30 2:37 ` Zhu Lingshan
2024-08-30 3:10 ` Parav Pandit
2024-09-03 8:51 ` Zhu Lingshan
2024-09-03 8:55 ` Parav Pandit
2024-09-03 9:36 ` Michael S. Tsirkin
2024-09-05 7:27 ` Zhu Lingshan
2024-09-24 23:07 ` Michael S. Tsirkin
2024-08-13 7:51 ` Michael S. Tsirkin
2024-08-13 7:58 ` Parav Pandit
2024-08-13 8:03 ` Michael S. Tsirkin
2024-08-13 8:01 ` Michael S. Tsirkin
2024-08-15 9:12 ` Zhu Lingshan
2024-08-15 10:50 ` Michael S. Tsirkin
2024-08-30 2:20 ` Zhu Lingshan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=88a087ff-676f-4616-a573-a765ca77bef7@amd.com \
--to=lingshan.zhu@amd.com \
--cc=cohuck@redhat.com \
--cc=eperezma@redhat.com \
--cc=jasowang@redhat.com \
--cc=mst@redhat.com \
--cc=parav@nvidia.com \
--cc=stevensd@chromium.org \
--cc=virtio-comment@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox