Discussion of the implementations of VIRTIO specification
 help / color / mirror / Atom feed
* [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes features to specification
@ 2018-03-06  2:50 Changpeng Liu
  2018-03-07 11:28 ` Stefan Hajnoczi
       [not found] ` <bc8d88a4-3069-188a-99aa-c1c75a6e13b2@redhat.com>
  0 siblings, 2 replies; 6+ messages in thread
From: Changpeng Liu @ 2018-03-06  2:50 UTC (permalink / raw)
  To: virtio-dev
  Cc: virtio-comment, changpeng.liu, cavery, stefanha, pbonzini,
	pawelx.wodkowski, james.r.harris

Existing virtio-blk protocol doesn't have DISCARD/WRITE ZEROES support,
this will impact the performance when using SSD backend over file systems.

Here is the proposal to extend existing virtio-blk protocol to support
DISCARD/WRITE ZEROES commands.

Basic idea here is using 16 Bytes payload to support 1 descriptor, users
can put several segments together with 1 DISCARD/WRITE ZEROES command.

struct virtio_blk_discard_write_zeroes {
       le64 sector;
       le32 num_sectors;
       struct {
               le32 unmap:1;
               le32 reserved:31;
       } flags;
};

For the purpose to support such feature, we need to introduce 2 new feature
flags: VIRTIO_BLK_F_DISCARD/VIRTIO_BLK_F_WRITE_ZEROES, and 2 new command
types: VIRTIO_BLK_T_DISCARD/VIRTIO_BLK_T_WRITE_ZEROES. Also we introduce
several new parameters in the configuration space of virtio-blk:
max_discard_sectors/max_discard_seg/max_write_zeroes_sectors.
These parameters will tell the OS what's the granularity when
issuing such commands.

If both DISCARD and WRITE ZEROES are supported, unmap flag bit maybe used
for WRITE ZEROES command with DISCARD bit enabled.

Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
---
 content.tex | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 81 insertions(+), 3 deletions(-)

diff --git a/content.tex b/content.tex
index c7ef7fd..c4b3190 100644
--- a/content.tex
+++ b/content.tex
@@ -4127,6 +4127,14 @@ device except where noted.
 
 \item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback
     and writethrough modes.
+
+\item[VIRTIO_BLK_F_DISCARD (13)] Device can support discard command, maximum
+    discard sectors size in \field{max_discard_sectors} and maximum discard
+    segment number in \field{max_discard_seg}.
+
+\item[VIRTIO_BLK_F_WRITE_ZEROES (14)] Device can support write zeroes command,
+     maximum write zeroes sectors size in \field{max_write_zeroes_sectors} and
+     maximum write zeroes segment number in \field{max_write_zeroes_seg}.
 \end{description}
 
 \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits}
@@ -4148,6 +4156,12 @@ The \field{capacity} of the device (expressed in 512-byte sectors) is always
 present. The availability of the others all depend on various feature
 bits as indicated above.
 
+The parameters in the configuration space of the device \field{max_discard_sectors}
+\field{discard_sector_alignment} are expressed in 512-bytes sectors if the
+VIRTIO_BLK_F_DISCARD feature bit is negotiated. The \field{max_write_zeroes_sectors}
+is expressed in 512-byte sectors if the VIRTIO_BLK_F_WRITE_ZEROES feature
+bit is negotiated.
+
 \begin{lstlisting}
 struct virtio_blk_config {
         le64 capacity;
@@ -4170,6 +4184,14 @@ struct virtio_blk_config {
                 le32 opt_io_size;
         } topology;
         u8 writeback;
+        u8 unused0[3];
+        le32 max_discard_sectors;
+        le32 max_discard_seg;
+        le32 discard_sector_alignment;
+        le32 max_write_zeroes_sectors;
+        le32 max_write_zeroes_seg;
+        u8 write_zeroes_may_unmap;
+        u8 unused1[3];
 };
 \end{lstlisting}
 
@@ -4211,6 +4233,17 @@ according to the native endian of the guest rather than
   after reset can be either writeback or writethrough.  The actual
   mode can be determined by reading \field{writeback} after feature
   negotiation.
+
+\item If the VIRTIO_BLK_F_DISCARD feature is negotiated,
+    \field{max_discard_sectors} and \field{max_discard_seg} can be read
+    to determine the maximum discard sectors and maximum number of discard
+    segments for the block driver to use. \field{discard_sector_alignment}
+    can be used by OS when splitting a request based on alignment.
+
+\item if the VIRTIO_BLK_F_WRITE_ZEROES feature is negotiated,
+    \field{max_write_zeroes_sectors} and \field{max_write_zeroes_seg} can
+    be read to determine the maximum write zeroes sectors and maximum
+    number of write zeroes segments for the block driver to use.
 \end{enumerate}
 
 \drivernormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization}
@@ -4234,6 +4267,9 @@ if they offer VIRTIO_BLK_F_CONFIG_WCE.
 If VIRTIO_BLK_F_CONFIG_WCE is negotiated but VIRTIO_BLK_F_FLUSH
 is not, the device MUST initialize \field{writeback} to 0.
 
+The device MUST initialize padding bytes \field{unused0} and
+\field{unused1} to 0.
+
 \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization}
 
 Because legacy devices do not have FEATURES_OK, transitional devices
@@ -4270,20 +4306,38 @@ struct virtio_blk_req {
         u8 data[][512];
         u8 status;
 };
+
+struct virtio_blk_discard_write_zeroes {
+       le64 sector;
+       le32 num_sectors;
+       struct {
+               le32 unmap:1;
+               le32 reserved:31;
+       } flags;
+};
 \end{lstlisting}
 
 The type of the request is either a read (VIRTIO_BLK_T_IN), a write
-(VIRTIO_BLK_T_OUT), or a flush (VIRTIO_BLK_T_FLUSH).
+(VIRTIO_BLK_T_OUT), a discard (VIRTIO_BLK_T_DISCARD), a write zeroes
+(VIRTIO_BLK_T_WRITE_ZEROES) or a flush (VIRTIO_BLK_T_FLUSH).
 
 \begin{lstlisting}
 #define VIRTIO_BLK_T_IN           0
 #define VIRTIO_BLK_T_OUT          1
 #define VIRTIO_BLK_T_FLUSH        4
+#define VIRTIO_BLK_T_DISCARD      11
+#define VIRTIO_BLK_T_WRITE_ZEROES 13
 \end{lstlisting}
 
 The \field{sector} number indicates the offset (multiplied by 512) where
-the read or write is to occur. This field is unused and set to 0
-for scsi packet commands and for flush commands.
+the read or write is to occur. This field is unused and set to 0 for
+commands other than read or write.
+
+The \field{data} used for discard or write zeroes command is described
+by one or more virtio_blk_discard_write_zeroes structs. \field{sector}
+indicates the starting offset (multiplied by 512) of the segment, while
+\field{num_sectors} indicates the number of sectors in each discarded
+range. \field{unmap} is only used for write zeroes command.
 
 The final \field{status} byte is written by the device: either
 VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for device or driver
@@ -4311,12 +4365,36 @@ switch to writethrough or writeback mode by writing respectively 0 and
 the driver MUST NOT assume that any volatile writes have been committed
 to persistent device backend storage.
 
+The \field{unmap} bit MUST be zero for discard commands.  The driver
+MUST NOT assume anything about the data returned by read requests after
+a range of sectors has been discarded.
+
 \devicenormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation}
 
 A device MUST set the \field{status} byte to VIRTIO_BLK_S_IOERR
 for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT
 write any data.
 
+The device MUST set the \field{status} byte to VIRTIO_BLK_S_UNSUPP for
+discard and write zeroes commands if any unknown flag is set.
+Furthermore, the device MUST set the \field{status} byte to
+VIRTIO_BLK_S_UNSUPP for discard commands if the \field{unmap} flag is set.
+
+For discard commands, the device MAY deallocate the specified range of
+sectors in the device backend storage.
+
+For write zeroes commands, if the \field{unmap} is set, the device MAY
+deallocate the specified range of sectors in the device backend storage,
+as if the DISCARD command had been sent.  After a write zeroes command
+is completed, reads of the specified ranges of sectors MUST return
+zeroes.  This is true independent of whether \field{unmap} was set or clear.
+
+The device SHOULD clear the \field{write_zeroes_may_unmap} field of the
+virtio configuration space if and only if a write zeroes request cannot
+result in deallocating one or more sectors.  The device MAY change the
+content of the field during operation of the device; when this happens,
+the device SHOULD trigger a configuration change interrupt.
+
 A write is considered volatile when it is submitted; the contents of
 sectors covered by a volatile write are undefined in persistent device
 backend storage until the write becomes stable.  A write becomes stable
-- 
1.9.3


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes features to specification
  2018-03-06  2:50 [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes features to specification Changpeng Liu
@ 2018-03-07 11:28 ` Stefan Hajnoczi
  2018-03-08  1:01   ` Liu, Changpeng
       [not found] ` <bc8d88a4-3069-188a-99aa-c1c75a6e13b2@redhat.com>
  1 sibling, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2018-03-07 11:28 UTC (permalink / raw)
  To: Changpeng Liu
  Cc: virtio-dev, virtio-comment, cavery, pbonzini, pawelx.wodkowski,
	james.r.harris

[-- Attachment #1: Type: text/plain, Size: 355 bytes --]

On Tue, Mar 06, 2018 at 10:50:20AM +0800, Changpeng Liu wrote:
> +The \field{data} used for discard or write zeroes command is described
> +by one or more virtio_blk_discard_write_zeroes structs. \field{sector}
> +indicates the starting offset (multiplied by 512) of the segment, while

Did you mean "divided by 512 bytes" instead of "multiplied by 512"?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes features to specification
  2018-03-07 11:28 ` Stefan Hajnoczi
@ 2018-03-08  1:01   ` Liu, Changpeng
  2018-03-08 13:04     ` Paolo Bonzini
  0 siblings, 1 reply; 6+ messages in thread
From: Liu, Changpeng @ 2018-03-08  1:01 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-dev@lists.oasis-open.org,
	virtio-comment@lists.oasis-open.org, cavery@redhat.com,
	pbonzini@redhat.com, Wodkowski, PawelX, Harris, James R



> -----Original Message-----
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org] On
> Behalf Of Stefan Hajnoczi
> Sent: Wednesday, March 7, 2018 7:29 PM
> To: Liu, Changpeng <changpeng.liu@intel.com>
> Cc: virtio-dev@lists.oasis-open.org; virtio-comment@lists.oasis-open.org;
> cavery@redhat.com; pbonzini@redhat.com; Wodkowski, PawelX
> <pawelx.wodkowski@intel.com>; Harris, James R <james.r.harris@intel.com>
> Subject: Re: [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes
> features to specification
> 
> On Tue, Mar 06, 2018 at 10:50:20AM +0800, Changpeng Liu wrote:
> > +The \field{data} used for discard or write zeroes command is described
> > +by one or more virtio_blk_discard_write_zeroes structs. \field{sector}
> > +indicates the starting offset (multiplied by 512) of the segment, while
> 
> Did you mean "divided by 512 bytes" instead of "multiplied by 512"?
I saw the original expression for `sector` field used "multiplied by 512", so here just
follow it, should here use "expressed by 512-byte", maybe this looks more clear to
readers.

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes features to specification
  2018-03-08  1:01   ` Liu, Changpeng
@ 2018-03-08 13:04     ` Paolo Bonzini
  2018-03-09  2:31       ` Liu, Changpeng
  0 siblings, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2018-03-08 13:04 UTC (permalink / raw)
  To: Liu, Changpeng, Stefan Hajnoczi
  Cc: virtio-dev@lists.oasis-open.org,
	virtio-comment@lists.oasis-open.org, cavery@redhat.com,
	Wodkowski, PawelX, Harris, James R

On 08/03/2018 02:01, Liu, Changpeng wrote:
> 
> 
>> -----Original Message-----
>> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org] On
>> Behalf Of Stefan Hajnoczi
>> Sent: Wednesday, March 7, 2018 7:29 PM
>> To: Liu, Changpeng <changpeng.liu@intel.com>
>> Cc: virtio-dev@lists.oasis-open.org; virtio-comment@lists.oasis-open.org;
>> cavery@redhat.com; pbonzini@redhat.com; Wodkowski, PawelX
>> <pawelx.wodkowski@intel.com>; Harris, James R <james.r.harris@intel.com>
>> Subject: Re: [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes
>> features to specification
>>
>> On Tue, Mar 06, 2018 at 10:50:20AM +0800, Changpeng Liu wrote:
>>> +The \field{data} used for discard or write zeroes command is described
>>> +by one or more virtio_blk_discard_write_zeroes structs. \field{sector}
>>> +indicates the starting offset (multiplied by 512) of the segment, while
>>
>> Did you mean "divided by 512 bytes" instead of "multiplied by 512"?
> I saw the original expression for `sector` field used "multiplied by 512", so here just
> follow it, should here use "expressed by 512-byte", maybe this looks more clear to
> readers.

Actually the existing description of the "sector" field is wrong too.
"in 512-byte units" is probably the best way to express it.

Paolo

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes features to specification
  2018-03-08 13:04     ` Paolo Bonzini
@ 2018-03-09  2:31       ` Liu, Changpeng
  0 siblings, 0 replies; 6+ messages in thread
From: Liu, Changpeng @ 2018-03-09  2:31 UTC (permalink / raw)
  To: Paolo Bonzini, Stefan Hajnoczi
  Cc: virtio-dev@lists.oasis-open.org, cavery@redhat.com,
	Wodkowski, PawelX, Harris, James R



> -----Original Message-----
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org] On
> Behalf Of Paolo Bonzini
> Sent: Thursday, March 8, 2018 9:05 PM
> To: Liu, Changpeng <changpeng.liu@intel.com>; Stefan Hajnoczi
> <stefanha@redhat.com>
> Cc: virtio-dev@lists.oasis-open.org; virtio-comment@lists.oasis-open.org;
> cavery@redhat.com; Wodkowski, PawelX <pawelx.wodkowski@intel.com>; Harris,
> James R <james.r.harris@intel.com>
> Subject: Re: [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes features
> to specification
> 
> On 08/03/2018 02:01, Liu, Changpeng wrote:
> >
> >
> >> -----Original Message-----
> >> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> On
> >> Behalf Of Stefan Hajnoczi
> >> Sent: Wednesday, March 7, 2018 7:29 PM
> >> To: Liu, Changpeng <changpeng.liu@intel.com>
> >> Cc: virtio-dev@lists.oasis-open.org; virtio-comment@lists.oasis-open.org;
> >> cavery@redhat.com; pbonzini@redhat.com; Wodkowski, PawelX
> >> <pawelx.wodkowski@intel.com>; Harris, James R <james.r.harris@intel.com>
> >> Subject: Re: [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes
> >> features to specification
> >>
> >> On Tue, Mar 06, 2018 at 10:50:20AM +0800, Changpeng Liu wrote:
> >>> +The \field{data} used for discard or write zeroes command is described
> >>> +by one or more virtio_blk_discard_write_zeroes structs. \field{sector}
> >>> +indicates the starting offset (multiplied by 512) of the segment, while
> >>
> >> Did you mean "divided by 512 bytes" instead of "multiplied by 512"?
> > I saw the original expression for `sector` field used "multiplied by 512", so here
> just
> > follow it, should here use "expressed by 512-byte", maybe this looks more clear
> to
> > readers.
> 
> Actually the existing description of the "sector" field is wrong too.
> "in 512-byte units" is probably the best way to express it.
Okay, I will change the expression.
> 
> Paolo
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [virtio-dev] Re: [virtio-comment] [PATCH v3] virtio-blk: add discard and write zeroes features to specification
       [not found] ` <bc8d88a4-3069-188a-99aa-c1c75a6e13b2@redhat.com>
@ 2018-03-12 16:28   ` Paolo Bonzini
  0 siblings, 0 replies; 6+ messages in thread
From: Paolo Bonzini @ 2018-03-12 16:28 UTC (permalink / raw)
  To: Laszlo Ersek, Changpeng Liu
  Cc: virtio-dev, virtio-comment, cavery, stefanha, pawelx.wodkowski,
	james.r.harris

On 12/03/2018 17:19, Laszlo Ersek wrote:
> virtio-scsi already supports discard and write zeroes, and virtio-scsi
> can be backed by SSDs. Can you please describe the use case in the
> commit message (in a few words) that needs discard / write zeroes but is
> ill-served by virtio-scsi?
> 
> I believe the spec (and implementations) should be conservative about
> feature duplication unless there's a good reason for it. I don't doubt
> you have a good reason; can you please name it in the commit message?
> I'm not questioning your motives, I'd just like to see them documented
> in the commit log.

Hi Laszlo, the basic reason is that virtio-scsi is slower than
virtio-blk. :)

Indeed DISCARD/WRITE ZEROES was a reason to use virtio-scsi instead of
virtio-blk back when it was created, but at the time DISCARD was
considered a thin provisioning feature; that is, the idea was that
guests could cooperate with the host to save host disk space and to
speed up maintenance operation such as live disk migration.

Nowadays, DISCARD has evolved into a basic disk maintenance operation
and IMO it fits within the scope of virtio-blk, as the backend of choice
for high performance I/O devices.  (Though I wouldn't be surprised
however if, in a few years, virtio-blk has become obsolete and we'll
just use NVMe.  Recent versions of the NVMe spec have taken some
inspiration from Xen and virtio ring buffers, and perform great as
virtual devices; this is a similar "convergence" to what's happening in
virtio for the new packed ring format, just in the opposite direction).

Paolo

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-03-12 16:28 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-03-06  2:50 [virtio-dev] [PATCH v3] virtio-blk: add discard and write zeroes features to specification Changpeng Liu
2018-03-07 11:28 ` Stefan Hajnoczi
2018-03-08  1:01   ` Liu, Changpeng
2018-03-08 13:04     ` Paolo Bonzini
2018-03-09  2:31       ` Liu, Changpeng
     [not found] ` <bc8d88a4-3069-188a-99aa-c1c75a6e13b2@redhat.com>
2018-03-12 16:28   ` [virtio-dev] Re: [virtio-comment] " Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox