[Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack
@ 2013-11-19 17:07 Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 01/20] block: generalize BlockLimits handling to cover bdrv_aio_discard too Paolo Bonzini
                   ` (19 more replies)
  0 siblings, 20 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

This series fixes the implementation of WRITE SAME in the SCSI emulation
layer.  On top of this, it ensures that zero writes from the guest can
be offloaded to the host device or filesystem if that's supported.
This is a relatively important part of the thin provisioning feature,
and builds on the unmap flag to write_zeroes, recently added by Peter
Lieven.  The feature works with iscsi, block device and filesystem
targets.

v1->v2
        I cleaned up the series, putting all the block layer changes
        together and all the SCSI changes at the end.

        qcow2 and iscsi are now better covered, which caused the
        introduction of patches 6-11.  I also downloaded the latest
        which contributed some changes in patches 10-11.  It turns
        out we were too cautious in treating LBPRZ as advisory when
        using the UNMAP command.  On the other hand, the API we
        introduced is useful for qcow2/qed/vmdk WRITE SAME, so no
        damage done.

        Patch 3 is also new.

Paolo Bonzini (18):
  block: generalize BlockLimits handling to cover bdrv_aio_discard too
  block: add flags to BlockRequest
  block: add flags argument to bdrv_co_write_zeroes tracepoint
  block: add bdrv_aio_write_zeroes
  block: handle ENOTSUP from discard in generic code
  block: make bdrv_co_do_write_zeroes stricter in producing aligned requests
  vpc, vhdx: add get_info
  block drivers: add discard/write_zeroes properties to bdrv_get_info implementation
  block drivers: expose requirement for write same alignment from formats
  block/iscsi: check WRITE SAME support differently depending on MAY_UNMAP
  block/iscsi: use UNMAP to write zeroes if LBPRZ=1
  raw-posix: implement write_zeroes with MAY_UNMAP for files
  raw-posix: implement write_zeroes with MAY_UNMAP for block devices
  raw-posix: add support for write_zeroes on XFS and block devices
  qemu-iotests: 033 is fast
  scsi-disk: catch write protection errors in UNMAP
  scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands
  scsi-disk: correctly implement WRITE SAME

Peter Lieven (2):
  block/iscsi: remove .bdrv_has_zero_init
  block/iscsi: updated copyright

 block.c                  | 140 ++++++++++++++++++++-----------------
 block/iscsi.c            |  32 ++++++---
 block/qcow2.c            |   3 +
 block/qed.c              |   3 +
 block/raw-aio.h          |   3 +-
 block/raw-posix.c        | 175 ++++++++++++++++++++++++++++++++++++++++++++---
 block/vdi.c              |   3 +
 block/vhdx.c             |  14 ++++
 block/vmdk.c             |   4 ++
 block/vpc.c              |  16 +++++
 hw/scsi/scsi-disk.c      | 154 ++++++++++++++++++++++++++++++++++-------
 include/block/block.h    |   4 ++
 tests/qemu-iotests/group |   2 +-
 trace-events             |   4 +-
 14 files changed, 452 insertions(+), 105 deletions(-)

-- 
1.8.4.2

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 01/20] block: generalize BlockLimits handling to cover bdrv_aio_discard too
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 02/20] block: add flags to BlockRequest Paolo Bonzini
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

bdrv_co_discard is only covering drivers which have a .bdrv_co_discard()
implementation, but not those with .bdrv_aio_discard(). Not very nice,
and easy to avoid.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c | 80 +++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 41 insertions(+), 39 deletions(-)

diff --git a/block.c b/block.c
index d7de89d..2e37040 100644
--- a/block.c
+++ b/block.c
@@ -4281,6 +4281,8 @@ static void coroutine_fn bdrv_discard_co_entry(void *opaque)
 int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
                                  int nb_sectors)
 {
+    int max_discard;
+
     if (!bs->drv) {
         return -ENOMEDIUM;
     } else if (bdrv_check_request(bs, sector_num, nb_sectors)) {
@@ -4298,55 +4300,55 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
         return 0;
     }
 
-    if (bs->drv->bdrv_co_discard) {
-        int max_discard = bs->bl.max_discard ?
-                          bs->bl.max_discard : MAX_DISCARD_DEFAULT;
+    if (!bs->drv->bdrv_co_discard && !bs->drv->bdrv_aio_discard) {
+        return 0;
+    }
 
-        while (nb_sectors > 0) {
-            int ret;
-            int num = nb_sectors;
+    max_discard = bs->bl.max_discard ?  bs->bl.max_discard : MAX_DISCARD_DEFAULT;
+    while (nb_sectors > 0) {
+        int ret;
+        int num = nb_sectors;
 
-            /* align request */
-            if (bs->bl.discard_alignment &&
-                num >= bs->bl.discard_alignment &&
-                sector_num % bs->bl.discard_alignment) {
-                if (num > bs->bl.discard_alignment) {
-                    num = bs->bl.discard_alignment;
-                }
-                num -= sector_num % bs->bl.discard_alignment;
+        /* align request */
+        if (bs->bl.discard_alignment &&
+            num >= bs->bl.discard_alignment &&
+            sector_num % bs->bl.discard_alignment) {
+            if (num > bs->bl.discard_alignment) {
+                num = bs->bl.discard_alignment;
             }
+            num -= sector_num % bs->bl.discard_alignment;
+        }
 
-            /* limit request size */
-            if (num > max_discard) {
-                num = max_discard;
-            }
+        /* limit request size */
+        if (num > max_discard) {
+            num = max_discard;
+        }
 
+        if (bs->drv->bdrv_co_discard) {
             ret = bs->drv->bdrv_co_discard(bs, sector_num, num);
-            if (ret) {
-                return ret;
+        } else {
+            BlockDriverAIOCB *acb;
+            CoroutineIOCompletion co = {
+                .coroutine = qemu_coroutine_self(),
+            };
+
+            acb = bs->drv->bdrv_aio_discard(bs, sector_num, nb_sectors,
+                                            bdrv_co_io_em_complete, &co);
+            if (acb == NULL) {
+                return -EIO;
+            } else {
+                qemu_coroutine_yield();
+                ret = co.ret;
             }
-
-            sector_num += num;
-            nb_sectors -= num;
         }
-        return 0;
-    } else if (bs->drv->bdrv_aio_discard) {
-        BlockDriverAIOCB *acb;
-        CoroutineIOCompletion co = {
-            .coroutine = qemu_coroutine_self(),
-        };
-
-        acb = bs->drv->bdrv_aio_discard(bs, sector_num, nb_sectors,
-                                        bdrv_co_io_em_complete, &co);
-        if (acb == NULL) {
-            return -EIO;
-        } else {
-            qemu_coroutine_yield();
-            return co.ret;
+        if (ret) {
+            return ret;
         }
-    } else {
-        return 0;
+
+        sector_num += num;
+        nb_sectors -= num;
     }
+    return 0;
 }
 
 int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 02/20] block: add flags to BlockRequest
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 01/20] block: generalize BlockLimits handling to cover bdrv_aio_discard too Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 03/20] block: add flags argument to bdrv_co_write_zeroes tracepoint Paolo Bonzini
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

This lets bdrv_co_do_rw receive flags, so that it can be used for
zero writes.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c               | 17 +++++++++++------
 include/block/block.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index 2e37040..909fddb 100644
--- a/block.c
+++ b/block.c
@@ -74,6 +74,7 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
                                                int nb_sectors,
+                                               BdrvRequestFlags flags,
                                                BlockDriverCompletionFunc *cb,
                                                void *opaque,
                                                bool is_write);
@@ -3648,7 +3649,7 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
 {
     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
 
-    return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors,
+    return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors, 0,
                                  cb, opaque, false);
 }
 
@@ -3658,7 +3659,7 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
 {
     trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
 
-    return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors,
+    return bdrv_co_aio_rw_vector(bs, sector_num, qiov, nb_sectors, 0,
                                  cb, opaque, true);
 }
 
@@ -3830,8 +3831,10 @@ int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
     /* Run the aio requests. */
     mcb->num_requests = num_reqs;
     for (i = 0; i < num_reqs; i++) {
-        bdrv_aio_writev(bs, reqs[i].sector, reqs[i].qiov,
-            reqs[i].nb_sectors, multiwrite_cb, mcb);
+        bdrv_co_aio_rw_vector(bs, reqs[i].sector, reqs[i].qiov,
+                              reqs[i].nb_sectors, reqs[i].flags,
+                              multiwrite_cb, mcb,
+                              true);
     }
 
     return 0;
@@ -3973,10 +3976,10 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque)
 
     if (!acb->is_write) {
         acb->req.error = bdrv_co_do_readv(bs, acb->req.sector,
-            acb->req.nb_sectors, acb->req.qiov, 0);
+            acb->req.nb_sectors, acb->req.qiov, acb->req.flags);
     } else {
         acb->req.error = bdrv_co_do_writev(bs, acb->req.sector,
-            acb->req.nb_sectors, acb->req.qiov, 0);
+            acb->req.nb_sectors, acb->req.qiov, acb->req.flags);
     }
 
     acb->bh = qemu_bh_new(bdrv_co_em_bh, acb);
@@ -3987,6 +3990,7 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
                                                int nb_sectors,
+                                               BdrvRequestFlags flags,
                                                BlockDriverCompletionFunc *cb,
                                                void *opaque,
                                                bool is_write)
@@ -3998,6 +4002,7 @@ static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
     acb->req.sector = sector_num;
     acb->req.nb_sectors = nb_sectors;
     acb->req.qiov = qiov;
+    acb->req.flags = flags;
     acb->is_write = is_write;
     acb->done = NULL;
 
diff --git a/include/block/block.h b/include/block/block.h
index 4d9e67c..703d875 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -311,6 +311,7 @@ typedef struct BlockRequest {
     /* Fields to be filled by multiwrite caller */
     int64_t sector;
     int nb_sectors;
+    int flags;
     QEMUIOVector *qiov;
     BlockDriverCompletionFunc *cb;
     void *opaque;
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 03/20] block: add flags argument to bdrv_co_write_zeroes tracepoint
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 01/20] block: generalize BlockLimits handling to cover bdrv_aio_discard too Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 02/20] block: add flags to BlockRequest Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-20  9:59   ` Stefan Hajnoczi
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 04/20] block: add bdrv_aio_write_zeroes Paolo Bonzini
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c      | 2 +-
 trace-events | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 909fddb..0c3f410 100644
--- a/block.c
+++ b/block.c
@@ -2880,7 +2880,7 @@ int coroutine_fn bdrv_co_write_zeroes(BlockDriverState *bs,
                                       int64_t sector_num, int nb_sectors,
                                       BdrvRequestFlags flags)
 {
-    trace_bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
+    trace_bdrv_co_write_zeroes(bs, sector_num, nb_sectors, flags);
 
     if (!(bs->open_flags & BDRV_O_UNMAP)) {
         flags &= ~BDRV_REQ_MAY_UNMAP;
diff --git a/trace-events b/trace-events
index 8695e9e..6175b61 100644
--- a/trace-events
+++ b/trace-events
@@ -64,7 +64,7 @@ bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
 bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
-bdrv_co_write_zeroes(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
+bdrv_co_write_zeroes(void *bs, int64_t sector_num, int nb_sector, int flags) "bs %p sector_num %"PRId64" nb_sectors %d flags %d"
 bdrv_co_io_em(void *bs, int64_t sector_num, int nb_sectors, int is_write, void *acb) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d acb %p"
 bdrv_co_do_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t cluster_sector_num, int cluster_nb_sectors) "bs %p sector_num %"PRId64" nb_sectors %d cluster_sector_num %"PRId64" cluster_nb_sectors %d"
 
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 04/20] block: add bdrv_aio_write_zeroes
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (2 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 03/20] block: add flags argument to bdrv_co_write_zeroes tracepoint Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-20 10:02   ` Stefan Hajnoczi
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 05/20] block: handle ENOTSUP from discard in generic code Paolo Bonzini
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

This will be used by the SCSI layer.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c               | 11 +++++++++++
 include/block/block.h |  3 +++
 trace-events          |  1 +
 3 files changed, 15 insertions(+)

diff --git a/block.c b/block.c
index 0c3f410..f5428f4 100644
--- a/block.c
+++ b/block.c
@@ -3663,6 +3663,17 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                  cb, opaque, true);
 }
 
+BlockDriverAIOCB *bdrv_aio_write_zeroes(BlockDriverState *bs,
+        int64_t sector_num, int nb_sectors, BdrvRequestFlags flags,
+        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    trace_bdrv_aio_write_zeroes(bs, sector_num, nb_sectors, flags, opaque);
+
+    return bdrv_co_aio_rw_vector(bs, sector_num, NULL, nb_sectors,
+                                 BDRV_REQ_ZERO_WRITE | flags,
+                                 cb, opaque, true);
+}
+
 
 typedef struct MultiwriteCB {
     int error;
diff --git a/include/block/block.h b/include/block/block.h
index 703d875..4967ed2 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -216,6 +216,9 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
                const uint8_t *buf, int nb_sectors);
 int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
                int nb_sectors, BdrvRequestFlags flags);
+BlockDriverAIOCB *bdrv_aio_write_zeroes(BlockDriverState *bs, int64_t sector_num,
+                                        int nb_sectors, BdrvRequestFlags flags,
+                                        BlockDriverCompletionFunc *cb, void *opaque);
 int bdrv_make_zero(BlockDriverState *bs, BdrvRequestFlags flags);
 int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov);
 int bdrv_pread(BlockDriverState *bs, int64_t offset,
diff --git a/trace-events b/trace-events
index 6175b61..a18ea59 100644
--- a/trace-events
+++ b/trace-events
@@ -60,6 +60,7 @@ bdrv_aio_discard(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs
 bdrv_aio_flush(void *bs, void *opaque) "bs %p opaque %p"
 bdrv_aio_readv(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d opaque %p"
 bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d opaque %p"
+bdrv_aio_write_zeroes(void *bs, int64_t sector_num, int nb_sectors, int flags, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d flags %d opaque %p"
 bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
 bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
 bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 05/20] block: handle ENOTSUP from discard in generic code
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (3 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 04/20] block: add bdrv_aio_write_zeroes Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests Paolo Bonzini
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Similar to write_zeroes, let the generic code receive a ENOTSUP for
discard operations.  Since bdrv_discard has advisory semantics,
we can just swallow the error.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c           |  2 +-
 block/raw-posix.c | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/block.c b/block.c
index f5428f4..4897649 100644
--- a/block.c
+++ b/block.c
@@ -4357,7 +4357,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
                 ret = co.ret;
             }
         }
-        if (ret) {
+        if (ret && ret != -ENOTSUP) {
             return ret;
         }
 
diff --git a/block/raw-posix.c b/block/raw-posix.c
index ace5d96..f719b69 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -324,10 +324,10 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif
 
-    s->has_discard = 1;
+    s->has_discard = true;
 #ifdef CONFIG_XFS
     if (platform_test_xfs_fd(s->fd)) {
-        s->is_xfs = 1;
+        s->is_xfs = true;
     }
 #endif
 
@@ -699,8 +699,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
     int ret = -EOPNOTSUPP;
     BDRVRawState *s = aiocb->bs->opaque;
 
-    if (s->has_discard == 0) {
-        return 0;
+    if (!s->has_discard) {
+        return -ENOTSUP;
     }
 
     if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
@@ -735,8 +735,8 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 
     if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
         ret == -ENOTTY) {
-        s->has_discard = 0;
-        ret = 0;
+        s->has_discard = false;
+        ret = -ENOTSUP;
     }
     return ret;
 }
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (4 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 05/20] block: handle ENOTSUP from discard in generic code Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-20 10:22   ` Stefan Hajnoczi
  2013-11-21 11:30   ` Peter Lieven
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 07/20] vpc, vhdx: add get_info Paolo Bonzini
                   ` (13 subsequent siblings)
  19 siblings, 2 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Right now, bdrv_co_do_write_zeroes will only try to align the
beginning of the request.  However, it is simpler for many
formats to expect the block layer to separate both the head *and*
the tail.  This makes sure that the format's bdrv_co_write_zeroes
function will be called with aligned sector_num and nb_sectors for
the bulk of the request.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/block.c b/block.c
index 4897649..567d669 100644
--- a/block.c
+++ b/block.c
@@ -2761,14 +2761,19 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     while (nb_sectors > 0 && !ret) {
         int num = nb_sectors;
 
-        /* align request */
-        if (bs->bl.write_zeroes_alignment &&
-            num >= bs->bl.write_zeroes_alignment &&
-            sector_num % bs->bl.write_zeroes_alignment) {
-            if (num > bs->bl.write_zeroes_alignment) {
+        /* Align request.  Block drivers can expect the "bulk" of the request
+         * to be aligned.
+         */
+        if (bs->bl.write_zeroes_alignment
+            && num > bs->bl.write_zeroes_alignment) {
+            if (sector_num % bs->bl.write_zeroes_alignment != 0) {
+                /* Make a small request up to the first aligned sector.  */
                 num = bs->bl.write_zeroes_alignment;
+                num -= sector_num % bs->bl.write_zeroes_alignment;
+            } else if (num >= bs->bl.write_zeroes_alignment) {
+                /* Shorten the request to the last aligned sector.  */
+                num -= (sector_num + num) % bs->bl.write_zeroes_alignment;
             }
-            num -= sector_num % bs->bl.write_zeroes_alignment;
         }
 
         /* limit request size */
@@ -2785,24 +2790,27 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         if (ret == -ENOTSUP) {
             /* Fall back to bounce buffer if write zeroes is unsupported */
             iov.iov_len = num * BDRV_SECTOR_SIZE;
             if (iov.iov_base == NULL) {
-                /* allocate bounce buffer only once and ensure that it
-                 * is big enough for this and all future requests.
-                 */
-                size_t bufsize = num <= nb_sectors ? num : max_write_zeroes;
-                iov.iov_base = qemu_blockalign(bs, bufsize * BDRV_SECTOR_SIZE);
-                memset(iov.iov_base, 0, bufsize * BDRV_SECTOR_SIZE);
+                iov.iov_base = qemu_blockalign(bs, num * BDRV_SECTOR_SIZE);
+                memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
             }
             qemu_iovec_init_external(&qiov, &iov, 1);
 
             ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
+            if (num <= max_write_zeroes) {
+                /* Allocate bounce buffer only once if it is
+                 * big enough for this and all future requests.
+                 */
+                qemu_vfree(iov.iov_base);
+                iov.iov_base = NULL;
+            }
         }
 
         sector_num += num;
         nb_sectors -= num;
     }
 
     qemu_vfree(iov.iov_base);
     return ret;
 }
 
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 07/20] vpc, vhdx: add get_info
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (5 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-20 12:39   ` Stefan Hajnoczi
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 08/20] block drivers: add discard/write_zeroes properties to bdrv_get_info implementation Paolo Bonzini
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/vhdx.c | 11 +++++++++++
 block/vpc.c  | 14 ++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/block/vhdx.c b/block/vhdx.c
index 7d1af96..9ab2b39 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -1043,6 +1043,16 @@ static void vhdx_block_translate(BDRVVHDXState *s, int64_t sector_num,
 }
 
 
+static int vhdx_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    BDRVVHDXState *s = bs->opaque;
+
+    bdi->cluster_size =
+        (s->logical_sector_size / BDRV_SECTOR_SIZE) * s->block_size;
+
+    return 0;
+}
+
 
 static coroutine_fn int vhdx_co_readv(BlockDriverState *bs, int64_t sector_num,
                                       int nb_sectors, QEMUIOVector *qiov)
@@ -1885,6 +1895,7 @@ static BlockDriver bdrv_vhdx = {
     .bdrv_co_readv          = vhdx_co_readv,
     .bdrv_co_writev         = vhdx_co_writev,
     .bdrv_create            = vhdx_create,
+    .bdrv_get_info          = vhdx_get_info,
 
     .create_options         = vhdx_create_options,
 };
diff --git a/block/vpc.c b/block/vpc.c
index 577cc45..551876f 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -455,6 +455,18 @@ fail:
     return -1;
 }
 
+static int vpc_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    BDRVVPCState *s = (BDRVVPCState *)bs->opaque;
+    VHDFooter *footer = (VHDFooter *) s->footer_buf;
+
+    if (cpu_to_be32(footer->type) != VHD_FIXED) {
+        bdi->cluster_size = s->block_size;
+    }
+
+    return 0;
+}
+
 static int vpc_read(BlockDriverState *bs, int64_t sector_num,
                     uint8_t *buf, int nb_sectors)
 {
@@ -857,6 +869,8 @@ static BlockDriver bdrv_vpc = {
     .bdrv_read              = vpc_co_read,
     .bdrv_write             = vpc_co_write,
 
+    .bdrv_get_info          = vpc_get_info,
+
     .create_options         = vpc_create_options,
     .bdrv_has_zero_init     = vpc_has_zero_init,
 };
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 08/20] block drivers: add discard/write_zeroes properties to bdrv_get_info implementation
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (6 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 07/20] vpc, vhdx: add get_info Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-21 11:33   ` Peter Lieven
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 09/20] block drivers: expose requirement for write same alignment from formats Paolo Bonzini
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/qcow2.c | 2 ++
 block/qed.c   | 2 ++
 block/vdi.c   | 3 +++
 block/vhdx.c  | 3 +++
 block/vpc.c   | 2 ++
 5 files changed, 12 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2fe37ed..92ed895 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1892,6 +1892,8 @@ static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
 static int qcow2_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
     BDRVQcowState *s = bs->opaque;
+    bdi->unallocated_blocks_are_zero = (bs->backing_hd == NULL);
+    bdi->can_write_zeroes_with_unmap = (s->qcow_version >= 3);
     bdi->cluster_size = s->cluster_size;
     bdi->vm_state_offset = qcow2_vm_state_offset(s);
     return 0;
diff --git a/block/qed.c b/block/qed.c
index adc2736..1f87420 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1475,6 +1475,8 @@ static int bdrv_qed_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     memset(bdi, 0, sizeof(*bdi));
     bdi->cluster_size = s->header.cluster_size;
     bdi->is_dirty = s->header.features & QED_F_NEED_CHECK;
+    bdi->unallocated_blocks_are_zero = (bs->backing_hd == NULL);
+    bdi->can_write_zeroes_with_unmap = true;
     return 0;
 }
 
diff --git a/block/vdi.c b/block/vdi.c
index b6ec002..3a05027 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -331,6 +331,9 @@ static int vdi_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     logout("\n");
     bdi->cluster_size = s->block_size;
     bdi->vm_state_offset = 0;
+
+    /* Backing file not supported yet.  */
+    bdi->unallocated_blocks_are_zero = true;
     return 0;
 }
 
diff --git a/block/vhdx.c b/block/vhdx.c
index 9ab2b39..d06585a 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -1050,6 +1050,9 @@ static int vhdx_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     bdi->cluster_size =
         (s->logical_sector_size / BDRV_SECTOR_SIZE) * s->block_size;
 
+    bdi->unallocated_blocks_are_zero =
+        (s->params.data_bits & VHDX_PARAMS_HAS_PARENT) == 0;
+
     return 0;
 }
 
diff --git a/block/vpc.c b/block/vpc.c
index 551876f..d6a47ef 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -464,6 +464,8 @@ static int vpc_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
         bdi->cluster_size = s->block_size;
     }
 
+    /* Backing file not supported yet.  */
+    bdi->unallocated_blocks_are_zero = true;
     return 0;
 }
 
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 09/20] block drivers: expose requirement for write same alignment from formats
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (7 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 08/20] block drivers: add discard/write_zeroes properties to bdrv_get_info implementation Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 10/20] block/iscsi: remove .bdrv_has_zero_init Paolo Bonzini
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

This will let misaligned but large requests use zero clusters.  This
is important because the cluster size is not guest visible.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/qcow2.c | 1 +
 block/qed.c   | 1 +
 block/vmdk.c  | 4 ++++
 3 files changed, 6 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 92ed895..9cc247b 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -718,6 +718,7 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     qemu_opts_del(opts);
+    bs->bl.write_zeroes_alignment = s->cluster_sectors;
 
     if (s->use_lazy_refcounts && s->qcow_version < 3) {
         error_setg(errp, "Lazy refcounts require a qcow2 image with at least "
diff --git a/block/qed.c b/block/qed.c
index 1f87420..23247f3 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -495,6 +495,7 @@ static int bdrv_qed_open(BlockDriverState *bs, QDict *options, int flags,
         }
     }
 
+    bs->bl.write_zeroes_alignment = s->header.cluster_size >> BDRV_SECTOR_BITS;
     s->need_check_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
                                             qed_need_check_timer_cb, s);
 
diff --git a/block/vmdk.c b/block/vmdk.c
index 6555663..63ea7ff 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -428,6 +428,10 @@ static int vmdk_add_extent(BlockDriverState *bs,
     extent->l2_size = l2_size;
     extent->cluster_sectors = flat ? sectors : cluster_sectors;
 
+    if (!flat) {
+        bs->bl.write_zeroes_alignment =
+            MAX(bs->bl.write_zeroes_alignment, cluster_sectors);
+    }
     if (s->num_extents > 1) {
         extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
     } else {
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 10/20] block/iscsi: remove .bdrv_has_zero_init
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (8 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 09/20] block drivers: expose requirement for write same alignment from formats Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 11/20] block/iscsi: updated copyright Paolo Bonzini
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

From: Peter Lieven <pl@kamp.de>

since commit 3ac21627 the default value changed to 0.

Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/iscsi.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index b7b5238..b6b62aa 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1505,11 +1505,6 @@ static int iscsi_truncate(BlockDriverState *bs, int64_t offset)
     return 0;
 }
 
-static int iscsi_has_zero_init(BlockDriverState *bs)
-{
-    return 0;
-}
-
 static int iscsi_create(const char *filename, QEMUOptionParameter *options,
                         Error **errp)
 {
@@ -1608,8 +1603,6 @@ static BlockDriver bdrv_iscsi = {
     .bdrv_aio_writev = iscsi_aio_writev,
     .bdrv_aio_flush  = iscsi_aio_flush,
 
-    .bdrv_has_zero_init = iscsi_has_zero_init,
-
 #ifdef __linux__
     .bdrv_ioctl       = iscsi_ioctl,
     .bdrv_aio_ioctl   = iscsi_aio_ioctl,
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 11/20] block/iscsi: updated copyright
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (9 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 10/20] block/iscsi: remove .bdrv_has_zero_init Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 12/20] block/iscsi: check WRITE SAME support differently depending on MAY_UNMAP Paolo Bonzini
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

From: Peter Lieven <pl@kamp.de>

added myself to reflect recent work on the iscsi block driver.

Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/iscsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/iscsi.c b/block/iscsi.c
index b6b62aa..20f4f55 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -2,6 +2,7 @@
  * QEMU Block driver for iSCSI images
  *
  * Copyright (c) 2010-2011 Ronnie Sahlberg <ronniesahlberg@gmail.com>
+ * Copyright (c) 2012-2013 Peter Lieven <pl@kamp.de>
  *
  * Permission is hereby granted, free of charge, to any person obtaining a copy
  * of this software and associated documentation files (the "Software"), to deal
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 12/20] block/iscsi: check WRITE SAME support differently depending on MAY_UNMAP
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (10 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 11/20] block/iscsi: updated copyright Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 13/20] block/iscsi: use UNMAP to write zeroes if LBPRZ=1 Paolo Bonzini
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

The current check is right for MAY_UNMAP=1.  For MAY_UNMAP=0, just
try and fall back to regular writes as soon as a WRITE SAME command
fails.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/iscsi.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 20f4f55..93fee6d 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -55,6 +55,7 @@ typedef struct IscsiLun {
     QEMUTimer *nop_timer;
     uint8_t lbpme;
     uint8_t lbprz;
+    uint8_t has_write_same;
     struct scsi_inquiry_logical_block_provisioning lbp;
     struct scsi_inquiry_block_limits bl;
     unsigned char *zeroblock;
@@ -976,8 +977,13 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
         return -EINVAL;
     }
 
-    if (!iscsilun->lbp.lbpws) {
-        /* WRITE SAME is not supported by the target */
+    if (!(flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->has_write_same) {
+        /* WRITE SAME without UNMAP is not supported by the target */
+        return -ENOTSUP;
+    }
+
+    if ((flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->lbp.lbpws) {
+        /* WRITE SAME with UNMAP is not supported by the target */
         return -ENOTSUP;
     }
 
@@ -1012,6 +1018,14 @@ retry:
     }
 
     if (iTask.status != SCSI_STATUS_GOOD) {
+        if (iTask.status == SCSI_STATUS_CHECK_CONDITION &&
+            iTask.task->sense.key == SCSI_SENSE_ILLEGAL_REQUEST &&
+            iTask.task->sense.ascq == SCSI_SENSE_ASCQ_INVALID_OPERATION_CODE) {
+            /* WRITE SAME is not supported by the target */
+            iscsilun->has_write_same = false;
+            return -ENOTSUP;
+        }
+
         return -EIO;
     }
 
@@ -1375,6 +1389,7 @@ static int iscsi_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     iscsilun->type = inq->periperal_device_type;
+    iscsilun->has_write_same = true;
 
     if ((ret = iscsi_readcapacity_sync(iscsilun)) != 0) {
         goto out;
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 13/20] block/iscsi: use UNMAP to write zeroes if LBPRZ=1
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (11 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 12/20] block/iscsi: check WRITE SAME support differently depending on MAY_UNMAP Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-21 11:43   ` Peter Lieven
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 14/20] raw-posix: implement write_zeroes with MAY_UNMAP for files Paolo Bonzini
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

The latest revision of SCSI SBC clarifies the semantics of LBPRZ
in a way that lets the iscsi driver remove the distinction between
bdrv_unallocated_blocks_are_zero and bdrv_can_write_zeroes_with_unmap.
See Table 8:

    "[If] the LBA became mapped as the result of an autonomous transition,
    and no write command has specified that LBA since the LBA was mapped,
    [a read operation returns the] logical block data that would be returned
    if that autonomous transition had not occurred and the LBA had remained
    unmapped."

Thus, we can assume that even after an UNMAP command, unallocated blocks
return zero.  The distinction may remain for other drivers, for example
qcow2 or qed or vmdk.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/iscsi.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 93fee6d..63b451d 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -984,6 +984,9 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
 
     if ((flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->lbp.lbpws) {
         /* WRITE SAME with UNMAP is not supported by the target */
+        if (iscsilun->lbp.lbpu && iscsilun->lbprz) {
+            return iscsi_co_discard(bs, sector_num, nb_sectors);
+        }
         return -ENOTSUP;
     }
 
@@ -1579,7 +1582,7 @@ static int iscsi_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
     IscsiLun *iscsilun = bs->opaque;
     bdi->unallocated_blocks_are_zero = !!iscsilun->lbprz;
-    bdi->can_write_zeroes_with_unmap = iscsilun->lbprz && iscsilun->lbp.lbpws;
+    bdi->can_write_zeroes_with_unmap = !!iscsilun->lbprz;
     return 0;
 }
 
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 14/20] raw-posix: implement write_zeroes with MAY_UNMAP for files
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (12 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 13/20] block/iscsi: use UNMAP to write zeroes if LBPRZ=1 Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 15/20] raw-posix: implement write_zeroes with MAY_UNMAP for block devices Paolo Bonzini
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Writing zeroes to a file can be done by punching a hole if
MAY_UNMAP is set.

Note that in this case ENOTSUP is not ignored, but makes
the block layer fall back to the generic implementation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/raw-posix.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 trace-events      |  1 +
 2 files changed, 65 insertions(+), 2 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index f719b69..a7011d5 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -139,9 +139,10 @@ typedef struct BDRVRawState {
     void *aio_ctx;
 #endif
 #ifdef CONFIG_XFS
-    bool is_xfs : 1;
+    bool is_xfs:1;
 #endif
-    bool has_discard : 1;
+    bool has_discard:1;
+    bool discard_zeroes:1;
 } BDRVRawState;
 
 typedef struct BDRVRawReopenState {
@@ -283,6 +284,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     Error *local_err = NULL;
     const char *filename;
     int fd, ret;
+    struct stat st;
 
     opts = qemu_opts_create_nofail(&raw_runtime_opts);
     qemu_opts_absorb_qdict(opts, options, &local_err);
@@ -325,6 +327,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 #endif
 
     s->has_discard = true;
+
+    if (fstat(s->fd, &st) < 0) {
+        error_setg_errno(errp, errno, "Could not stat file");
+        goto fail;
+    }
+    if (S_ISREG(st.st_mode)) {
+        s->discard_zeroes = true;
+    }
+
 #ifdef CONFIG_XFS
     if (platform_test_xfs_fd(s->fd)) {
         s->is_xfs = true;
@@ -788,6 +799,29 @@ static int aio_worker(void *arg)
     return ret;
 }
 
+static int paio_submit_co(BlockDriverState *bs, int fd,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
+        int type)
+{
+    RawPosixAIOData *acb = g_slice_new(RawPosixAIOData);
+    ThreadPool *pool;
+
+    acb->bs = bs;
+    acb->aio_type = type;
+    acb->aio_fildes = fd;
+
+    if (qiov) {
+        acb->aio_iov = qiov->iov;
+        acb->aio_niov = qiov->niov;
+    }
+    acb->aio_nbytes = nb_sectors * 512;
+    acb->aio_offset = sector_num * 512;
+
+    trace_paio_submit_co(sector_num, nb_sectors, type);
+    pool = aio_get_thread_pool(bdrv_get_aio_context(bs));
+    return thread_pool_submit_co(pool, aio_worker, acb);
+}
+
 static BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
         int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque, int type)
@@ -1200,6 +1234,31 @@ static coroutine_fn BlockDriverAIOCB *raw_aio_discard(BlockDriverState *bs,
                        cb, opaque, QEMU_AIO_DISCARD);
 }
 
+static int coroutine_fn raw_co_write_zeroes(
+    BlockDriverState *bs, int64_t sector_num,
+    int nb_sectors, BdrvRequestFlags flags)
+{
+    BDRVRawState *s = bs->opaque;
+
+    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+        return -ENOTSUP;
+    }
+    if (!s->discard_zeroes) {
+        return -ENOTSUP;
+    }
+    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                          QEMU_AIO_DISCARD);
+}
+
+static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    BDRVRawState *s = bs->opaque;
+
+    bdi->unallocated_blocks_are_zero = s->discard_zeroes;
+    bdi->can_write_zeroes_with_unmap = s->discard_zeroes;
+    return 0;
+}
+
 static QEMUOptionParameter raw_create_options[] = {
     {
         .name = BLOCK_OPT_SIZE,
@@ -1223,6 +1282,7 @@ static BlockDriver bdrv_file = {
     .bdrv_create = raw_create,
     .bdrv_has_zero_init = bdrv_has_zero_init_1,
     .bdrv_co_get_block_status = raw_co_get_block_status,
+    .bdrv_co_write_zeroes = raw_co_write_zeroes,
 
     .bdrv_aio_readv = raw_aio_readv,
     .bdrv_aio_writev = raw_aio_writev,
@@ -1231,6 +1291,7 @@ static BlockDriver bdrv_file = {
 
     .bdrv_truncate = raw_truncate,
     .bdrv_getlength = raw_getlength,
+    .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
 
@@ -1586,6 +1647,7 @@ static BlockDriver bdrv_host_device = {
 
     .bdrv_truncate      = raw_truncate,
     .bdrv_getlength	= raw_getlength,
+    .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
 
diff --git a/trace-events b/trace-events
index a18ea59..d4c2b3e 100644
--- a/trace-events
+++ b/trace-events
@@ -128,6 +128,7 @@ thread_pool_cancel(void *req, void *opaque) "req %p opaque %p"
 
 # block/raw-win32.c
 # block/raw-posix.c
+paio_submit_co(int64_t sector_num, int nb_sectors, int type) "sector_num %"PRId64" nb_sectors %d type %d"
 paio_submit(void *acb, void *opaque, int64_t sector_num, int nb_sectors, int type) "acb %p opaque %p sector_num %"PRId64" nb_sectors %d type %d"
 
 # ioport.c
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 15/20] raw-posix: implement write_zeroes with MAY_UNMAP for block devices
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (13 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 14/20] raw-posix: implement write_zeroes with MAY_UNMAP for files Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 16/20] raw-posix: add support for write_zeroes on XFS and " Paolo Bonzini
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

See the next commit for the description of the Linux kernel problem
that is worked around in raw_open_common.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/raw-posix.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index a7011d5..928ac70 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -335,6 +335,22 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     if (S_ISREG(st.st_mode)) {
         s->discard_zeroes = true;
     }
+    if (S_ISBLK(st.st_mode)) {
+#ifdef BLKDISCARDZEROES
+        unsigned int arg;
+        if (ioctl(s->fd, BLKDISCARDZEROES, &arg) == 0 && arg) {
+            s->discard_zeroes = true;
+        }
+#endif
+#ifdef __linux__
+        /* On Linux 3.10, BLKDISCARD leaves stale data in the page cache.  Do
+         * not rely on the contents of discarded blocks unless using O_DIRECT.
+         */
+        if (!(bs->open_flags & BDRV_O_NOCACHE)) {
+            s->discard_zeroes = false;
+        }
+#endif
+    }
 
 #ifdef CONFIG_XFS
     if (platform_test_xfs_fd(s->fd)) {
@@ -1587,6 +1603,26 @@ static coroutine_fn BlockDriverAIOCB *hdev_aio_discard(BlockDriverState *bs,
                        cb, opaque, QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
 }
 
+static coroutine_fn int hdev_co_write_zeroes(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors, BdrvRequestFlags flags)
+{
+    BDRVRawState *s = bs->opaque;
+    int rc;
+
+    rc = fd_open(bs);
+    if (rc < 0) {
+        return rc;
+    }
+    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+        return -ENOTSUP;
+    }
+    if (!s->discard_zeroes) {
+        return -ENOTSUP;
+    }
+    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                          QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
+}
+
 static int hdev_create(const char *filename, QEMUOptionParameter *options,
                        Error **errp)
 {
@@ -1639,6 +1675,7 @@ static BlockDriver bdrv_host_device = {
     .bdrv_reopen_abort   = raw_reopen_abort,
     .bdrv_create        = hdev_create,
     .create_options     = raw_create_options,
+    .bdrv_co_write_zeroes = hdev_co_write_zeroes,
 
     .bdrv_aio_readv	= raw_aio_readv,
     .bdrv_aio_writev	= raw_aio_writev,
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 16/20] raw-posix: add support for write_zeroes on XFS and block devices
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (14 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 15/20] raw-posix: implement write_zeroes with MAY_UNMAP for block devices Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 17/20] qemu-iotests: 033 is fast Paolo Bonzini
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

The code is similar to the implementation of discard and write_zeroes
with UNMAP.  However, failure must be propagated up to block.c.

The stale page cache problem can be reproduced as follows:

    # modprobe scsi-debug lbpws=1 lbprz=1
    # ./qemu-io /dev/sdXX
    qemu-io> write -P 0xcc 0 2M
    qemu-io> write -z 0 1M
    qemu-io> read -P 0x00 0 512
    Pattern verification failed at offset 0, 512 bytes
    qemu-io> read -v 0 512
    00000000:  cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
    ...

    # ./qemu-io --cache=none /dev/sdXX
    qemu-io> write -P 0xcc 0 2M
    qemu-io> write -z 0 1M
    qemu-io> read -P 0x00 0 512
    qemu-io> read -v 0 512
    00000000:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    ...

And similarly with discard instead of "write -z".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/raw-aio.h   |  3 +-
 block/raw-posix.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 74 insertions(+), 13 deletions(-)

diff --git a/block/raw-aio.h b/block/raw-aio.h
index c61f159..7ad0a8a 100644
--- a/block/raw-aio.h
+++ b/block/raw-aio.h
@@ -21,9 +21,10 @@
 #define QEMU_AIO_IOCTL        0x0004
 #define QEMU_AIO_FLUSH        0x0008
 #define QEMU_AIO_DISCARD      0x0010
+#define QEMU_AIO_WRITE_ZEROES 0x0020
 #define QEMU_AIO_TYPE_MASK \
         (QEMU_AIO_READ|QEMU_AIO_WRITE|QEMU_AIO_IOCTL|QEMU_AIO_FLUSH| \
-         QEMU_AIO_DISCARD)
+         QEMU_AIO_DISCARD|QEMU_AIO_WRITE_ZEROES)
 
 /* AIO flags */
 #define QEMU_AIO_MISALIGNED   0x1000
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 928ac70..1b43a4e 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -142,6 +142,7 @@ typedef struct BDRVRawState {
     bool is_xfs:1;
 #endif
     bool has_discard:1;
+    bool has_write_zeroes:1;
     bool discard_zeroes:1;
 } BDRVRawState;
 
@@ -327,6 +328,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 #endif
 
     s->has_discard = true;
+    s->has_write_zeroes = true;
 
     if (fstat(s->fd, &st) < 0) {
         error_setg_errno(errp, errno, "Could not stat file");
@@ -345,9 +347,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 #ifdef __linux__
         /* On Linux 3.10, BLKDISCARD leaves stale data in the page cache.  Do
          * not rely on the contents of discarded blocks unless using O_DIRECT.
+         * Same for BLKZEROOUT.
          */
         if (!(bs->open_flags & BDRV_O_NOCACHE)) {
             s->discard_zeroes = false;
+            s->has_write_zeroes = false;
         }
 #endif
     }
@@ -703,6 +707,23 @@ static ssize_t handle_aiocb_rw(RawPosixAIOData *aiocb)
 }
 
 #ifdef CONFIG_XFS
+static int xfs_write_zeroes(BDRVRawState *s, int64_t offset, uint64_t bytes)
+{
+    struct xfs_flock64 fl;
+
+    memset(&fl, 0, sizeof(fl));
+    fl.l_whence = SEEK_SET;
+    fl.l_start = offset;
+    fl.l_len = bytes;
+
+    if (xfsctl(NULL, s->fd, XFS_IOC_ZERO_RANGE, &fl) < 0) {
+        DEBUG_BLOCK_PRINT("cannot write zero range (%s)\n", strerror(errno));
+        return -errno;
+    }
+
+    return 0;
+}
+
 static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
 {
     struct xfs_flock64 fl;
@@ -721,6 +742,42 @@ static int xfs_discard(BDRVRawState *s, int64_t offset, uint64_t bytes)
 }
 #endif
 
+static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
+{
+    int ret = -EOPNOTSUPP;
+    BDRVRawState *s = aiocb->bs->opaque;
+
+    if (s->has_write_zeroes == 0) {
+        return -ENOTSUP;
+    }
+
+    if (aiocb->aio_type & QEMU_AIO_BLKDEV) {
+#ifdef BLKZEROOUT
+        do {
+            uint64_t range[2] = { aiocb->aio_offset, aiocb->aio_nbytes };
+            if (ioctl(aiocb->aio_fildes, BLKZEROOUT, range) == 0) {
+                return 0;
+            }
+        } while (errno == EINTR);
+
+        ret = -errno;
+#endif
+    } else {
+#ifdef CONFIG_XFS
+        if (s->is_xfs) {
+            return xfs_write_zeroes(s, aiocb->aio_offset, aiocb->aio_nbytes);
+        }
+#endif
+    }
+
+    if (ret == -ENODEV || ret == -ENOSYS || ret == -EOPNOTSUPP ||
+        ret == -ENOTTY) {
+        s->has_write_zeroes = false;
+        ret = -ENOTSUP;
+    }
+    return ret;
+}
+
 static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
 {
     int ret = -EOPNOTSUPP;
@@ -805,6 +862,9 @@ static int aio_worker(void *arg)
     case QEMU_AIO_DISCARD:
         ret = handle_aiocb_discard(aiocb);
         break;
+    case QEMU_AIO_WRITE_ZEROES:
+        ret = handle_aiocb_write_zeroes(aiocb);
+        break;
     default:
         fprintf(stderr, "invalid aio request (0x%x)\n", aiocb->aio_type);
         ret = -EINVAL;
@@ -1257,13 +1317,13 @@ static int coroutine_fn raw_co_write_zeroes(
     BDRVRawState *s = bs->opaque;
 
     if (!(flags & BDRV_REQ_MAY_UNMAP)) {
-        return -ENOTSUP;
-    }
-    if (!s->discard_zeroes) {
-        return -ENOTSUP;
+        return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                              QEMU_AIO_WRITE_ZEROES);
+    } else if (s->discard_zeroes) {
+        return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                              QEMU_AIO_DISCARD);
     }
-    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
-                          QEMU_AIO_DISCARD);
+    return -ENOTSUP;
 }
 
 static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
@@ -1614,13 +1674,13 @@ static coroutine_fn int hdev_co_write_zeroes(BlockDriverState *bs,
         return rc;
     }
     if (!(flags & BDRV_REQ_MAY_UNMAP)) {
-        return -ENOTSUP;
-    }
-    if (!s->discard_zeroes) {
-        return -ENOTSUP;
+        return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                              QEMU_AIO_WRITE_ZEROES|QEMU_AIO_BLKDEV);
+    } else if (s->discard_zeroes) {
+        return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
+                              QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
     }
-    return paio_submit_co(bs, s->fd, sector_num, NULL, nb_sectors,
-                          QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
+    return -ENOTSUP;
 }
 
 static int hdev_create(const char *filename, QEMUOptionParameter *options,
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 17/20] qemu-iotests: 033 is fast
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (15 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 16/20] raw-posix: add support for write_zeroes on XFS and " Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 18/20] scsi-disk: catch write protection errors in UNMAP Paolo Bonzini
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 tests/qemu-iotests/group | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index b18b241..4f86046 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -39,7 +39,7 @@
 030 rw auto backing
 031 rw auto quick
 032 rw auto
-033 rw auto
+033 rw auto quick
 034 rw auto backing
 035 rw auto quick
 036 rw auto quick
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 18/20] scsi-disk: catch write protection errors in UNMAP
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (16 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 17/20] qemu-iotests: 033 is fast Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 19/20] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME Paolo Bonzini
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

This is the same that is already done for WRITE SAME.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/scsi-disk.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 74e6a14..4138268 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1543,6 +1543,7 @@ done:
 
 static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 {
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
     uint8_t *p = inbuf;
     int len = r->req.cmd.xfer;
     UnmapCBData *data;
@@ -1560,6 +1561,11 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
         goto invalid_param_len;
     }
 
+    if (bdrv_is_read_only(s->qdev.conf.bs)) {
+        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
+        return;
+    }
+
     data = g_new0(UnmapCBData, 1);
     data->r = r;
     data->inbuf = &p[8];
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 19/20] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (17 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 18/20] scsi-disk: catch write protection errors in UNMAP Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME Paolo Bonzini
  19 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Since we report ANC_SUP==0 in VPD page B2h, we need to return
an error (ILLEGAL REQUEST/INVALID FIELD IN CDB) for all WRITE SAME
requests with ANCHOR==1.

Inspired by a similar patch to the LIO in-kernel target.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/scsi-disk.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 4138268..0640bb0 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1548,6 +1548,11 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
     int len = r->req.cmd.xfer;
     UnmapCBData *data;
 
+    /* Reject ANCHOR=1.  */
+    if (r->req.cmd.buf[1] & 0x1) {
+        goto invalid_field;
+    }
+
     if (len < 8) {
         goto invalid_param_len;
     }
@@ -1578,6 +1583,10 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 
 invalid_param_len:
     scsi_check_condition(r, SENSE_CODE(INVALID_PARAM_LEN));
+    return;
+
+invalid_field:
+    scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 
 static void scsi_disk_emulate_write_data(SCSIRequest *req)
@@ -1856,8 +1865,9 @@ static int32_t scsi_disk_emulate_command(SCSIRequest *req, uint8_t *buf)
 
         /*
          * We only support WRITE SAME with the unmap bit set for now.
+         * Reject UNMAP=0 or ANCHOR=1.
          */
-        if (!(req->cmd.buf[1] & 0x8)) {
+        if (!(req->cmd.buf[1] & 0x8) || (req->cmd.buf[1] & 0x10)) {
             goto illegal_request;
         }
 
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME
  2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
                   ` (18 preceding siblings ...)
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 19/20] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands Paolo Bonzini
@ 2013-11-19 17:07 ` Paolo Bonzini
  2013-11-19 17:23   ` ronnie sahlberg
  2013-11-20 14:18   ` Stefan Hajnoczi
  19 siblings, 2 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pl, stefanha

Fetch the data to be written from the input buffer.  If it is all zeroes,
we can use the write_zeroes call (possibly with the new MAY_UNMAP flag).
Otherwise, do as many write cycles as needed, writing 512k at a time.

Strictly speaking, this is still incorrect because a zero cluster should
only be written if the MAY_UNMAP flag is set.  But this is a bug in qcow2
and the other formats, not in the SCSI code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/scsi-disk.c | 140 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 116 insertions(+), 24 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 0640bb0..4cc6a28 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -41,6 +41,7 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
 #include <scsi/sg.h>
 #endif
 
+#define SCSI_WRITE_SAME_MAX         524288
 #define SCSI_DMA_BUF_SIZE           131072
 #define SCSI_MAX_INQUIRY_LEN        256
 #define SCSI_MAX_MODE_LEN           256
@@ -634,6 +635,8 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)
             buflen = 0x40;
             memset(outbuf + 4, 0, buflen - 4);
 
+            outbuf[4] = 0x1; /* wsnz */
+
             /* optimal transfer length granularity */
             outbuf[6] = (min_io_size >> 8) & 0xff;
             outbuf[7] = min_io_size & 0xff;
@@ -1589,6 +1592,111 @@ invalid_field:
     scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 
+typedef struct WriteSameCBData {
+    SCSIDiskReq *r;
+    int64_t sector;
+    int nb_sectors;
+    QEMUIOVector qiov;
+    struct iovec iov;
+} WriteSameCBData;
+
+static void scsi_write_same_complete(void *opaque, int ret)
+{
+    WriteSameCBData *data = opaque;
+    SCSIDiskReq *r = data->r;
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
+
+    assert(r->req.aiocb != NULL);
+    r->req.aiocb = NULL;
+    bdrv_acct_done(s->qdev.conf.bs, &r->acct);
+    if (r->req.io_canceled) {
+        goto done;
+    }
+
+    if (ret < 0) {
+        if (scsi_handle_rw_error(r, -ret)) {
+            goto done;
+        }
+    }
+
+    data->nb_sectors -= data->iov.iov_len / 512;
+    data->sector += data->iov.iov_len / 512;
+    data->iov.iov_len = MIN(data->nb_sectors * 512, data->iov.iov_len);
+    if (data->iov.iov_len) {
+        bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
+        r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
+                                       &data->qiov, data->iov.iov_len / 512,
+                                       scsi_write_same_complete, r);
+        return;
+    }
+
+    scsi_req_complete(&r->req, GOOD);
+
+done:
+    if (!r->req.io_canceled) {
+        scsi_req_unref(&r->req);
+    }
+    g_free(data->iov.iov_base);
+    g_free(data);
+}
+
+static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
+{
+    SCSIRequest *req = &r->req;
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
+    uint32_t nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
+    WriteSameCBData *data;
+    uint8_t *buf;
+    int i;
+
+    /* Fail if PBDATA=1 or LBDATA=1 or ANCHOR=1.  */
+    if (nb_sectors == 0 || (req->cmd.buf[1] & 0x16)) {
+        scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
+        return;
+    }
+
+    if (bdrv_is_read_only(s->qdev.conf.bs)) {
+        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
+        return;
+    }
+    if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
+        scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
+        return;
+    }
+
+    if (buffer_is_zero(inbuf, s->qdev.blocksize)) {
+        int flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
+
+        /* The request is used as the AIO opaque value, so add a ref.  */
+        scsi_req_ref(&r->req);
+        bdrv_acct_start(s->qdev.conf.bs, &r->acct, nb_sectors * s->qdev.blocksize,
+                        BDRV_ACCT_WRITE);
+        r->req.aiocb = bdrv_aio_write_zeroes(s->qdev.conf.bs,
+                                             r->req.cmd.lba * (s->qdev.blocksize / 512),
+                                             nb_sectors * (s->qdev.blocksize / 512),
+                                             flags, scsi_aio_complete, r);
+        return;
+    }
+
+    data = g_new0(WriteSameCBData, 1);
+    data->r = r;
+    data->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
+    data->nb_sectors = nb_sectors * (s->qdev.blocksize / 512);
+    data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);
+    data->iov.iov_base = buf = g_malloc(data->iov.iov_len);
+    qemu_iovec_init_external(&data->qiov, &data->iov, 1);
+
+    for (i = 0; i < data->iov.iov_len; i += s->qdev.blocksize) {
+        memcpy(&buf[i], inbuf, s->qdev.blocksize);
+    }
+
+    scsi_req_ref(&r->req);
+    bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
+    r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
+                                   &data->qiov, data->iov.iov_len / 512,
+                                   scsi_write_same_complete, data);
+}
+
 static void scsi_disk_emulate_write_data(SCSIRequest *req)
 {
     SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
@@ -1612,6 +1720,10 @@ static void scsi_disk_emulate_write_data(SCSIRequest *req)
         scsi_disk_emulate_unmap(r, r->iov.iov_base);
         break;
 
+    case WRITE_SAME_10:
+    case WRITE_SAME_16:
+        scsi_disk_emulate_write_same(r, r->iov.iov_base);
+        break;
     default:
         abort();
     }
@@ -1854,30 +1966,10 @@ static int32_t scsi_disk_emulate_command(SCSIRequest *req, uint8_t *buf)
         break;
     case WRITE_SAME_10:
     case WRITE_SAME_16:
-        nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
-        if (bdrv_is_read_only(s->qdev.conf.bs)) {
-            scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
-            return 0;
-        }
-        if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
-            goto illegal_lba;
-        }
-
-        /*
-         * We only support WRITE SAME with the unmap bit set for now.
-         * Reject UNMAP=0 or ANCHOR=1.
-         */
-        if (!(req->cmd.buf[1] & 0x8) || (req->cmd.buf[1] & 0x10)) {
-            goto illegal_request;
-        }
-
-        /* The request is used as the AIO opaque value, so add a ref.  */
-        scsi_req_ref(&r->req);
-        r->req.aiocb = bdrv_aio_discard(s->qdev.conf.bs,
-                                        r->req.cmd.lba * (s->qdev.blocksize / 512),
-                                        nb_sectors * (s->qdev.blocksize / 512),
-                                        scsi_aio_complete, r);
-        return 0;
+        DPRINTF("WRITE SAME %d (len %lu)\n",
+                req->cmd.buf[0] == WRITE_SAME_10 ? 10 : 16,
+                (long)r->req.cmd.xfer);
+        break;
     default:
         DPRINTF("Unknown SCSI command (%2.2x)\n", buf[0]);
         scsi_check_condition(r, SENSE_CODE(INVALID_OPCODE));
-- 
1.8.4.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME Paolo Bonzini
@ 2013-11-19 17:23   ` ronnie sahlberg
  2013-11-19 17:27     ` ronnie sahlberg
  2013-11-19 17:31     ` Paolo Bonzini
  2013-11-20 14:18   ` Stefan Hajnoczi
  1 sibling, 2 replies; 42+ messages in thread
From: ronnie sahlberg @ 2013-11-19 17:23 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi

+#define SCSI_WRITE_SAME_MAX         524288
...
+    data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);

I don't  think you should just clamp the data to 512k, instead I think
you should report the 512k max write same size through
BlockLimitsVPD/MaximumWriteSameLength to the initiator.
Then instead of clamping the write to MIN()  you return a check
condition if the initiator sends too much.

On Tue, Nov 19, 2013 at 9:07 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Fetch the data to be written from the input buffer.  If it is all zeroes,
> we can use the write_zeroes call (possibly with the new MAY_UNMAP flag).
> Otherwise, do as many write cycles as needed, writing 512k at a time.
>
> Strictly speaking, this is still incorrect because a zero cluster should
> only be written if the MAY_UNMAP flag is set.  But this is a bug in qcow2
> and the other formats, not in the SCSI code.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/scsi/scsi-disk.c | 140 +++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 116 insertions(+), 24 deletions(-)
>
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index 0640bb0..4cc6a28 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -41,6 +41,7 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
>  #include <scsi/sg.h>
>  #endif
>
> +#define SCSI_WRITE_SAME_MAX         524288
>  #define SCSI_DMA_BUF_SIZE           131072
>  #define SCSI_MAX_INQUIRY_LEN        256
>  #define SCSI_MAX_MODE_LEN           256
> @@ -634,6 +635,8 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)
>              buflen = 0x40;
>              memset(outbuf + 4, 0, buflen - 4);
>
> +            outbuf[4] = 0x1; /* wsnz */
> +
>              /* optimal transfer length granularity */
>              outbuf[6] = (min_io_size >> 8) & 0xff;
>              outbuf[7] = min_io_size & 0xff;
> @@ -1589,6 +1592,111 @@ invalid_field:
>      scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
>  }
>
> +typedef struct WriteSameCBData {
> +    SCSIDiskReq *r;
> +    int64_t sector;
> +    int nb_sectors;
> +    QEMUIOVector qiov;
> +    struct iovec iov;
> +} WriteSameCBData;
> +
> +static void scsi_write_same_complete(void *opaque, int ret)
> +{
> +    WriteSameCBData *data = opaque;
> +    SCSIDiskReq *r = data->r;
> +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
> +
> +    assert(r->req.aiocb != NULL);
> +    r->req.aiocb = NULL;
> +    bdrv_acct_done(s->qdev.conf.bs, &r->acct);
> +    if (r->req.io_canceled) {
> +        goto done;
> +    }
> +
> +    if (ret < 0) {
> +        if (scsi_handle_rw_error(r, -ret)) {
> +            goto done;
> +        }
> +    }
> +
> +    data->nb_sectors -= data->iov.iov_len / 512;
> +    data->sector += data->iov.iov_len / 512;
> +    data->iov.iov_len = MIN(data->nb_sectors * 512, data->iov.iov_len);
> +    if (data->iov.iov_len) {
> +        bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
> +        r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
> +                                       &data->qiov, data->iov.iov_len / 512,
> +                                       scsi_write_same_complete, r);
> +        return;
> +    }
> +
> +    scsi_req_complete(&r->req, GOOD);
> +
> +done:
> +    if (!r->req.io_canceled) {
> +        scsi_req_unref(&r->req);
> +    }
> +    g_free(data->iov.iov_base);
> +    g_free(data);
> +}
> +
> +static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
> +{
> +    SCSIRequest *req = &r->req;
> +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +    uint32_t nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
> +    WriteSameCBData *data;
> +    uint8_t *buf;
> +    int i;
> +
> +    /* Fail if PBDATA=1 or LBDATA=1 or ANCHOR=1.  */
> +    if (nb_sectors == 0 || (req->cmd.buf[1] & 0x16)) {
> +        scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
> +        return;
> +    }
> +
> +    if (bdrv_is_read_only(s->qdev.conf.bs)) {
> +        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
> +        return;
> +    }
> +    if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
> +        scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
> +        return;
> +    }
> +
> +    if (buffer_is_zero(inbuf, s->qdev.blocksize)) {
> +        int flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
> +
> +        /* The request is used as the AIO opaque value, so add a ref.  */
> +        scsi_req_ref(&r->req);
> +        bdrv_acct_start(s->qdev.conf.bs, &r->acct, nb_sectors * s->qdev.blocksize,
> +                        BDRV_ACCT_WRITE);
> +        r->req.aiocb = bdrv_aio_write_zeroes(s->qdev.conf.bs,
> +                                             r->req.cmd.lba * (s->qdev.blocksize / 512),
> +                                             nb_sectors * (s->qdev.blocksize / 512),
> +                                             flags, scsi_aio_complete, r);
> +        return;
> +    }
> +
> +    data = g_new0(WriteSameCBData, 1);
> +    data->r = r;
> +    data->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
> +    data->nb_sectors = nb_sectors * (s->qdev.blocksize / 512);
> +    data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);
> +    data->iov.iov_base = buf = g_malloc(data->iov.iov_len);
> +    qemu_iovec_init_external(&data->qiov, &data->iov, 1);
> +
> +    for (i = 0; i < data->iov.iov_len; i += s->qdev.blocksize) {
> +        memcpy(&buf[i], inbuf, s->qdev.blocksize);
> +    }
> +
> +    scsi_req_ref(&r->req);
> +    bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
> +    r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
> +                                   &data->qiov, data->iov.iov_len / 512,
> +                                   scsi_write_same_complete, data);
> +}
> +
>  static void scsi_disk_emulate_write_data(SCSIRequest *req)
>  {
>      SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
> @@ -1612,6 +1720,10 @@ static void scsi_disk_emulate_write_data(SCSIRequest *req)
>          scsi_disk_emulate_unmap(r, r->iov.iov_base);
>          break;
>
> +    case WRITE_SAME_10:
> +    case WRITE_SAME_16:
> +        scsi_disk_emulate_write_same(r, r->iov.iov_base);
> +        break;
>      default:
>          abort();
>      }
> @@ -1854,30 +1966,10 @@ static int32_t scsi_disk_emulate_command(SCSIRequest *req, uint8_t *buf)
>          break;
>      case WRITE_SAME_10:
>      case WRITE_SAME_16:
> -        nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
> -        if (bdrv_is_read_only(s->qdev.conf.bs)) {
> -            scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
> -            return 0;
> -        }
> -        if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
> -            goto illegal_lba;
> -        }
> -
> -        /*
> -         * We only support WRITE SAME with the unmap bit set for now.
> -         * Reject UNMAP=0 or ANCHOR=1.
> -         */
> -        if (!(req->cmd.buf[1] & 0x8) || (req->cmd.buf[1] & 0x10)) {
> -            goto illegal_request;
> -        }
> -
> -        /* The request is used as the AIO opaque value, so add a ref.  */
> -        scsi_req_ref(&r->req);
> -        r->req.aiocb = bdrv_aio_discard(s->qdev.conf.bs,
> -                                        r->req.cmd.lba * (s->qdev.blocksize / 512),
> -                                        nb_sectors * (s->qdev.blocksize / 512),
> -                                        scsi_aio_complete, r);
> -        return 0;
> +        DPRINTF("WRITE SAME %d (len %lu)\n",
> +                req->cmd.buf[0] == WRITE_SAME_10 ? 10 : 16,
> +                (long)r->req.cmd.xfer);
> +        break;
>      default:
>          DPRINTF("Unknown SCSI command (%2.2x)\n", buf[0]);
>          scsi_check_condition(r, SENSE_CODE(INVALID_OPCODE));
> --
> 1.8.4.2
>
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME
  2013-11-19 17:23   ` ronnie sahlberg
@ 2013-11-19 17:27     ` ronnie sahlberg
  2013-11-19 17:31     ` Paolo Bonzini
  1 sibling, 0 replies; 42+ messages in thread
From: ronnie sahlberg @ 2013-11-19 17:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi

That means the initiator will do the "split into smaller manageable
chunks" for you and you get a 1-to-1 mapping between WS10/16 that the
initiator issues to qemu and the write-same calls that qemu generates.

On Tue, Nov 19, 2013 at 9:23 AM, ronnie sahlberg
<ronniesahlberg@gmail.com> wrote:
> +#define SCSI_WRITE_SAME_MAX         524288
> ...
> +    data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);
>
> I don't  think you should just clamp the data to 512k, instead I think
> you should report the 512k max write same size through
> BlockLimitsVPD/MaximumWriteSameLength to the initiator.
> Then instead of clamping the write to MIN()  you return a check
> condition if the initiator sends too much.
>
> On Tue, Nov 19, 2013 at 9:07 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Fetch the data to be written from the input buffer.  If it is all zeroes,
>> we can use the write_zeroes call (possibly with the new MAY_UNMAP flag).
>> Otherwise, do as many write cycles as needed, writing 512k at a time.
>>
>> Strictly speaking, this is still incorrect because a zero cluster should
>> only be written if the MAY_UNMAP flag is set.  But this is a bug in qcow2
>> and the other formats, not in the SCSI code.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  hw/scsi/scsi-disk.c | 140 +++++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 116 insertions(+), 24 deletions(-)
>>
>> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
>> index 0640bb0..4cc6a28 100644
>> --- a/hw/scsi/scsi-disk.c
>> +++ b/hw/scsi/scsi-disk.c
>> @@ -41,6 +41,7 @@ do { printf("scsi-disk: " fmt , ## __VA_ARGS__); } while (0)
>>  #include <scsi/sg.h>
>>  #endif
>>
>> +#define SCSI_WRITE_SAME_MAX         524288
>>  #define SCSI_DMA_BUF_SIZE           131072
>>  #define SCSI_MAX_INQUIRY_LEN        256
>>  #define SCSI_MAX_MODE_LEN           256
>> @@ -634,6 +635,8 @@ static int scsi_disk_emulate_inquiry(SCSIRequest *req, uint8_t *outbuf)
>>              buflen = 0x40;
>>              memset(outbuf + 4, 0, buflen - 4);
>>
>> +            outbuf[4] = 0x1; /* wsnz */
>> +
>>              /* optimal transfer length granularity */
>>              outbuf[6] = (min_io_size >> 8) & 0xff;
>>              outbuf[7] = min_io_size & 0xff;
>> @@ -1589,6 +1592,111 @@ invalid_field:
>>      scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
>>  }
>>
>> +typedef struct WriteSameCBData {
>> +    SCSIDiskReq *r;
>> +    int64_t sector;
>> +    int nb_sectors;
>> +    QEMUIOVector qiov;
>> +    struct iovec iov;
>> +} WriteSameCBData;
>> +
>> +static void scsi_write_same_complete(void *opaque, int ret)
>> +{
>> +    WriteSameCBData *data = opaque;
>> +    SCSIDiskReq *r = data->r;
>> +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
>> +
>> +    assert(r->req.aiocb != NULL);
>> +    r->req.aiocb = NULL;
>> +    bdrv_acct_done(s->qdev.conf.bs, &r->acct);
>> +    if (r->req.io_canceled) {
>> +        goto done;
>> +    }
>> +
>> +    if (ret < 0) {
>> +        if (scsi_handle_rw_error(r, -ret)) {
>> +            goto done;
>> +        }
>> +    }
>> +
>> +    data->nb_sectors -= data->iov.iov_len / 512;
>> +    data->sector += data->iov.iov_len / 512;
>> +    data->iov.iov_len = MIN(data->nb_sectors * 512, data->iov.iov_len);
>> +    if (data->iov.iov_len) {
>> +        bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
>> +        r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
>> +                                       &data->qiov, data->iov.iov_len / 512,
>> +                                       scsi_write_same_complete, r);
>> +        return;
>> +    }
>> +
>> +    scsi_req_complete(&r->req, GOOD);
>> +
>> +done:
>> +    if (!r->req.io_canceled) {
>> +        scsi_req_unref(&r->req);
>> +    }
>> +    g_free(data->iov.iov_base);
>> +    g_free(data);
>> +}
>> +
>> +static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
>> +{
>> +    SCSIRequest *req = &r->req;
>> +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
>> +    uint32_t nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
>> +    WriteSameCBData *data;
>> +    uint8_t *buf;
>> +    int i;
>> +
>> +    /* Fail if PBDATA=1 or LBDATA=1 or ANCHOR=1.  */
>> +    if (nb_sectors == 0 || (req->cmd.buf[1] & 0x16)) {
>> +        scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
>> +        return;
>> +    }
>> +
>> +    if (bdrv_is_read_only(s->qdev.conf.bs)) {
>> +        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
>> +        return;
>> +    }
>> +    if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
>> +        scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
>> +        return;
>> +    }
>> +
>> +    if (buffer_is_zero(inbuf, s->qdev.blocksize)) {
>> +        int flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
>> +
>> +        /* The request is used as the AIO opaque value, so add a ref.  */
>> +        scsi_req_ref(&r->req);
>> +        bdrv_acct_start(s->qdev.conf.bs, &r->acct, nb_sectors * s->qdev.blocksize,
>> +                        BDRV_ACCT_WRITE);
>> +        r->req.aiocb = bdrv_aio_write_zeroes(s->qdev.conf.bs,
>> +                                             r->req.cmd.lba * (s->qdev.blocksize / 512),
>> +                                             nb_sectors * (s->qdev.blocksize / 512),
>> +                                             flags, scsi_aio_complete, r);
>> +        return;
>> +    }
>> +
>> +    data = g_new0(WriteSameCBData, 1);
>> +    data->r = r;
>> +    data->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
>> +    data->nb_sectors = nb_sectors * (s->qdev.blocksize / 512);
>> +    data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);
>> +    data->iov.iov_base = buf = g_malloc(data->iov.iov_len);
>> +    qemu_iovec_init_external(&data->qiov, &data->iov, 1);
>> +
>> +    for (i = 0; i < data->iov.iov_len; i += s->qdev.blocksize) {
>> +        memcpy(&buf[i], inbuf, s->qdev.blocksize);
>> +    }
>> +
>> +    scsi_req_ref(&r->req);
>> +    bdrv_acct_start(s->qdev.conf.bs, &r->acct, data->iov.iov_len, BDRV_ACCT_WRITE);
>> +    r->req.aiocb = bdrv_aio_writev(s->qdev.conf.bs, data->sector,
>> +                                   &data->qiov, data->iov.iov_len / 512,
>> +                                   scsi_write_same_complete, data);
>> +}
>> +
>>  static void scsi_disk_emulate_write_data(SCSIRequest *req)
>>  {
>>      SCSIDiskReq *r = DO_UPCAST(SCSIDiskReq, req, req);
>> @@ -1612,6 +1720,10 @@ static void scsi_disk_emulate_write_data(SCSIRequest *req)
>>          scsi_disk_emulate_unmap(r, r->iov.iov_base);
>>          break;
>>
>> +    case WRITE_SAME_10:
>> +    case WRITE_SAME_16:
>> +        scsi_disk_emulate_write_same(r, r->iov.iov_base);
>> +        break;
>>      default:
>>          abort();
>>      }
>> @@ -1854,30 +1966,10 @@ static int32_t scsi_disk_emulate_command(SCSIRequest *req, uint8_t *buf)
>>          break;
>>      case WRITE_SAME_10:
>>      case WRITE_SAME_16:
>> -        nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
>> -        if (bdrv_is_read_only(s->qdev.conf.bs)) {
>> -            scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
>> -            return 0;
>> -        }
>> -        if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
>> -            goto illegal_lba;
>> -        }
>> -
>> -        /*
>> -         * We only support WRITE SAME with the unmap bit set for now.
>> -         * Reject UNMAP=0 or ANCHOR=1.
>> -         */
>> -        if (!(req->cmd.buf[1] & 0x8) || (req->cmd.buf[1] & 0x10)) {
>> -            goto illegal_request;
>> -        }
>> -
>> -        /* The request is used as the AIO opaque value, so add a ref.  */
>> -        scsi_req_ref(&r->req);
>> -        r->req.aiocb = bdrv_aio_discard(s->qdev.conf.bs,
>> -                                        r->req.cmd.lba * (s->qdev.blocksize / 512),
>> -                                        nb_sectors * (s->qdev.blocksize / 512),
>> -                                        scsi_aio_complete, r);
>> -        return 0;
>> +        DPRINTF("WRITE SAME %d (len %lu)\n",
>> +                req->cmd.buf[0] == WRITE_SAME_10 ? 10 : 16,
>> +                (long)r->req.cmd.xfer);
>> +        break;
>>      default:
>>          DPRINTF("Unknown SCSI command (%2.2x)\n", buf[0]);
>>          scsi_check_condition(r, SENSE_CODE(INVALID_OPCODE));
>> --
>> 1.8.4.2
>>
>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME
  2013-11-19 17:23   ` ronnie sahlberg
  2013-11-19 17:27     ` ronnie sahlberg
@ 2013-11-19 17:31     ` Paolo Bonzini
  1 sibling, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-19 17:31 UTC (permalink / raw)
  To: ronnie sahlberg; +Cc: Kevin Wolf, Peter Lieven, qemu-devel, Stefan Hajnoczi

Il 19/11/2013 18:23, ronnie sahlberg ha scritto:
> +#define SCSI_WRITE_SAME_MAX         524288
> ...
> +    data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);
> 
> I don't  think you should just clamp the data to 512k, instead I think
> you should report the 512k max write same size through
> BlockLimitsVPD/MaximumWriteSameLength to the initiator.
> Then instead of clamping the write to MIN()  you return a check
> condition if the initiator sends too much.

This would not work if the BlockDriverState has a large
write_same_alignment (e.g. a qcow2 file with 1 MB cluster size).

The only purpose of the above clamping is to avoid using too much memory
when emulating WRITE SAME; there is no intrinsic limit in the length of
a WRITE SAME command that QEMU supports.  There is no particular need to
have 1:1 mapping from guest to host, for two reasons:

(1) even though this is not a fast path, every additional round trip hurts

(2) the code in this patch ensures that the 512 kb writes are
serialized.  If the guest works around the maximum WRITE SAME length by
sending many parallel requests, there will be no speed improvement, only
a lot more stress on the storage.

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 03/20] block: add flags argument to bdrv_co_write_zeroes tracepoint
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 03/20] block: add flags argument to bdrv_co_write_zeroes tracepoint Paolo Bonzini
@ 2013-11-20  9:59   ` Stefan Hajnoczi
  0 siblings, 0 replies; 42+ messages in thread
From: Stefan Hajnoczi @ 2013-11-20  9:59 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, pl, qemu-devel, stefanha

On Tue, Nov 19, 2013 at 06:07:26PM +0100, Paolo Bonzini wrote:
> diff --git a/trace-events b/trace-events
> index 8695e9e..6175b61 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -64,7 +64,7 @@ bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
>  bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>  bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
>  bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
> -bdrv_co_write_zeroes(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
> +bdrv_co_write_zeroes(void *bs, int64_t sector_num, int nb_sector, int flags) "bs %p sector_num %"PRId64" nb_sectors %d flags %d"

Please use %#x instead of %d so bit values are easy to read.

Stefan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 04/20] block: add bdrv_aio_write_zeroes
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 04/20] block: add bdrv_aio_write_zeroes Paolo Bonzini
@ 2013-11-20 10:02   ` Stefan Hajnoczi
  0 siblings, 0 replies; 42+ messages in thread
From: Stefan Hajnoczi @ 2013-11-20 10:02 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, pl, qemu-devel, stefanha

On Tue, Nov 19, 2013 at 06:07:27PM +0100, Paolo Bonzini wrote:
> diff --git a/trace-events b/trace-events
> index 6175b61..a18ea59 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -60,6 +60,7 @@ bdrv_aio_discard(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs
>  bdrv_aio_flush(void *bs, void *opaque) "bs %p opaque %p"
>  bdrv_aio_readv(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d opaque %p"
>  bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d opaque %p"
> +bdrv_aio_write_zeroes(void *bs, int64_t sector_num, int nb_sectors, int flags, void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d flags %d opaque %p"

Please use %#x for flags

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests Paolo Bonzini
@ 2013-11-20 10:22   ` Stefan Hajnoczi
  2013-11-20 11:01     ` Paolo Bonzini
  2013-11-21 11:30   ` Peter Lieven
  1 sibling, 1 reply; 42+ messages in thread
From: Stefan Hajnoczi @ 2013-11-20 10:22 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, pl, qemu-devel, stefanha

On Tue, Nov 19, 2013 at 06:07:29PM +0100, Paolo Bonzini wrote:
> @@ -2761,14 +2761,19 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>      while (nb_sectors > 0 && !ret) {
>          int num = nb_sectors;
>  
> -        /* align request */
> -        if (bs->bl.write_zeroes_alignment &&
> -            num >= bs->bl.write_zeroes_alignment &&
> -            sector_num % bs->bl.write_zeroes_alignment) {
> -            if (num > bs->bl.write_zeroes_alignment) {
> +        /* Align request.  Block drivers can expect the "bulk" of the request
> +         * to be aligned.
> +         */
> +        if (bs->bl.write_zeroes_alignment
> +            && num > bs->bl.write_zeroes_alignment) {

Here '>' is used...

> +            if (sector_num % bs->bl.write_zeroes_alignment != 0) {
> +                /* Make a small request up to the first aligned sector.  */
>                  num = bs->bl.write_zeroes_alignment;
> +                num -= sector_num % bs->bl.write_zeroes_alignment;
> +            } else if (num >= bs->bl.write_zeroes_alignment) {

...but here '>=' is used.

The == case here cannot happen.  Did you mean '>=' in both places?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests
  2013-11-20 10:22   ` Stefan Hajnoczi
@ 2013-11-20 11:01     ` Paolo Bonzini
  2013-11-20 14:29       ` Stefan Hajnoczi
  0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-20 11:01 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, pl, qemu-devel, stefanha

Il 20/11/2013 11:22, Stefan Hajnoczi ha scritto:
>> > +            && num > bs->bl.write_zeroes_alignment) {
> Here '>' is used...
> 
>> > +            if (sector_num % bs->bl.write_zeroes_alignment != 0) {
>> > +                /* Make a small request up to the first aligned sector.  */
>> >                  num = bs->bl.write_zeroes_alignment;
>> > +                num -= sector_num % bs->bl.write_zeroes_alignment;
>> > +            } else if (num >= bs->bl.write_zeroes_alignment) {
> ...but here '>=' is used.
> 
> The == case here cannot happen.  Did you mean '>=' in both places?

I meant what I wrote, in the sense that the two "if"s make sense the
way I wrote them.  However, it is not too clear indeed.

>         if (bs->bl.write_zeroes_alignment
>             && num > bs->bl.write_zeroes_alignment) {

Here, '>' is a necessary (though not sufficient) for being able to
split the operation in one misaligned request and one aligned request.

If '<', the request is misaligned in either the starting sector
or the ending sector (or both), and there's no need to split it.

If '==', either the request is aligned or we can only split it
in two parts but they remain misaligned.  In either case there's
no need to do anyting.

Because the condition is not sufficient, we may end up splitting a
request that cannot be aligned anyway (e.g. sector_num = 1, num = 129,
alignment = 128; will be split into [1,128) and [128,130) which are
both misaligned.  This is not important.

>             if (sector_num % bs->bl.write_zeroes_alignment != 0) {
>                 /* Make a small request up to the first aligned sector.  */
>                 num = bs->bl.write_zeroes_alignment;
>                 num -= sector_num % bs->bl.write_zeroes_alignment;
>             } else if (num >= bs->bl.write_zeroes_alignment) {
>                 /* Shorten the request to the last aligned sector.  */
>                 num -= (sector_num + num) % bs->bl.write_zeroes_alignment;

Here, the "if" checks that we can have no underflow in the subtraction.
However, in addition to what you pointed out, it's not immediately obvious
that the subtraction has no effect if sector_num and num are correctly aligned.

So I will rewrite the "if" this way:

            if (sector_num % bs->bl.write_zeroes_alignment != 0) {
                /* Make a small request up to the first aligned sector.  */
                num = bs->bl.write_zeroes_alignment;
                num -= sector_num % bs->bl.write_zeroes_alignment;
            } else if ((sector_num + num) % bs->bl.write_zeroes_alignment != 0) {
                /* Shorten the request to the last aligned sector.  num cannot
                 * underflow because num > bs->bl.write_zeroes_alignment.
                 */
                num -= (sector_num + num) % bs->bl.write_zeroes_alignment;
            }

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 07/20] vpc, vhdx: add get_info
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 07/20] vpc, vhdx: add get_info Paolo Bonzini
@ 2013-11-20 12:39   ` Stefan Hajnoczi
  2013-11-20 12:50     ` Paolo Bonzini
  0 siblings, 1 reply; 42+ messages in thread
From: Stefan Hajnoczi @ 2013-11-20 12:39 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, pl, qemu-devel, stefanha

On Tue, Nov 19, 2013 at 06:07:30PM +0100, Paolo Bonzini wrote:
> +static int vhdx_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
> +{
> +    BDRVVHDXState *s = bs->opaque;
> +
> +    bdi->cluster_size =
> +        (s->logical_sector_size / BDRV_SECTOR_SIZE) * s->block_size;

I thought s->block_size is in bytes.  Why multiply by the logical block
size in 512-byte sectors?

Stefan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 07/20] vpc, vhdx: add get_info
  2013-11-20 12:39   ` Stefan Hajnoczi
@ 2013-11-20 12:50     ` Paolo Bonzini
  0 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-20 12:50 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, pl, qemu-devel, stefanha

Il 20/11/2013 13:39, Stefan Hajnoczi ha scritto:
>> > +
>> > +    bdi->cluster_size =
>> > +        (s->logical_sector_size / BDRV_SECTOR_SIZE) * s->block_size;
> I thought s->block_size is in bytes.  Why multiply by the logical block
> size in 512-byte sectors?

My mistake.

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME Paolo Bonzini
  2013-11-19 17:23   ` ronnie sahlberg
@ 2013-11-20 14:18   ` Stefan Hajnoczi
  2013-11-20 14:19     ` Paolo Bonzini
  1 sibling, 1 reply; 42+ messages in thread
From: Stefan Hajnoczi @ 2013-11-20 14:18 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, pl, qemu-devel, stefanha

On Tue, Nov 19, 2013 at 06:07:43PM +0100, Paolo Bonzini wrote:
> +static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
> +{
> +    SCSIRequest *req = &r->req;
> +    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, req->dev);
> +    uint32_t nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
> +    WriteSameCBData *data;
> +    uint8_t *buf;
> +    int i;
> +
> +    /* Fail if PBDATA=1 or LBDATA=1 or ANCHOR=1.  */
> +    if (nb_sectors == 0 || (req->cmd.buf[1] & 0x16)) {
> +        scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
> +        return;
> +    }
> +
> +    if (bdrv_is_read_only(s->qdev.conf.bs)) {
> +        scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
> +        return;
> +    }
> +    if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
> +        scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
> +        return;
> +    }
> +
> +    if (buffer_is_zero(inbuf, s->qdev.blocksize)) {

Where is inbuf's size checked?  It must be s->qdev.blocksize for this
code to be correct.

> +        int flags = (req->cmd.buf[1] & 0x8) ? BDRV_REQ_MAY_UNMAP : 0;
> +
> +        /* The request is used as the AIO opaque value, so add a ref.  */
> +        scsi_req_ref(&r->req);
> +        bdrv_acct_start(s->qdev.conf.bs, &r->acct, nb_sectors * s->qdev.blocksize,
> +                        BDRV_ACCT_WRITE);
> +        r->req.aiocb = bdrv_aio_write_zeroes(s->qdev.conf.bs,
> +                                             r->req.cmd.lba * (s->qdev.blocksize / 512),
> +                                             nb_sectors * (s->qdev.blocksize / 512),
> +                                             flags, scsi_aio_complete, r);
> +        return;
> +    }
> +
> +    data = g_new0(WriteSameCBData, 1);
> +    data->r = r;
> +    data->sector = r->req.cmd.lba * (s->qdev.blocksize / 512);
> +    data->nb_sectors = nb_sectors * (s->qdev.blocksize / 512);
> +    data->iov.iov_len = MIN(data->nb_sectors * 512, SCSI_WRITE_SAME_MAX);
> +    data->iov.iov_base = buf = g_malloc(data->iov.iov_len);

qemu_blockalign() so the buffer is aligned for O_DIRECT.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME
  2013-11-20 14:18   ` Stefan Hajnoczi
@ 2013-11-20 14:19     ` Paolo Bonzini
  0 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-20 14:19 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, pl, qemu-devel, stefanha

Il 20/11/2013 15:18, Stefan Hajnoczi ha scritto:
>> > +    if (buffer_is_zero(inbuf, s->qdev.blocksize)) {
> Where is inbuf's size checked?  It must be s->qdev.blocksize for this
> code to be correct.
> 

See scsi_req_length:

    case WRITE_SAME_10:
    case WRITE_SAME_16:
        cmd->xfer = dev->blocksize;
        break;

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests
  2013-11-20 11:01     ` Paolo Bonzini
@ 2013-11-20 14:29       ` Stefan Hajnoczi
  0 siblings, 0 replies; 42+ messages in thread
From: Stefan Hajnoczi @ 2013-11-20 14:29 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, Stefan Hajnoczi, pl, qemu-devel

On Wed, Nov 20, 2013 at 12:01:46PM +0100, Paolo Bonzini wrote:
> Il 20/11/2013 11:22, Stefan Hajnoczi ha scritto:
> >> > +            && num > bs->bl.write_zeroes_alignment) {
> > Here '>' is used...
> > 
> >> > +            if (sector_num % bs->bl.write_zeroes_alignment != 0) {
> >> > +                /* Make a small request up to the first aligned sector.  */
> >> >                  num = bs->bl.write_zeroes_alignment;
> >> > +                num -= sector_num % bs->bl.write_zeroes_alignment;
> >> > +            } else if (num >= bs->bl.write_zeroes_alignment) {
> > ...but here '>=' is used.
> > 
> > The == case here cannot happen.  Did you mean '>=' in both places?
> 
> I meant what I wrote, in the sense that the two "if"s make sense the
> way I wrote them.  However, it is not too clear indeed.
> 
> >         if (bs->bl.write_zeroes_alignment
> >             && num > bs->bl.write_zeroes_alignment) {
> 
> Here, '>' is a necessary (though not sufficient) for being able to
> split the operation in one misaligned request and one aligned request.
> 
> If '<', the request is misaligned in either the starting sector
> or the ending sector (or both), and there's no need to split it.
> 
> If '==', either the request is aligned or we can only split it
> in two parts but they remain misaligned.  In either case there's
> no need to do anyting.
> 
> Because the condition is not sufficient, we may end up splitting a
> request that cannot be aligned anyway (e.g. sector_num = 1, num = 129,
> alignment = 128; will be split into [1,128) and [128,130) which are
> both misaligned.  This is not important.
> 
> >             if (sector_num % bs->bl.write_zeroes_alignment != 0) {
> >                 /* Make a small request up to the first aligned sector.  */
> >                 num = bs->bl.write_zeroes_alignment;
> >                 num -= sector_num % bs->bl.write_zeroes_alignment;
> >             } else if (num >= bs->bl.write_zeroes_alignment) {
> >                 /* Shorten the request to the last aligned sector.  */
> >                 num -= (sector_num + num) % bs->bl.write_zeroes_alignment;
> 
> Here, the "if" checks that we can have no underflow in the subtraction.
> However, in addition to what you pointed out, it's not immediately obvious
> that the subtraction has no effect if sector_num and num are correctly aligned.
> 
> So I will rewrite the "if" this way:
> 
>             if (sector_num % bs->bl.write_zeroes_alignment != 0) {
>                 /* Make a small request up to the first aligned sector.  */
>                 num = bs->bl.write_zeroes_alignment;
>                 num -= sector_num % bs->bl.write_zeroes_alignment;
>             } else if ((sector_num + num) % bs->bl.write_zeroes_alignment != 0) {
>                 /* Shorten the request to the last aligned sector.  num cannot
>                  * underflow because num > bs->bl.write_zeroes_alignment.
>                  */
>                 num -= (sector_num + num) % bs->bl.write_zeroes_alignment;
>             }

Thanks, that is clearer.

Stefan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests Paolo Bonzini
  2013-11-20 10:22   ` Stefan Hajnoczi
@ 2013-11-21 11:30   ` Peter Lieven
  2013-11-21 11:37     ` Paolo Bonzini
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Lieven @ 2013-11-21 11:30 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, stefanha

On 19.11.2013 18:07, Paolo Bonzini wrote:
> Right now, bdrv_co_do_write_zeroes will only try to align the
> beginning of the request.  However, it is simpler for many
> formats to expect the block layer to separate both the head *and*
> the tail.  This makes sure that the format's bdrv_co_write_zeroes
> function will be called with aligned sector_num and nb_sectors for
> the bulk of the request.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   block.c | 30 +++++++++++++++---------------
>   1 file changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/block.c b/block.c
> index 4897649..567d669 100644
> --- a/block.c
> +++ b/block.c
> @@ -2761,14 +2761,19 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>       while (nb_sectors > 0 && !ret) {
>           int num = nb_sectors;
>   
> -        /* align request */
> -        if (bs->bl.write_zeroes_alignment &&
> -            num >= bs->bl.write_zeroes_alignment &&
> -            sector_num % bs->bl.write_zeroes_alignment) {
> -            if (num > bs->bl.write_zeroes_alignment) {
> +        /* Align request.  Block drivers can expect the "bulk" of the request
> +         * to be aligned.
> +         */
> +        if (bs->bl.write_zeroes_alignment
> +            && num > bs->bl.write_zeroes_alignment) {
> +            if (sector_num % bs->bl.write_zeroes_alignment != 0) {
> +                /* Make a small request up to the first aligned sector.  */
>                   num = bs->bl.write_zeroes_alignment;
> +                num -= sector_num % bs->bl.write_zeroes_alignment;
> +            } else if (num >= bs->bl.write_zeroes_alignment) {
> +                /* Shorten the request to the last aligned sector.  */
> +                num -= (sector_num + num) % bs->bl.write_zeroes_alignment;
>               }
> -            num -= sector_num % bs->bl.write_zeroes_alignment;
>           }
>   
>           /* limit request size */
> @@ -2785,24 +2790,27 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
>           if (ret == -ENOTSUP) {
>               /* Fall back to bounce buffer if write zeroes is unsupported */
>               iov.iov_len = num * BDRV_SECTOR_SIZE;
>               if (iov.iov_base == NULL) {
> -                /* allocate bounce buffer only once and ensure that it
> -                 * is big enough for this and all future requests.
> -                 */
> -                size_t bufsize = num <= nb_sectors ? num : max_write_zeroes;
> -                iov.iov_base = qemu_blockalign(bs, bufsize * BDRV_SECTOR_SIZE);
> -                memset(iov.iov_base, 0, bufsize * BDRV_SECTOR_SIZE);
> +                iov.iov_base = qemu_blockalign(bs, num * BDRV_SECTOR_SIZE);
> +                memset(iov.iov_base, 0, num * BDRV_SECTOR_SIZE);
>               }
>               qemu_iovec_init_external(&qiov, &iov, 1);
>   
>               ret = drv->bdrv_co_writev(bs, sector_num, num, &qiov);
> +            if (num <= max_write_zeroes) {
This frees the bounce buffer ever and ever. num can never be greater than
max_write_zeroes. So you need a num < max_write_zeroes here.

Peter
> +                /* Allocate bounce buffer only once if it is
> +                 * big enough for this and all future requests.
> +                 */
> +                qemu_vfree(iov.iov_base);
> +                iov.iov_base = NULL;
> +            }
>           }
>   
>           sector_num += num;
>           nb_sectors -= num;
>       }
>   
>       qemu_vfree(iov.iov_base);
>       return ret;
>   }
>   


-- 

Mit freundlichen Grüßen

Peter Lieven

...........................................................

   KAMP Netzwerkdienste GmbH
   Vestische Str. 89-91 | 46117 Oberhausen
   Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
   pl@kamp.de | http://www.kamp.de

   Geschäftsführer: Heiner Lante | Michael Lante
   Amtsgericht Duisburg | HRB Nr. 12154
   USt-Id-Nr.: DE 120607556

...........................................................

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/20] block drivers: add discard/write_zeroes properties to bdrv_get_info implementation
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 08/20] block drivers: add discard/write_zeroes properties to bdrv_get_info implementation Paolo Bonzini
@ 2013-11-21 11:33   ` Peter Lieven
  2013-11-21 11:39     ` Paolo Bonzini
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Lieven @ 2013-11-21 11:33 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, stefanha

On 19.11.2013 18:07, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   block/qcow2.c | 2 ++
>   block/qed.c   | 2 ++
>   block/vdi.c   | 3 +++
>   block/vhdx.c  | 3 +++
>   block/vpc.c   | 2 ++
>   5 files changed, 12 insertions(+)
>
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 2fe37ed..92ed895 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -1892,6 +1892,8 @@ static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
>   static int qcow2_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>   {
>       BDRVQcowState *s = bs->opaque;
> +    bdi->unallocated_blocks_are_zero = (bs->backing_hd == NULL);
> +    bdi->can_write_zeroes_with_unmap = (s->qcow_version >= 3);
>       bdi->cluster_size = s->cluster_size;
>       bdi->vm_state_offset = qcow2_vm_state_offset(s);
>       return 0;
> diff --git a/block/qed.c b/block/qed.c
> index adc2736..1f87420 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -1475,6 +1475,8 @@ static int bdrv_qed_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>       memset(bdi, 0, sizeof(*bdi));
>       bdi->cluster_size = s->header.cluster_size;
>       bdi->is_dirty = s->header.features & QED_F_NEED_CHECK;
> +    bdi->unallocated_blocks_are_zero = (bs->backing_hd == NULL);
> +    bdi->can_write_zeroes_with_unmap = true;
>       return 0;
>   }
>   
> diff --git a/block/vdi.c b/block/vdi.c
> index b6ec002..3a05027 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -331,6 +331,9 @@ static int vdi_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>       logout("\n");
>       bdi->cluster_size = s->block_size;
>       bdi->vm_state_offset = 0;
> +
> +    /* Backing file not supported yet.  */
> +    bdi->unallocated_blocks_are_zero = true;
If its not supported why not checking for

(bs->backing_hd == NULL)

in case its added later and this place is overseen.


>       return 0;
>   }
>   
> diff --git a/block/vhdx.c b/block/vhdx.c
> index 9ab2b39..d06585a 100644
> --- a/block/vhdx.c
> +++ b/block/vhdx.c
> @@ -1050,6 +1050,9 @@ static int vhdx_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>       bdi->cluster_size =
>           (s->logical_sector_size / BDRV_SECTOR_SIZE) * s->block_size;
>   
> +    bdi->unallocated_blocks_are_zero =
> +        (s->params.data_bits & VHDX_PARAMS_HAS_PARENT) == 0;
> +
>       return 0;
>   }
>   
> diff --git a/block/vpc.c b/block/vpc.c
> index 551876f..d6a47ef 100644
> --- a/block/vpc.c
> +++ b/block/vpc.c
> @@ -464,6 +464,8 @@ static int vpc_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>           bdi->cluster_size = s->block_size;
>       }
>   
> +    /* Backing file not supported yet.  */
> +    bdi->unallocated_blocks_are_zero = true;
same here.
>       return 0;
>   }
>   

Peter

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests
  2013-11-21 11:30   ` Peter Lieven
@ 2013-11-21 11:37     ` Paolo Bonzini
  0 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-21 11:37 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, qemu-devel, stefanha

Il 21/11/2013 12:30, Peter Lieven ha scritto:
>> +            if (num <= max_write_zeroes) {
> This frees the bounce buffer ever and ever. num can never be greater than
> max_write_zeroes. So you need a num < max_write_zeroes here.

Thanks, well spotted.

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/20] block drivers: add discard/write_zeroes properties to bdrv_get_info implementation
  2013-11-21 11:33   ` Peter Lieven
@ 2013-11-21 11:39     ` Paolo Bonzini
  2013-11-21 11:48       ` Peter Lieven
  0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-21 11:39 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, qemu-devel, stefanha

Il 21/11/2013 12:33, Peter Lieven ha scritto:
> On 19.11.2013 18:07, Paolo Bonzini wrote:
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>   block/qcow2.c | 2 ++
>>   block/qed.c   | 2 ++
>>   block/vdi.c   | 3 +++
>>   block/vhdx.c  | 3 +++
>>   block/vpc.c   | 2 ++
>>   5 files changed, 12 insertions(+)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 2fe37ed..92ed895 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -1892,6 +1892,8 @@ static coroutine_fn int
>> qcow2_co_flush_to_os(BlockDriverState *bs)
>>   static int qcow2_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>>   {
>>       BDRVQcowState *s = bs->opaque;
>> +    bdi->unallocated_blocks_are_zero = (bs->backing_hd == NULL);
>> +    bdi->can_write_zeroes_with_unmap = (s->qcow_version >= 3);
>>       bdi->cluster_size = s->cluster_size;
>>       bdi->vm_state_offset = qcow2_vm_state_offset(s);
>>       return 0;
>> diff --git a/block/qed.c b/block/qed.c
>> index adc2736..1f87420 100644
>> --- a/block/qed.c
>> +++ b/block/qed.c
>> @@ -1475,6 +1475,8 @@ static int bdrv_qed_get_info(BlockDriverState
>> *bs, BlockDriverInfo *bdi)
>>       memset(bdi, 0, sizeof(*bdi));
>>       bdi->cluster_size = s->header.cluster_size;
>>       bdi->is_dirty = s->header.features & QED_F_NEED_CHECK;
>> +    bdi->unallocated_blocks_are_zero = (bs->backing_hd == NULL);
>> +    bdi->can_write_zeroes_with_unmap = true;
>>       return 0;
>>   }
>>   diff --git a/block/vdi.c b/block/vdi.c
>> index b6ec002..3a05027 100644
>> --- a/block/vdi.c
>> +++ b/block/vdi.c
>> @@ -331,6 +331,9 @@ static int vdi_get_info(BlockDriverState *bs,
>> BlockDriverInfo *bdi)
>>       logout("\n");
>>       bdi->cluster_size = s->block_size;
>>       bdi->vm_state_offset = 0;
>> +
>> +    /* Backing file not supported yet.  */
>> +    bdi->unallocated_blocks_are_zero = true;
> If its not supported why not checking for
> 
> (bs->backing_hd == NULL)
> 
> in case its added later and this place is overseen.

Actually, bdrv_unallocated_blocks_are_zero checks bs->backing_hd so it's
qcow2 and qed that should have "true" instead.

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/20] block/iscsi: use UNMAP to write zeroes if LBPRZ=1
  2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 13/20] block/iscsi: use UNMAP to write zeroes if LBPRZ=1 Paolo Bonzini
@ 2013-11-21 11:43   ` Peter Lieven
  2013-11-21 11:49     ` Paolo Bonzini
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Lieven @ 2013-11-21 11:43 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: kwolf, stefanha

On 19.11.2013 18:07, Paolo Bonzini wrote:
> The latest revision of SCSI SBC clarifies the semantics of LBPRZ
> in a way that lets the iscsi driver remove the distinction between
> bdrv_unallocated_blocks_are_zero and bdrv_can_write_zeroes_with_unmap.
> See Table 8:
>
>      "[If] the LBA became mapped as the result of an autonomous transition,
>      and no write command has specified that LBA since the LBA was mapped,
>      [a read operation returns the] logical block data that would be returned
>      if that autonomous transition had not occurred and the LBA had remained
>      unmapped."
>
> Thus, we can assume that even after an UNMAP command, unallocated blocks
> return zero.  The distinction may remain for other drivers, for example
> qcow2 or qed or vmdk.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   block/iscsi.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block/iscsi.c b/block/iscsi.c
> index 93fee6d..63b451d 100644
> --- a/block/iscsi.c
> +++ b/block/iscsi.c
> @@ -984,6 +984,9 @@ coroutine_fn iscsi_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
>   
>       if ((flags & BDRV_REQ_MAY_UNMAP) && !iscsilun->lbp.lbpws) {
>           /* WRITE SAME with UNMAP is not supported by the target */
> +        if (iscsilun->lbp.lbpu && iscsilun->lbprz) {
> +            return iscsi_co_discard(bs, sector_num, nb_sectors);
> +        }
>           return -ENOTSUP;
>       }
>   
> @@ -1579,7 +1582,7 @@ static int iscsi_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>   {
>       IscsiLun *iscsilun = bs->opaque;
>       bdi->unallocated_blocks_are_zero = !!iscsilun->lbprz;
> -    bdi->can_write_zeroes_with_unmap = iscsilun->lbprz && iscsilun->lbp.lbpws;
> +    bdi->can_write_zeroes_with_unmap = !!iscsilun->lbprz;
>       return 0;
>   }
>   
I would definetly not do that! I have seen at least my target to execute only parts of a discard request.
Additionally in our semantic a discard request may silently fail. In general this could lead to data corruption
due to broken implementations.

Peter

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 08/20] block drivers: add discard/write_zeroes properties to bdrv_get_info implementation
  2013-11-21 11:39     ` Paolo Bonzini
@ 2013-11-21 11:48       ` Peter Lieven
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Lieven @ 2013-11-21 11:48 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, qemu-devel, stefanha

On 21.11.2013 12:39, Paolo Bonzini wrote:
> Il 21/11/2013 12:33, Peter Lieven ha scritto:
>> On 19.11.2013 18:07, Paolo Bonzini wrote:
>>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>>> ---
>>>    block/qcow2.c | 2 ++
>>>    block/qed.c   | 2 ++
>>>    block/vdi.c   | 3 +++
>>>    block/vhdx.c  | 3 +++
>>>    block/vpc.c   | 2 ++
>>>    5 files changed, 12 insertions(+)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index 2fe37ed..92ed895 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -1892,6 +1892,8 @@ static coroutine_fn int
>>> qcow2_co_flush_to_os(BlockDriverState *bs)
>>>    static int qcow2_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>>>    {
>>>        BDRVQcowState *s = bs->opaque;
>>> +    bdi->unallocated_blocks_are_zero = (bs->backing_hd == NULL);
>>> +    bdi->can_write_zeroes_with_unmap = (s->qcow_version >= 3);
>>>        bdi->cluster_size = s->cluster_size;
>>>        bdi->vm_state_offset = qcow2_vm_state_offset(s);
>>>        return 0;
>>> diff --git a/block/qed.c b/block/qed.c
>>> index adc2736..1f87420 100644
>>> --- a/block/qed.c
>>> +++ b/block/qed.c
>>> @@ -1475,6 +1475,8 @@ static int bdrv_qed_get_info(BlockDriverState
>>> *bs, BlockDriverInfo *bdi)
>>>        memset(bdi, 0, sizeof(*bdi));
>>>        bdi->cluster_size = s->header.cluster_size;
>>>        bdi->is_dirty = s->header.features & QED_F_NEED_CHECK;
>>> +    bdi->unallocated_blocks_are_zero = (bs->backing_hd == NULL);
>>> +    bdi->can_write_zeroes_with_unmap = true;
>>>        return 0;
>>>    }
>>>    diff --git a/block/vdi.c b/block/vdi.c
>>> index b6ec002..3a05027 100644
>>> --- a/block/vdi.c
>>> +++ b/block/vdi.c
>>> @@ -331,6 +331,9 @@ static int vdi_get_info(BlockDriverState *bs,
>>> BlockDriverInfo *bdi)
>>>        logout("\n");
>>>        bdi->cluster_size = s->block_size;
>>>        bdi->vm_state_offset = 0;
>>> +
>>> +    /* Backing file not supported yet.  */
>>> +    bdi->unallocated_blocks_are_zero = true;
>> If its not supported why not checking for
>>
>> (bs->backing_hd == NULL)
>>
>> in case its added later and this place is overseen.
> Actually, bdrv_unallocated_blocks_are_zero checks bs->backing_hd so it's
> qcow2 and qed that should have "true" instead.
You are right ;-) Forgot my own code.

Peter

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/20] block/iscsi: use UNMAP to write zeroes if LBPRZ=1
  2013-11-21 11:43   ` Peter Lieven
@ 2013-11-21 11:49     ` Paolo Bonzini
  2013-11-21 11:54       ` Peter Lieven
  0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-21 11:49 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, qemu-devel, stefanha

Il 21/11/2013 12:43, Peter Lieven ha scritto:
>>
>>   @@ -1579,7 +1582,7 @@ static int iscsi_get_info(BlockDriverState
>> *bs, BlockDriverInfo *bdi)
>>   {
>>       IscsiLun *iscsilun = bs->opaque;
>>       bdi->unallocated_blocks_are_zero = !!iscsilun->lbprz;
>> -    bdi->can_write_zeroes_with_unmap = iscsilun->lbprz &&
>> iscsilun->lbp.lbpws;
>> +    bdi->can_write_zeroes_with_unmap = !!iscsilun->lbprz;
>>       return 0;
>>   }
>>   
> I would definetly not do that! I have seen at least my target to execute
> only parts of a discard request.

Does that target have LBPRZ and LBPU but not LBPWS?  Note that I'm still
preferring WRITE SAME to UNMAP if both are available.

If so, then this patch is indeed problematic.  Otherwise, it's just
making the same assumptions that Linux has been making forever.

> Additionally in our semantic a discard request may silently fail.

That doesn't matter, the silent failure is handled in block.c.  Here I'm
calling iscsi_co_discard, not bdrv_co_discard.  If it returns ENOTSUP,
it is passed up to bdrv_co_do_write_zeroes which will fall back to writes.

> In general this could lead to data corruption
> due to broken implementations.

A broken implementation could also have LBPWS=1 and execute only the
aligned parts of a WRITE SAME with UNMAP request.

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/20] block/iscsi: use UNMAP to write zeroes if LBPRZ=1
  2013-11-21 11:49     ` Paolo Bonzini
@ 2013-11-21 11:54       ` Peter Lieven
  2013-11-21 12:05         ` Paolo Bonzini
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Lieven @ 2013-11-21 11:54 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kwolf, qemu-devel, stefanha

On 21.11.2013 12:49, Paolo Bonzini wrote:
> Il 21/11/2013 12:43, Peter Lieven ha scritto:
>>>    @@ -1579,7 +1582,7 @@ static int iscsi_get_info(BlockDriverState
>>> *bs, BlockDriverInfo *bdi)
>>>    {
>>>        IscsiLun *iscsilun = bs->opaque;
>>>        bdi->unallocated_blocks_are_zero = !!iscsilun->lbprz;
>>> -    bdi->can_write_zeroes_with_unmap = iscsilun->lbprz &&
>>> iscsilun->lbp.lbpws;
>>> +    bdi->can_write_zeroes_with_unmap = !!iscsilun->lbprz;
>>>        return 0;
>>>    }
>>>    
>> I would definetly not do that! I have seen at least my target to execute
>> only parts of a discard request.
> Does that target have LBPRZ and LBPU but not LBPWS?  Note that I'm still
> preferring WRITE SAME to UNMAP if both are available.
it has both.
>
> If so, then this patch is indeed problematic.  Otherwise, it's just
> making the same assumptions that Linux has been making forever.
>
>> Additionally in our semantic a discard request may silently fail.
> That doesn't matter, the silent failure is handled in block.c.  Here I'm
> calling iscsi_co_discard, not bdrv_co_discard.  If it returns ENOTSUP,
> it is passed up to bdrv_co_do_write_zeroes which will fall back to writes.
i have seen that, if you insist to keep this patch, you have to change
the following to return -ENOTSUP in iscsi_co_discard:

         if (iTask.status == SCSI_STATUS_CHECK_CONDITION) {
             /* the target might fail with a check condition if it
                is not happy with the alignment of the UNMAP request
                we silently fail in this case */
             return 0;
         }
>
>> In general this could lead to data corruption
>> due to broken implementations.
> A broken implementation could also have LBPWS=1 and execute only the
> aligned parts of a WRITE SAME with UNMAP request.
You told that the last revision clarifies this part, so there is a change
that has been misinterpreted in the past.

What is the actual reason to add this tweak?
I would like to strictly map write_zeroes to write same. If it is not
supported, write plain zeroes.

Peter

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/20] block/iscsi: use UNMAP to write zeroes if LBPRZ=1
  2013-11-21 11:54       ` Peter Lieven
@ 2013-11-21 12:05         ` Paolo Bonzini
  0 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2013-11-21 12:05 UTC (permalink / raw)
  To: Peter Lieven; +Cc: kwolf, qemu-devel, stefanha

Il 21/11/2013 12:54, Peter Lieven ha scritto:
> What is the actual reason to add this tweak?

The reason is that there are targets that implement UNMAP but not WRITE
SAME.  But I see what you mean---such a target could validly ignore the
unaligned edges of the request.  I guess it could be made to work, but
unless someone really asks for it, let's drop the patch.

Paolo

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2013-11-21 12:05 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-11-19 17:07 [Qemu-devel] [PATCH v2 00/20] block & scsi: write_zeroes support through the whole stack Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 01/20] block: generalize BlockLimits handling to cover bdrv_aio_discard too Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 02/20] block: add flags to BlockRequest Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 03/20] block: add flags argument to bdrv_co_write_zeroes tracepoint Paolo Bonzini
2013-11-20  9:59   ` Stefan Hajnoczi
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 04/20] block: add bdrv_aio_write_zeroes Paolo Bonzini
2013-11-20 10:02   ` Stefan Hajnoczi
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 05/20] block: handle ENOTSUP from discard in generic code Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 06/20] block: make bdrv_co_do_write_zeroes stricter in producing aligned requests Paolo Bonzini
2013-11-20 10:22   ` Stefan Hajnoczi
2013-11-20 11:01     ` Paolo Bonzini
2013-11-20 14:29       ` Stefan Hajnoczi
2013-11-21 11:30   ` Peter Lieven
2013-11-21 11:37     ` Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 07/20] vpc, vhdx: add get_info Paolo Bonzini
2013-11-20 12:39   ` Stefan Hajnoczi
2013-11-20 12:50     ` Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 08/20] block drivers: add discard/write_zeroes properties to bdrv_get_info implementation Paolo Bonzini
2013-11-21 11:33   ` Peter Lieven
2013-11-21 11:39     ` Paolo Bonzini
2013-11-21 11:48       ` Peter Lieven
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 09/20] block drivers: expose requirement for write same alignment from formats Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 10/20] block/iscsi: remove .bdrv_has_zero_init Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 11/20] block/iscsi: updated copyright Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 12/20] block/iscsi: check WRITE SAME support differently depending on MAY_UNMAP Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 13/20] block/iscsi: use UNMAP to write zeroes if LBPRZ=1 Paolo Bonzini
2013-11-21 11:43   ` Peter Lieven
2013-11-21 11:49     ` Paolo Bonzini
2013-11-21 11:54       ` Peter Lieven
2013-11-21 12:05         ` Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 14/20] raw-posix: implement write_zeroes with MAY_UNMAP for files Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 15/20] raw-posix: implement write_zeroes with MAY_UNMAP for block devices Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 16/20] raw-posix: add support for write_zeroes on XFS and " Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 17/20] qemu-iotests: 033 is fast Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 18/20] scsi-disk: catch write protection errors in UNMAP Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 19/20] scsi-disk: reject ANCHOR=1 for UNMAP and WRITE SAME commands Paolo Bonzini
2013-11-19 17:07 ` [Qemu-devel] [PATCH v2 20/20] scsi-disk: correctly implement WRITE SAME Paolo Bonzini
2013-11-19 17:23   ` ronnie sahlberg
2013-11-19 17:27     ` ronnie sahlberg
2013-11-19 17:31     ` Paolo Bonzini
2013-11-20 14:18   ` Stefan Hajnoczi
2013-11-20 14:19     ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).