* [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation
@ 2013-12-06 17:22 Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 01/19] qemu_memalign: Allow small alignments Kevin Wolf
` (20 more replies)
0 siblings, 21 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
This patch series adds code to the block layer that allows performing
I/O requests in smaller granularities than required by the host backend
(most importantly, O_DIRECT restrictions). It achieves this for reads
by rounding the request to host-side block boundary, and for writes by
performing a read-modify-write cycle (and serialising requests
touching the same block so that the RMW doesn't write back stale data).
Originally I intended to reuse a lot of code from Paolo's previous
patch series, however as I tried to integrate pread/pwrite, which
already do a very similar thing (except for considering concurrency),
and because I wanted to implement zero-copy, most of this series ended
up being new code.
Zero-copy is possible in a common case because while XFS defauls to a
4k sector size and therefore 4k on-disk O_DIRECT alignment for 512E
disks, it still only has a 512 byte memory alignment requirement.
(Unfortunately the XFS_IOC_DIOINFO ioctl claims 4k even for memory, but
we know that the value is wrong and can probe it.)
This series does not cover 4k guests on a 512 byte host, and I'm not
sure yet what to do with this case. Paolos series contained a patch to
protect against "torn reads" (i.e. reads running in parallel with
writes, which return old data for one half of a sector and new data for
the other half) by serialising requests if the guest block size was
greater than the host block size.
One problem with this approach is that it assumes that a single host
block size even exists and can be compared against on the top level.
Different backing files can be stored on different storage, though, with
different block sizes.
Another problem is that block drivers can split requests internally
(imagine a qcow2 image with 512 byte clusters), which would have to be
detected as well.
Finally, it's unclear what to do with cache modes using the kernel page
cache. Technically, these have a required alignment of 1 byte, which is
always smaller than the guest alignment. We always have to expect short
writes, so we can't say "it's always the granularity of the request".
However, serialising _every_ request certainly doesn't seem reasonable;
we've never done it, and we've never got any bug reports.
Other non-file protocols may have the same problem.
(And all of this is ignoring that with multiple users of the block
device - e.g. guest device, NBD server, block jobs - there isn't even a
single guest block size, but it must be passed per request if done
properly.)
Anyway, so I'm hoping for a review of this series in order to get
512b-on-4k merged soon, and some help/discussion for the 4k-on-512
case.
Kevin Wolf (17):
qemu_memalign: Allow small alignments
block: Detect unaligned length in bdrv_qiov_is_aligned()
block: Don't use guest sector size for qemu_blockalign()
block: Introduce bdrv_aligned_preadv()
block: Introduce bdrv_co_do_preadv()
block: Introduce bdrv_aligned_pwritev()
block: write: Handle COR dependency after I/O throttling
block: Introduce bdrv_co_do_pwritev()
block: Switch BdrvTrackedRequest to byte granularity
block: Allow waiting for overlapping requests between begin/end
block: Make zero-after-EOF work with larger alignment
block: Generalise and optimise COR serialisation
block: Make overlap range for serialisation dynamic
block: Align requests in bdrv_co_do_pwritev()
block: Change coroutine wrapper to byte granularity
block: Make bdrv_pread() a bdrv_prwv_co() wrapper
block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper
Paolo Bonzini (2):
block: rename buffer_alignment to guest_block_size
raw: Probe required direct I/O alignment
block.c | 572 ++++++++++++++++++++++++++++++----------------
block/backup.c | 7 +-
block/raw-posix.c | 102 +++++++--
block/raw-win32.c | 41 ++++
hw/block/virtio-blk.c | 2 +-
hw/ide/core.c | 2 +-
hw/scsi/scsi-disk.c | 2 +-
hw/scsi/scsi-generic.c | 2 +-
include/block/block.h | 3 +-
include/block/block_int.h | 24 +-
util/oslib-posix.c | 5 +
11 files changed, 539 insertions(+), 223 deletions(-)
--
1.8.1.4
^ permalink raw reply [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 01/19] qemu_memalign: Allow small alignments
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 02/19] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
` (19 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
The functions used by qemu_memalign() require an alignment that is at
least sizeof(void*). Adjust it if it is too small.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
util/oslib-posix.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e00a44c..54f8932 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -85,6 +85,11 @@ void *qemu_oom_check(void *ptr)
void *qemu_memalign(size_t alignment, size_t size)
{
void *ptr;
+
+ if (alignment < sizeof(void*)) {
+ alignment = sizeof(void*);
+ }
+
#if defined(_POSIX_C_SOURCE) && !defined(__sun__)
int ret;
ret = posix_memalign(&ptr, alignment, size);
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 02/19] block: Detect unaligned length in bdrv_qiov_is_aligned()
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 01/19] qemu_memalign: Allow small alignments Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 19:12 ` Eric Blake
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 03/19] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
` (18 subsequent siblings)
20 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
For an O_DIRECT request to succeed, it's not only necessary that all
base addresses in the qiov are aligned, but also that each lengh in it
is aligned.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/block.c b/block.c
index 382ea71..613201b 100644
--- a/block.c
+++ b/block.c
@@ -4349,6 +4349,9 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
if ((uintptr_t) qiov->iov[i].iov_base % bs->buffer_alignment) {
return false;
}
+ if (qiov->iov[i].iov_len % bs->buffer_alignment) {
+ return false;
+ }
}
return true;
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 03/19] block: Don't use guest sector size for qemu_blockalign()
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 01/19] qemu_memalign: Allow small alignments Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 02/19] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-10 3:18 ` Wenchao Xia
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 04/19] block: rename buffer_alignment to guest_block_size Kevin Wolf
` (17 subsequent siblings)
20 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
bs->buffer_alignment is set by the device emulation and contains the
logical block size of the guest device. This isn't something that the
block layer should know, and even less something to use for determining
the right alignment of buffers to be used for the host.
The new function bdrv_opt_mem_align() allows for hooks in a BlockDriver
so that it can tell the qemu block layer the optimal alignment to be
used so that no bounce buffer must be used in the driver.
This patch may change the buffer alignment from 4k to 512 for all
callers that used qemu_blockalign() with the top-level image format
BlockDriverState. The value was never propagated to other levels in the
tree, so in particular raw-posix never required anything else than 512.
While on disks with 4k sectors direct I/O requires a 4k alignment,
memory may still be okay when aligned to 512 byte boundaries. This is
what must have happened in practice, because otherwise this would
already have failed earlier. Therefore I don't expect regressions even
with this intermediate state. Later, raw-posix can implement the hook
and expose a different memory alignment requirement.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 32 +++++++++++++++++++++++++++++---
include/block/block.h | 1 +
include/block/block_int.h | 4 ++++
3 files changed, 34 insertions(+), 3 deletions(-)
diff --git a/block.c b/block.c
index 613201b..669793b 100644
--- a/block.c
+++ b/block.c
@@ -213,6 +213,31 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
qemu_co_queue_next(&bs->throttled_reqs[is_write]);
}
+size_t bdrv_opt_mem_align(BlockDriverState *bs)
+{
+ size_t alignment;
+
+ if (!bs || !bs->drv) {
+ /* 4k should be on the safe side */
+ return 4096;
+ }
+
+ if (bs->drv->bdrv_opt_mem_align) {
+ return bs->drv->bdrv_opt_mem_align(bs);
+ }
+
+ if (bs->file) {
+ alignment = bdrv_opt_mem_align(bs->file);
+ } else {
+ alignment = 512;
+ }
+
+ if (bs->backing_hd) {
+ alignment = MAX(alignment, bdrv_opt_mem_align(bs->backing_hd));
+ }
+ return alignment;
+}
+
/* check if the path starts with "<protocol>:" */
static int path_has_protocol(const char *path)
{
@@ -4335,7 +4360,7 @@ void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
void *qemu_blockalign(BlockDriverState *bs, size_t size)
{
- return qemu_memalign((bs && bs->buffer_alignment) ? bs->buffer_alignment : 512, size);
+ return qemu_memalign(bdrv_opt_mem_align(bs), size);
}
/*
@@ -4344,12 +4369,13 @@ void *qemu_blockalign(BlockDriverState *bs, size_t size)
bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
{
int i;
+ size_t alignment = bdrv_opt_mem_align(bs);
for (i = 0; i < qiov->niov; i++) {
- if ((uintptr_t) qiov->iov[i].iov_base % bs->buffer_alignment) {
+ if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
return false;
}
- if (qiov->iov[i].iov_len % bs->buffer_alignment) {
+ if (qiov->iov[i].iov_len % alignment) {
return false;
}
}
diff --git a/include/block/block.h b/include/block/block.h
index 3560deb..d262c0e 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -383,6 +383,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
char *options, uint64_t img_size, int flags,
Error **errp, bool quiet);
+size_t bdrv_opt_mem_align(BlockDriverState *bs);
void bdrv_set_buffer_alignment(BlockDriverState *bs, int align);
void *qemu_blockalign(BlockDriverState *bs, size_t size);
bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1666066..6a84f83 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -119,6 +119,10 @@ struct BlockDriver {
int64_t sector_num, int nb_sectors,
BlockDriverCompletionFunc *cb, void *opaque);
+ /* Returns the alignment in bytes that is required so that no bounce buffer
+ * is required throughout the stack */
+ int (*bdrv_opt_mem_align)(BlockDriverState *bs);
+
int coroutine_fn (*bdrv_co_readv)(BlockDriverState *bs,
int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 04/19] block: rename buffer_alignment to guest_block_size
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (2 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 03/19] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-10 3:25 ` Wenchao Xia
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 05/19] raw: Probe required direct I/O alignment Kevin Wolf
` (16 subsequent siblings)
20 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
From: Paolo Bonzini <pbonzini@redhat.com>
The alignment field is now set to the value that is promised to the
guest, rather than required by the host. The next patches will make
QEMU aware of the host-provided values, so make this clear.
The alignment is also not about memory buffers, but about the sectors on
the disk, change the documentation of the field.
At this point, the field is set by the device emulation, but completely
ignored by the block layer.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 10 +++++-----
hw/block/virtio-blk.c | 2 +-
hw/ide/core.c | 2 +-
hw/scsi/scsi-disk.c | 2 +-
hw/scsi/scsi-generic.c | 2 +-
include/block/block.h | 2 +-
include/block/block_int.h | 4 ++--
7 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/block.c b/block.c
index 669793b..34fedf0 100644
--- a/block.c
+++ b/block.c
@@ -788,7 +788,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
}
bs->open_flags = flags;
- bs->buffer_alignment = 512;
+ bs->guest_block_size = 512;
bs->zero_beyond_eof = true;
open_flags = bdrv_open_flags(bs, flags);
bs->read_only = !(open_flags & BDRV_O_RDWR);
@@ -1625,7 +1625,7 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
bs_dest->dev_ops = bs_src->dev_ops;
bs_dest->dev_opaque = bs_src->dev_opaque;
bs_dest->dev = bs_src->dev;
- bs_dest->buffer_alignment = bs_src->buffer_alignment;
+ bs_dest->guest_block_size = bs_src->guest_block_size;
bs_dest->copy_on_read = bs_src->copy_on_read;
bs_dest->enable_write_cache = bs_src->enable_write_cache;
@@ -1776,7 +1776,7 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev)
bs->dev = NULL;
bs->dev_ops = NULL;
bs->dev_opaque = NULL;
- bs->buffer_alignment = 512;
+ bs->guest_block_size = 512;
}
/* TODO change to return DeviceState * when all users are qdevified */
@@ -4353,9 +4353,9 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
return NULL;
}
-void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
+void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
{
- bs->buffer_alignment = align;
+ bs->guest_block_size = align;
}
void *qemu_blockalign(BlockDriverState *bs, size_t size)
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 13f6d82..323e9ec 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -720,7 +720,7 @@ static int virtio_blk_device_init(VirtIODevice *vdev)
register_savevm(qdev, "virtio-blk", virtio_blk_id++, 2,
virtio_blk_save, virtio_blk_load, s);
bdrv_set_dev_ops(s->bs, &virtio_block_ops, s);
- bdrv_set_buffer_alignment(s->bs, s->conf->logical_block_size);
+ bdrv_set_guest_block_size(s->bs, s->conf->logical_block_size);
bdrv_iostatus_enable(s->bs);
diff --git a/hw/ide/core.c b/hw/ide/core.c
index e1f4c33..036cd4a 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -2103,7 +2103,7 @@ int ide_init_drive(IDEState *s, BlockDriverState *bs, IDEDriveKind kind,
s->smart_selftest_count = 0;
if (kind == IDE_CD) {
bdrv_set_dev_ops(bs, &ide_cd_block_ops, s);
- bdrv_set_buffer_alignment(bs, 2048);
+ bdrv_set_guest_block_size(bs, 2048);
} else {
if (!bdrv_is_inserted(s->bs)) {
error_report("Device needs media, but drive is empty");
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 74e6a14..1043a5f 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -2117,7 +2117,7 @@ static int scsi_initfn(SCSIDevice *dev)
} else {
bdrv_set_dev_ops(s->qdev.conf.bs, &scsi_disk_block_ops, s);
}
- bdrv_set_buffer_alignment(s->qdev.conf.bs, s->qdev.blocksize);
+ bdrv_set_guest_block_size(s->qdev.conf.bs, s->qdev.blocksize);
bdrv_iostatus_enable(s->qdev.conf.bs);
add_boot_device_path(s->qdev.conf.bootindex, &dev->qdev, NULL);
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 8f195be..f08b64e 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -210,7 +210,7 @@ static void scsi_read_complete(void * opaque, int ret)
s->blocksize = ldl_be_p(&r->buf[8]);
s->max_lba = ldq_be_p(&r->buf[0]);
}
- bdrv_set_buffer_alignment(s->conf.bs, s->blocksize);
+ bdrv_set_guest_block_size(s->conf.bs, s->blocksize);
scsi_req_data(&r->req, len);
if (!r->req.io_canceled) {
diff --git a/include/block/block.h b/include/block/block.h
index d262c0e..564c1c3 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -384,7 +384,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
Error **errp, bool quiet);
size_t bdrv_opt_mem_align(BlockDriverState *bs);
-void bdrv_set_buffer_alignment(BlockDriverState *bs, int align);
+void bdrv_set_guest_block_size(BlockDriverState *bs, int align);
void *qemu_blockalign(BlockDriverState *bs, size_t size);
bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 6a84f83..760cfc5 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -293,8 +293,8 @@ struct BlockDriverState {
/* Whether produces zeros when read beyond eof */
bool zero_beyond_eof;
- /* the memory alignment required for the buffers handled by this driver */
- int buffer_alignment;
+ /* the block size for which the guest device expects atomicity */
+ int guest_block_size;
/* do we need to tell the quest if we have a volatile write cache? */
int enable_write_cache;
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 05/19] raw: Probe required direct I/O alignment
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (3 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 04/19] block: rename buffer_alignment to guest_block_size Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:53 ` Paolo Bonzini
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 06/19] block: Introduce bdrv_aligned_preadv() Kevin Wolf
` (15 subsequent siblings)
20 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
From: Paolo Bonzini <pbonzini@redhat.com>
Add a bs->request_alignment field that contains the required
offset/length alignment for I/O requests and fill it in the raw block
drivers. Use ioctls if possible, else see what alignment it takes for
O_DIRECT to succeed.
While at it, also expose the memory alignment requirements, which may be
(and in practice are) different from the disk alignment requirements.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 4 ++
block/raw-posix.c | 102 ++++++++++++++++++++++++++++++++++++++--------
block/raw-win32.c | 41 +++++++++++++++++++
include/block/block_int.h | 3 ++
4 files changed, 132 insertions(+), 18 deletions(-)
diff --git a/block.c b/block.c
index 34fedf0..b4b17fd 100644
--- a/block.c
+++ b/block.c
@@ -789,6 +789,7 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
bs->open_flags = flags;
bs->guest_block_size = 512;
+ bs->request_alignment = 512;
bs->zero_beyond_eof = true;
open_flags = bdrv_open_flags(bs, flags);
bs->read_only = !(open_flags & BDRV_O_RDWR);
@@ -856,6 +857,9 @@ static int bdrv_open_common(BlockDriverState *bs, BlockDriverState *file,
goto free_and_fail;
}
+ assert(bdrv_opt_mem_align(bs) != 0);
+ assert(bs->request_alignment != 0);
+
#ifndef _WIN32
if (bs->is_temporary) {
assert(bs->filename[0] != '\0');
diff --git a/block/raw-posix.c b/block/raw-posix.c
index f836c8e..732aff5 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -127,6 +127,8 @@ typedef struct BDRVRawState {
int fd;
int type;
int open_flags;
+ size_t buf_align;
+
#if defined(__linux__)
/* linux floppy specific */
int64_t fd_open_time;
@@ -211,6 +213,76 @@ static int raw_normalize_devicepath(const char **filename)
}
#endif
+static void raw_probe_alignment(BlockDriverState *bs)
+{
+ BDRVRawState *s = bs->opaque;
+ char *buf;
+ unsigned int sector_size;
+
+ /* For /dev/sg devices the alignment is not really used.
+ With buffered I/O, we don't have any restrictions. */
+ if (bs->sg || !(s->open_flags & O_DIRECT)) {
+ bs->request_alignment = 1;
+ s->buf_align = 1;
+ return;
+ }
+
+ /* Try a few ioctls to get the right size */
+ bs->request_alignment = 0;
+ s->buf_align = 0;
+
+#ifdef BLKSSZGET
+ if (ioctl(s->fd, BLKSSZGET, §or_size) >= 0) {
+ bs->request_alignment = sector_size;
+ }
+#endif
+#ifdef DKIOCGETBLOCKSIZE
+ if (ioctl(s->fd, DKIOCGETBLOCKSIZE, §or_size) >= 0) {
+ bs->request_alignment = sector_size;
+ }
+#endif
+#ifdef DIOCGSECTORSIZE
+ if (ioctl(s->fd, DIOCGSECTORSIZE, §or_size) >= 0) {
+ bs->request_alignment = sector_size;
+ }
+#endif
+#ifdef CONFIG_XFS
+ if (s->is_xfs) {
+ struct dioattr da;
+ if (xfsctl(NULL, s->fd, XFS_IOC_DIOINFO, &da) >= 0) {
+ bs->request_alignment = da.d_miniosz;
+ /* The kernel returns wrong information for d_mem */
+ /* s->buf_align = da.d_mem; */
+ }
+ }
+#endif
+
+ /* If we could not get the sizes so far, we can only guess them */
+ if (!s->buf_align) {
+ size_t align;
+ buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
+ for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+ if (pread(s->fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
+ s->buf_align = align;
+ break;
+ }
+ }
+ qemu_vfree(buf);
+ }
+
+ if (!bs->request_alignment) {
+ size_t align;
+ buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
+ for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+ if (pread(s->fd, buf, align, 0) >= 0) {
+ bs->request_alignment = align;
+ break;
+ }
+ }
+ qemu_vfree(buf);
+ }
+}
+
static void raw_parse_flags(int bdrv_flags, int *open_flags)
{
assert(open_flags != NULL);
@@ -330,6 +402,8 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
}
#endif
+ raw_probe_alignment(bs);
+
ret = 0;
fail:
qemu_opts_del(opts);
@@ -432,7 +506,6 @@ static int raw_reopen_prepare(BDRVReopenState *state,
return ret;
}
-
static void raw_reopen_commit(BDRVReopenState *state)
{
BDRVRawReopenState *raw_s = state->opaque;
@@ -468,23 +541,11 @@ static void raw_reopen_abort(BDRVReopenState *state)
state->opaque = NULL;
}
-
-/* XXX: use host sector size if necessary with:
-#ifdef DIOCGSECTORSIZE
- {
- unsigned int sectorsize = 512;
- if (!ioctl(fd, DIOCGSECTORSIZE, §orsize) &&
- sectorsize > bufsize)
- bufsize = sectorsize;
- }
-#endif
-#ifdef CONFIG_COCOA
- uint32_t blockSize = 512;
- if ( !ioctl( fd, DKIOCGETBLOCKSIZE, &blockSize ) && blockSize > bufsize) {
- bufsize = blockSize;
- }
-#endif
-*/
+static int raw_opt_mem_align(BlockDriverState *bs)
+{
+ BDRVRawState *s = bs->opaque;
+ return s->buf_align;
+}
static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb)
{
@@ -1227,6 +1288,7 @@ static BlockDriver bdrv_file = {
.bdrv_aio_writev = raw_aio_writev,
.bdrv_aio_flush = raw_aio_flush,
.bdrv_aio_discard = raw_aio_discard,
+ .bdrv_opt_mem_align = raw_opt_mem_align,
.bdrv_truncate = raw_truncate,
.bdrv_getlength = raw_getlength,
@@ -1582,6 +1644,7 @@ static BlockDriver bdrv_host_device = {
.bdrv_aio_writev = raw_aio_writev,
.bdrv_aio_flush = raw_aio_flush,
.bdrv_aio_discard = hdev_aio_discard,
+ .bdrv_opt_mem_align = raw_opt_mem_align,
.bdrv_truncate = raw_truncate,
.bdrv_getlength = raw_getlength,
@@ -1712,6 +1775,7 @@ static BlockDriver bdrv_host_floppy = {
.bdrv_aio_readv = raw_aio_readv,
.bdrv_aio_writev = raw_aio_writev,
.bdrv_aio_flush = raw_aio_flush,
+ .bdrv_opt_mem_align = raw_opt_mem_align,
.bdrv_truncate = raw_truncate,
.bdrv_getlength = raw_getlength,
@@ -1822,6 +1886,7 @@ static BlockDriver bdrv_host_cdrom = {
.bdrv_aio_readv = raw_aio_readv,
.bdrv_aio_writev = raw_aio_writev,
.bdrv_aio_flush = raw_aio_flush,
+ .bdrv_opt_mem_align = raw_opt_mem_align,
.bdrv_truncate = raw_truncate,
.bdrv_getlength = raw_getlength,
@@ -1951,6 +2016,7 @@ static BlockDriver bdrv_host_cdrom = {
.bdrv_aio_readv = raw_aio_readv,
.bdrv_aio_writev = raw_aio_writev,
.bdrv_aio_flush = raw_aio_flush,
+ .bdrv_opt_mem_align = raw_opt_mem_align,
.bdrv_truncate = raw_truncate,
.bdrv_getlength = raw_getlength,
diff --git a/block/raw-win32.c b/block/raw-win32.c
index 2bad5a3..6fe64d2 100644
--- a/block/raw-win32.c
+++ b/block/raw-win32.c
@@ -202,6 +202,35 @@ static int set_sparse(int fd)
NULL, 0, NULL, 0, &returned, NULL);
}
+static void raw_probe_alignment(BlockDriverState *bs)
+{
+ BDRVRawState *s = bs->opaque;
+ DWORD sectorsPerCluster, freeClusters, totalClusters, count;
+ DISK_GEOMETRY_EX dg;
+ BOOL status;
+
+ if (s->type == FTYPE_CD) {
+ bs->request_alignment = 2048;
+ return;
+ }
+ if (s->type == FTYPE_HARDDISK) {
+ status = DeviceIoControl(s->hfile, IOCTL_DISK_GET_DRIVE_GEOMETRY_EX,
+ NULL, 0, &dg, sizeof(dg), &count, NULL);
+ if (status != 0) {
+ bs->request_alignment = dg.Geometry.BytesPerSector;
+ return;
+ }
+ /* try GetDiskFreeSpace too */
+ }
+
+ if (s->drive_path[0]) {
+ GetDiskFreeSpace(s->drive_path, §orsPerCluster,
+ &dg.Geometry.BytesPerSector,
+ &freeClusters, &totalClusters);
+ bs->request_alignment = dg.Geometry.BytesPerSector;
+ }
+}
+
static void raw_parse_flags(int flags, int *access_flags, DWORD *overlapped)
{
assert(access_flags != NULL);
@@ -269,6 +298,17 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
}
}
+ if (filename[0] && filename[1] == ':') {
+ snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", filename[0]);
+ } else if (filename[0] == '\\' && filename[1] == '\\') {
+ s->drive_path[0] = 0;
+ } else {
+ /* Relative path. */
+ char buf[MAX_PATH];
+ GetCurrentDirectory(MAX_PATH, buf);
+ snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", buf[0]);
+ }
+
s->hfile = CreateFile(filename, access_flags,
FILE_SHARE_READ, NULL,
OPEN_EXISTING, overlapped, NULL);
@@ -293,6 +333,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
s->aio = aio;
}
+ raw_probe_alignment(bs);
ret = 0;
fail:
qemu_opts_del(opts);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 760cfc5..31ce7d4 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -293,6 +293,9 @@ struct BlockDriverState {
/* Whether produces zeros when read beyond eof */
bool zero_beyond_eof;
+ /* Alignment requirement for offset/length of I/O requests */
+ unsigned int request_alignment;
+
/* the block size for which the guest device expects atomicity */
int guest_block_size;
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 06/19] block: Introduce bdrv_aligned_preadv()
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (4 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 05/19] raw: Probe required direct I/O alignment Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 07/19] block: Introduce bdrv_co_do_preadv() Kevin Wolf
` (14 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
This separates the part of bdrv_co_do_readv() that needs to happen
before the request is modified to match the backend alignment, and a
part that needs to be executed afterwards and passes the request to the
BlockDriver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 61 +++++++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 43 insertions(+), 18 deletions(-)
diff --git a/block.c b/block.c
index b4b17fd..d1c659c 100644
--- a/block.c
+++ b/block.c
@@ -2638,26 +2638,24 @@ err:
}
/*
- * Handle a read request in coroutine context
+ * Forwards an already correctly aligned request to the BlockDriver. This
+ * handles copy on read and zeroing after EOF; any other features must be
+ * implemented by the caller.
*/
-static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
- BdrvRequestFlags flags)
+static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
+ int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
{
BlockDriver *drv = bs->drv;
BdrvTrackedRequest req;
int ret;
- if (!drv) {
- return -ENOMEDIUM;
- }
- if (bdrv_check_request(bs, sector_num, nb_sectors)) {
- return -EIO;
- }
+ int64_t sector_num = offset >> BDRV_SECTOR_BITS;
+ unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
- if (bs->copy_on_read) {
- flags |= BDRV_REQ_COPY_ON_READ;
- }
+ assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+ assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+ /* Handle Copy on Read and associated serialisation */
if (flags & BDRV_REQ_COPY_ON_READ) {
bs->copy_on_read_in_flight++;
}
@@ -2666,11 +2664,6 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
wait_for_overlapping_requests(bs, sector_num, nb_sectors);
}
- /* throttling disk I/O */
- if (bs->io_limits_enabled) {
- bdrv_io_limits_intercept(bs, nb_sectors, false);
- }
-
tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
if (flags & BDRV_REQ_COPY_ON_READ) {
@@ -2687,6 +2680,7 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
}
}
+ /* Forward the request to the BlockDriver */
if (!(bs->zero_beyond_eof && bs->growable)) {
ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
} else {
@@ -2727,6 +2721,37 @@ out:
return ret;
}
+/*
+ * Handle a read request in coroutine context
+ */
+static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+ BdrvRequestFlags flags)
+{
+ BlockDriver *drv = bs->drv;
+ int ret;
+
+ if (!drv) {
+ return -ENOMEDIUM;
+ }
+ if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+ return -EIO;
+ }
+
+ if (bs->copy_on_read) {
+ flags |= BDRV_REQ_COPY_ON_READ;
+ }
+
+ /* throttling disk I/O */
+ if (bs->io_limits_enabled) {
+ bdrv_io_limits_intercept(bs, nb_sectors, false);
+ }
+
+ ret = bdrv_aligned_preadv(bs, sector_num << BDRV_SECTOR_BITS,
+ nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+ return ret;
+}
+
int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
int nb_sectors, QEMUIOVector *qiov)
{
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 07/19] block: Introduce bdrv_co_do_preadv()
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (5 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 06/19] block: Introduce bdrv_aligned_preadv() Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 08/19] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
` (13 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
Similar to bdrv_pread(), which aligns byte-aligned request to 512 byte
sectors, bdrv_co_do_preadv() takes a byte-aligned request and aligns it
to the alignment specified in bs->request_alignment.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 57 insertions(+), 6 deletions(-)
diff --git a/block.c b/block.c
index d1c659c..5f11935 100644
--- a/block.c
+++ b/block.c
@@ -2724,17 +2724,23 @@ out:
/*
* Handle a read request in coroutine context
*/
-static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
+ int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags)
{
BlockDriver *drv = bs->drv;
+ /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
+ uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
+ uint8_t *head_buf = NULL;
+ uint8_t *tail_buf = NULL;
+ QEMUIOVector local_qiov;
+ bool use_local_qiov = false;
int ret;
if (!drv) {
return -ENOMEDIUM;
}
- if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+ if (bdrv_check_byte_request(bs, offset, bytes)) {
return -EIO;
}
@@ -2744,14 +2750,59 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
/* throttling disk I/O */
if (bs->io_limits_enabled) {
- bdrv_io_limits_intercept(bs, nb_sectors, false);
+ bdrv_io_limits_intercept(bs, bytes >> BDRV_SECTOR_BITS, false);
+ }
+
+ /* Align read if necessary by padding qiov */
+ if (offset & (align - 1)) {
+ head_buf = qemu_blockalign(bs, align);
+ qemu_iovec_init(&local_qiov, qiov->niov + 2);
+ qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
+ qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+ use_local_qiov = true;
+
+ bytes += offset & (align - 1);
+ offset = offset & ~(align - 1);
+ }
+
+ if ((offset + bytes) & (align - 1)) {
+ if (!use_local_qiov) {
+ qemu_iovec_init(&local_qiov, qiov->niov + 1);
+ qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+ use_local_qiov = true;
+ }
+ tail_buf = qemu_blockalign(bs, align);
+ qemu_iovec_add(&local_qiov, tail_buf,
+ align - ((offset + bytes) & (align - 1)));
+
+ bytes = ROUND_UP(bytes, align);
+ }
+
+ ret = bdrv_aligned_preadv(bs, offset, bytes,
+ use_local_qiov ? &local_qiov : qiov,
+ flags);
+
+ if (use_local_qiov) {
+ qemu_iovec_destroy(&local_qiov);
+ qemu_vfree(head_buf);
+ qemu_vfree(tail_buf);
}
- ret = bdrv_aligned_preadv(bs, sector_num << BDRV_SECTOR_BITS,
- nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
return ret;
}
+static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+ BdrvRequestFlags flags)
+{
+ if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
+ return -EINVAL;
+ }
+
+ return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
+ nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+}
+
int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
int nb_sectors, QEMUIOVector *qiov)
{
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 08/19] block: Introduce bdrv_aligned_pwritev()
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (6 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 07/19] block: Introduce bdrv_co_do_preadv() Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 09/19] block: write: Handle COR dependency after I/O throttling Kevin Wolf
` (12 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
This separates the part of bdrv_co_do_writev() that needs to happen
before the request is modified to match the backend alignment, and a
part that needs to be executed afterwards and passes the request to the
BlockDriver.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 62 +++++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 41 insertions(+), 21 deletions(-)
diff --git a/block.c b/block.c
index 5f11935..20ed857 100644
--- a/block.c
+++ b/block.c
@@ -2852,34 +2852,20 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
}
/*
- * Handle a write request in coroutine context
+ * Forwards an already correctly aligned write request to the BlockDriver.
*/
-static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
- BdrvRequestFlags flags)
+static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
+ int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
{
BlockDriver *drv = bs->drv;
BdrvTrackedRequest req;
int ret;
- if (!bs->drv) {
- return -ENOMEDIUM;
- }
- if (bs->read_only) {
- return -EACCES;
- }
- if (bdrv_check_request(bs, sector_num, nb_sectors)) {
- return -EIO;
- }
-
- if (bs->copy_on_read_in_flight) {
- wait_for_overlapping_requests(bs, sector_num, nb_sectors);
- }
+ int64_t sector_num = offset >> BDRV_SECTOR_BITS;
+ unsigned int nb_sectors = bytes >> BDRV_SECTOR_BITS;
- /* throttling disk I/O */
- if (bs->io_limits_enabled) {
- bdrv_io_limits_intercept(bs, nb_sectors, true);
- }
+ assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+ assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
@@ -2913,6 +2899,40 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
return ret;
}
+/*
+ * Handle a write request in coroutine context
+ */
+static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+ BdrvRequestFlags flags)
+{
+ int ret;
+
+ if (!bs->drv) {
+ return -ENOMEDIUM;
+ }
+ if (bs->read_only) {
+ return -EACCES;
+ }
+ if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+ return -EIO;
+ }
+
+ if (bs->copy_on_read_in_flight) {
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+ }
+
+ /* throttling disk I/O */
+ if (bs->io_limits_enabled) {
+ bdrv_io_limits_intercept(bs, nb_sectors, true);
+ }
+
+ ret = bdrv_aligned_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
+ nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+
+ return ret;
+}
+
int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
int nb_sectors, QEMUIOVector *qiov)
{
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 09/19] block: write: Handle COR dependency after I/O throttling
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (7 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 08/19] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 10/19] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
` (11 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
First waiting for all COR requests to complete and calling the
throttling function afterwards means that the request could be delayed
and we still need to wait for the COR request even if it was issued only
after the throttled write request.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/block.c b/block.c
index 20ed857..f8ed623 100644
--- a/block.c
+++ b/block.c
@@ -2867,6 +2867,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+ if (bs->copy_on_read_in_flight) {
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+ }
+
tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
@@ -2918,10 +2922,6 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
return -EIO;
}
- if (bs->copy_on_read_in_flight) {
- wait_for_overlapping_requests(bs, sector_num, nb_sectors);
- }
-
/* throttling disk I/O */
if (bs->io_limits_enabled) {
bdrv_io_limits_intercept(bs, nb_sectors, true);
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 10/19] block: Introduce bdrv_co_do_pwritev()
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (8 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 09/19] block: write: Handle COR dependency after I/O throttling Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 11/19] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
` (10 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
This is going to become the bdrv_co_do_preadv() equivalent for reads. In
this patch, however, just a function taking byte offsets is created, it
doesn't align anything yet.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/block.c b/block.c
index f8ed623..8825984 100644
--- a/block.c
+++ b/block.c
@@ -2906,8 +2906,8 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
/*
* Handle a write request in coroutine context
*/
-static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
+ int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags)
{
int ret;
@@ -2918,21 +2918,32 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
if (bs->read_only) {
return -EACCES;
}
- if (bdrv_check_request(bs, sector_num, nb_sectors)) {
+ if (bdrv_check_byte_request(bs, offset, bytes)) {
return -EIO;
}
/* throttling disk I/O */
if (bs->io_limits_enabled) {
- bdrv_io_limits_intercept(bs, nb_sectors, true);
+ bdrv_io_limits_intercept(bs, bytes << BDRV_SECTOR_BITS, true);
}
- ret = bdrv_aligned_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
- nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+ ret = bdrv_aligned_pwritev(bs, offset, bytes, qiov, flags);
return ret;
}
+static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+ BdrvRequestFlags flags)
+{
+ if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
+ return -EINVAL;
+ }
+
+ return bdrv_co_do_pwritev(bs, sector_num << BDRV_SECTOR_BITS,
+ nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
+}
+
int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
int nb_sectors, QEMUIOVector *qiov)
{
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 11/19] block: Switch BdrvTrackedRequest to byte granularity
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (9 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 10/19] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 12/19] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
` (9 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 52 +++++++++++++++++++++++++++++++----------------
block/backup.c | 7 ++++++-
include/block/block_int.h | 4 ++--
3 files changed, 42 insertions(+), 21 deletions(-)
diff --git a/block.c b/block.c
index 8825984..2f1d72e 100644
--- a/block.c
+++ b/block.c
@@ -2014,13 +2014,13 @@ static void tracked_request_end(BdrvTrackedRequest *req)
*/
static void tracked_request_begin(BdrvTrackedRequest *req,
BlockDriverState *bs,
- int64_t sector_num,
- int nb_sectors, bool is_write)
+ int64_t offset,
+ unsigned int bytes, bool is_write)
{
*req = (BdrvTrackedRequest){
.bs = bs,
- .sector_num = sector_num,
- .nb_sectors = nb_sectors,
+ .offset = offset,
+ .bytes = bytes,
.is_write = is_write,
.co = qemu_coroutine_self(),
};
@@ -2051,25 +2051,43 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
}
}
+static void round_bytes_to_clusters(BlockDriverState *bs,
+ int64_t offset, unsigned int bytes,
+ int64_t *cluster_offset,
+ unsigned int *cluster_bytes)
+{
+ BlockDriverInfo bdi;
+
+ if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
+ *cluster_offset = offset;
+ *cluster_bytes = bytes;
+ } else {
+ *cluster_offset = QEMU_ALIGN_DOWN(offset, bdi.cluster_size);
+ *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes,
+ bdi.cluster_size);
+ }
+}
+
static bool tracked_request_overlaps(BdrvTrackedRequest *req,
- int64_t sector_num, int nb_sectors) {
+ int64_t offset, int bytes)
+{
/* aaaa bbbb */
- if (sector_num >= req->sector_num + req->nb_sectors) {
+ if (offset >= req->offset + req->bytes) {
return false;
}
/* bbbb aaaa */
- if (req->sector_num >= sector_num + nb_sectors) {
+ if (req->offset >= offset + bytes) {
return false;
}
return true;
}
static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors)
+ int64_t offset, unsigned int bytes)
{
BdrvTrackedRequest *req;
- int64_t cluster_sector_num;
- int cluster_nb_sectors;
+ int64_t cluster_offset;
+ unsigned int cluster_bytes;
bool retry;
/* If we touch the same cluster it counts as an overlap. This guarantees
@@ -2078,14 +2096,12 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
* CoR read and write operations are atomic and guest writes cannot
* interleave between them.
*/
- bdrv_round_to_clusters(bs, sector_num, nb_sectors,
- &cluster_sector_num, &cluster_nb_sectors);
+ round_bytes_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
do {
retry = false;
QLIST_FOREACH(req, &bs->tracked_requests, list) {
- if (tracked_request_overlaps(req, cluster_sector_num,
- cluster_nb_sectors)) {
+ if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
/* Hitting this means there was a reentrant request, for
* example, a block driver issuing nested requests. This must
* never happen since it means deadlock.
@@ -2661,10 +2677,10 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
}
if (bs->copy_on_read_in_flight) {
- wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+ wait_for_overlapping_requests(bs, offset, bytes);
}
- tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
+ tracked_request_begin(&req, bs, offset, bytes, false);
if (flags & BDRV_REQ_COPY_ON_READ) {
int pnum;
@@ -2868,10 +2884,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
if (bs->copy_on_read_in_flight) {
- wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+ wait_for_overlapping_requests(bs, offset, bytes);
}
- tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
+ tracked_request_begin(&req, bs, offset, bytes, true);
ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
diff --git a/block/backup.c b/block/backup.c
index cad14c9..1752247 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -180,8 +180,13 @@ static int coroutine_fn backup_before_write_notify(
void *opaque)
{
BdrvTrackedRequest *req = opaque;
+ int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
+ int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
- return backup_do_cow(req->bs, req->sector_num, req->nb_sectors, NULL);
+ assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+ assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+ return backup_do_cow(req->bs, sector_num, nb_sectors, NULL);
}
static void backup_set_speed(BlockJob *job, int64_t speed, Error **errp)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 31ce7d4..ec4f121 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -56,8 +56,8 @@
typedef struct BdrvTrackedRequest {
BlockDriverState *bs;
- int64_t sector_num;
- int nb_sectors;
+ int64_t offset;
+ unsigned int bytes;
bool is_write;
QLIST_ENTRY(BdrvTrackedRequest) list;
Coroutine *co; /* owner, used for deadlock detection */
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 12/19] block: Allow waiting for overlapping requests between begin/end
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (10 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 11/19] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 13/19] block: Make zero-after-EOF work with larger alignment Kevin Wolf
` (8 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
Previously, it was not possible to use wait_for_overlapping_requests()
between tracked_request_begin()/end() because it would wait for itself.
Ignore the current request in the overlap check and run more of the
bdrv_co_do_preadv/pwritev code with a BdrvTrackedRequest present.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/block.c b/block.c
index 2f1d72e..5eaeb3a 100644
--- a/block.c
+++ b/block.c
@@ -2083,7 +2083,7 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
}
static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
- int64_t offset, unsigned int bytes)
+ BdrvTrackedRequest *self, int64_t offset, unsigned int bytes)
{
BdrvTrackedRequest *req;
int64_t cluster_offset;
@@ -2101,6 +2101,9 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
do {
retry = false;
QLIST_FOREACH(req, &bs->tracked_requests, list) {
+ if (req == self) {
+ continue;
+ }
if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
/* Hitting this means there was a reentrant request, for
* example, a block driver issuing nested requests. This must
@@ -2659,10 +2662,10 @@ err:
* implemented by the caller.
*/
static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
- int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
+ BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
+ QEMUIOVector *qiov, int flags)
{
BlockDriver *drv = bs->drv;
- BdrvTrackedRequest req;
int ret;
int64_t sector_num = offset >> BDRV_SECTOR_BITS;
@@ -2677,11 +2680,9 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
}
if (bs->copy_on_read_in_flight) {
- wait_for_overlapping_requests(bs, offset, bytes);
+ wait_for_overlapping_requests(bs, req, offset, bytes);
}
- tracked_request_begin(&req, bs, offset, bytes, false);
-
if (flags & BDRV_REQ_COPY_ON_READ) {
int pnum;
@@ -2728,8 +2729,6 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
}
out:
- tracked_request_end(&req);
-
if (flags & BDRV_REQ_COPY_ON_READ) {
bs->copy_on_read_in_flight--;
}
@@ -2745,6 +2744,8 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
BdrvRequestFlags flags)
{
BlockDriver *drv = bs->drv;
+ BdrvTrackedRequest req;
+
/* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
uint8_t *head_buf = NULL;
@@ -2794,9 +2795,11 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
bytes = ROUND_UP(bytes, align);
}
- ret = bdrv_aligned_preadv(bs, offset, bytes,
+ tracked_request_begin(&req, bs, offset, bytes, false);
+ ret = bdrv_aligned_preadv(bs, &req, offset, bytes,
use_local_qiov ? &local_qiov : qiov,
flags);
+ tracked_request_end(&req);
if (use_local_qiov) {
qemu_iovec_destroy(&local_qiov);
@@ -2871,10 +2874,10 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
* Forwards an already correctly aligned write request to the BlockDriver.
*/
static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
- int64_t offset, unsigned int bytes, QEMUIOVector *qiov, int flags)
+ BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
+ QEMUIOVector *qiov, int flags)
{
BlockDriver *drv = bs->drv;
- BdrvTrackedRequest req;
int ret;
int64_t sector_num = offset >> BDRV_SECTOR_BITS;
@@ -2884,12 +2887,10 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
if (bs->copy_on_read_in_flight) {
- wait_for_overlapping_requests(bs, offset, bytes);
+ wait_for_overlapping_requests(bs, req, offset, bytes);
}
- tracked_request_begin(&req, bs, offset, bytes, true);
-
- ret = notifier_with_return_list_notify(&bs->before_write_notifiers, &req);
+ ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
if (ret < 0) {
/* Do nothing, write notifier decided to fail this request */
@@ -2914,8 +2915,6 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
}
- tracked_request_end(&req);
-
return ret;
}
@@ -2926,6 +2925,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags)
{
+ BdrvTrackedRequest req;
int ret;
if (!bs->drv) {
@@ -2943,7 +2943,9 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
bdrv_io_limits_intercept(bs, bytes << BDRV_SECTOR_BITS, true);
}
- ret = bdrv_aligned_pwritev(bs, offset, bytes, qiov, flags);
+ tracked_request_begin(&req, bs, offset, bytes, true);
+ ret = bdrv_aligned_pwritev(bs, &req, offset, bytes, qiov, flags);
+ tracked_request_end(&req);
return ret;
}
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 13/19] block: Make zero-after-EOF work with larger alignment
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (11 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 12/19] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 14/19] block: Generalise and optimise COR serialisation Kevin Wolf
` (7 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
Odd file sizes could make bdrv_aligned_preadv() shorten the request in
non-aligned ways. Fix it by rounding to the required alignment instead
of 512 bytes.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/block.c b/block.c
index 5eaeb3a..0b1f08a 100644
--- a/block.c
+++ b/block.c
@@ -2663,7 +2663,7 @@ err:
*/
static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
BdrvTrackedRequest *req, int64_t offset, unsigned int bytes,
- QEMUIOVector *qiov, int flags)
+ int64_t align, QEMUIOVector *qiov, int flags)
{
BlockDriver *drv = bs->drv;
int ret;
@@ -2711,7 +2711,7 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
}
total_sectors = DIV_ROUND_UP(len, BDRV_SECTOR_SIZE);
- max_nb_sectors = MAX(0, total_sectors - sector_num);
+ max_nb_sectors = MAX(0, ROUND_UP(total_sectors - sector_num, align));
if (max_nb_sectors > 0) {
ret = drv->bdrv_co_readv(bs, sector_num,
MIN(nb_sectors, max_nb_sectors), qiov);
@@ -2796,7 +2796,7 @@ static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
}
tracked_request_begin(&req, bs, offset, bytes, false);
- ret = bdrv_aligned_preadv(bs, &req, offset, bytes,
+ ret = bdrv_aligned_preadv(bs, &req, offset, bytes, align,
use_local_qiov ? &local_qiov : qiov,
flags);
tracked_request_end(&req);
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 14/19] block: Generalise and optimise COR serialisation
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (12 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 13/19] block: Make zero-after-EOF work with larger alignment Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 15/19] block: Make overlap range for serialisation dynamic Kevin Wolf
` (6 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
Change the API so that specific requests can be marked serialising. Only
these requests are checked for overlaps then.
This means that during a Copy on Read operation, not all requests
overlapping other requests are serialised any more, but only those that
actually overlap with the specific COR request.
Also remove COR from function and variable names because this
functionality can be useful in other contexts.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 40 +++++++++++++++++++++++++---------------
include/block/block_int.h | 5 +++--
2 files changed, 28 insertions(+), 17 deletions(-)
diff --git a/block.c b/block.c
index 0b1f08a..53c19a8 100644
--- a/block.c
+++ b/block.c
@@ -2005,6 +2005,10 @@ int bdrv_commit_all(void)
*/
static void tracked_request_end(BdrvTrackedRequest *req)
{
+ if (req->serialising) {
+ req->bs->serialising_in_flight--;
+ }
+
QLIST_REMOVE(req, list);
qemu_co_queue_restart_all(&req->wait_queue);
}
@@ -2023,6 +2027,7 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
.bytes = bytes,
.is_write = is_write,
.co = qemu_coroutine_self(),
+ .serialising = false,
};
qemu_co_queue_init(&req->wait_queue);
@@ -2030,6 +2035,14 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
}
+static void mark_request_serialising(BdrvTrackedRequest *req)
+{
+ if (!req->serialising) {
+ req->bs->serialising_in_flight++;
+ req->serialising = true;
+ }
+}
+
/**
* Round a region to cluster boundaries
*/
@@ -2082,26 +2095,31 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
return true;
}
-static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
- BdrvTrackedRequest *self, int64_t offset, unsigned int bytes)
+static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
{
+ BlockDriverState *bs = self->bs;
BdrvTrackedRequest *req;
int64_t cluster_offset;
unsigned int cluster_bytes;
bool retry;
+ if (!bs->serialising_in_flight) {
+ return;
+ }
+
/* If we touch the same cluster it counts as an overlap. This guarantees
* that allocating writes will be serialized and not race with each other
* for the same cluster. For example, in copy-on-read it ensures that the
* CoR read and write operations are atomic and guest writes cannot
* interleave between them.
*/
- round_bytes_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
+ round_bytes_to_clusters(bs, self->offset, self->bytes,
+ &cluster_offset, &cluster_bytes);
do {
retry = false;
QLIST_FOREACH(req, &bs->tracked_requests, list) {
- if (req == self) {
+ if (req == self || (!req->serialising && !self->serialising)) {
continue;
}
if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
@@ -2676,12 +2694,10 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
/* Handle Copy on Read and associated serialisation */
if (flags & BDRV_REQ_COPY_ON_READ) {
- bs->copy_on_read_in_flight++;
+ mark_request_serialising(req);
}
- if (bs->copy_on_read_in_flight) {
- wait_for_overlapping_requests(bs, req, offset, bytes);
- }
+ wait_serialising_requests(req);
if (flags & BDRV_REQ_COPY_ON_READ) {
int pnum;
@@ -2729,10 +2745,6 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
}
out:
- if (flags & BDRV_REQ_COPY_ON_READ) {
- bs->copy_on_read_in_flight--;
- }
-
return ret;
}
@@ -2886,9 +2898,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
assert((offset & (BDRV_SECTOR_SIZE - 1)) == 0);
assert((bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
- if (bs->copy_on_read_in_flight) {
- wait_for_overlapping_requests(bs, req, offset, bytes);
- }
+ wait_serialising_requests(req);
ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index ec4f121..6ba058a 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -59,6 +59,7 @@ typedef struct BdrvTrackedRequest {
int64_t offset;
unsigned int bytes;
bool is_write;
+ bool serialising;
QLIST_ENTRY(BdrvTrackedRequest) list;
Coroutine *co; /* owner, used for deadlock detection */
CoQueue wait_queue; /* coroutines blocked on this request */
@@ -273,8 +274,8 @@ struct BlockDriverState {
/* Callback before write request is processed */
NotifierWithReturnList before_write_notifiers;
- /* number of in-flight copy-on-read requests */
- unsigned int copy_on_read_in_flight;
+ /* number of in-flight serialising requests */
+ unsigned int serialising_in_flight;
/* I/O throttling */
ThrottleState throttle_state;
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 15/19] block: Make overlap range for serialisation dynamic
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (13 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 14/19] block: Generalise and optimise COR serialisation Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 16/19] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
` (5 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
Copy on Read wants to serialise with all requests touching the same
cluster, so wait_serialising_requests() rounded to cluster boundaries.
Other users like alignment RMW will have different requirements, though
(requests touching the same sector), so make it dynamic.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 51 +++++++++++++++++++++++------------------------
include/block/block_int.h | 4 ++++
2 files changed, 29 insertions(+), 26 deletions(-)
diff --git a/block.c b/block.c
index 53c19a8..b9fd5c2 100644
--- a/block.c
+++ b/block.c
@@ -2035,12 +2035,19 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
QLIST_INSERT_HEAD(&bs->tracked_requests, req, list);
}
-static void mark_request_serialising(BdrvTrackedRequest *req)
+static void mark_request_serialising(BdrvTrackedRequest *req, size_t align)
{
+ int64_t overlap_offset = req->offset & ~(align - 1);
+ int overlap_bytes = ROUND_UP(req->offset + req->bytes, align)
+ - req->overlap_offset;
+
if (!req->serialising) {
req->bs->serialising_in_flight++;
req->serialising = true;
}
+
+ req->overlap_offset = MIN(req->overlap_offset, overlap_offset);
+ req->overlap_bytes = MAX(req->overlap_bytes, overlap_bytes);
}
/**
@@ -2064,20 +2071,16 @@ void bdrv_round_to_clusters(BlockDriverState *bs,
}
}
-static void round_bytes_to_clusters(BlockDriverState *bs,
- int64_t offset, unsigned int bytes,
- int64_t *cluster_offset,
- unsigned int *cluster_bytes)
+static int bdrv_get_cluster_size(BlockDriverState *bs)
{
BlockDriverInfo bdi;
+ int ret;
- if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
- *cluster_offset = offset;
- *cluster_bytes = bytes;
+ ret = bdrv_get_info(bs, &bdi);
+ if (ret < 0 || bdi.cluster_size == 0) {
+ return bs->request_alignment;
} else {
- *cluster_offset = QEMU_ALIGN_DOWN(offset, bdi.cluster_size);
- *cluster_bytes = QEMU_ALIGN_UP(offset - *cluster_offset + bytes,
- bdi.cluster_size);
+ return bdi.cluster_size;
}
}
@@ -2085,11 +2088,11 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
int64_t offset, int bytes)
{
/* aaaa bbbb */
- if (offset >= req->offset + req->bytes) {
+ if (offset >= req->overlap_offset + req->overlap_bytes) {
return false;
}
/* bbbb aaaa */
- if (req->offset >= offset + bytes) {
+ if (req->overlap_offset >= offset + bytes) {
return false;
}
return true;
@@ -2099,30 +2102,21 @@ static void coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
{
BlockDriverState *bs = self->bs;
BdrvTrackedRequest *req;
- int64_t cluster_offset;
- unsigned int cluster_bytes;
bool retry;
if (!bs->serialising_in_flight) {
return;
}
- /* If we touch the same cluster it counts as an overlap. This guarantees
- * that allocating writes will be serialized and not race with each other
- * for the same cluster. For example, in copy-on-read it ensures that the
- * CoR read and write operations are atomic and guest writes cannot
- * interleave between them.
- */
- round_bytes_to_clusters(bs, self->offset, self->bytes,
- &cluster_offset, &cluster_bytes);
-
do {
retry = false;
QLIST_FOREACH(req, &bs->tracked_requests, list) {
if (req == self || (!req->serialising && !self->serialising)) {
continue;
}
- if (tracked_request_overlaps(req, cluster_offset, cluster_bytes)) {
+ if (tracked_request_overlaps(req, self->overlap_offset,
+ self->overlap_bytes))
+ {
/* Hitting this means there was a reentrant request, for
* example, a block driver issuing nested requests. This must
* never happen since it means deadlock.
@@ -2694,7 +2688,12 @@ static int coroutine_fn bdrv_aligned_preadv(BlockDriverState *bs,
/* Handle Copy on Read and associated serialisation */
if (flags & BDRV_REQ_COPY_ON_READ) {
- mark_request_serialising(req);
+ /* If we touch the same cluster it counts as an overlap. This
+ * guarantees that allocating writes will be serialized and not race
+ * with each other for the same cluster. For example, in copy-on-read
+ * it ensures that the CoR read and write operations are atomic and
+ * guest writes cannot interleave between them. */
+ mark_request_serialising(req, bdrv_get_cluster_size(bs));
}
wait_serialising_requests(req);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 6ba058a..0c1f5fa 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -59,7 +59,11 @@ typedef struct BdrvTrackedRequest {
int64_t offset;
unsigned int bytes;
bool is_write;
+
bool serialising;
+ int64_t overlap_offset;
+ unsigned int overlap_bytes;
+
QLIST_ENTRY(BdrvTrackedRequest) list;
Coroutine *co; /* owner, used for deadlock detection */
CoQueue wait_queue; /* coroutines blocked on this request */
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 16/19] block: Align requests in bdrv_co_do_pwritev()
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (14 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 15/19] block: Make overlap range for serialisation dynamic Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 17/19] block: Change coroutine wrapper to byte granularity Kevin Wolf
` (4 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
This patch changes bdrv_co_do_pwritev() to actually be what its name
promises. If requests aren't properly aligned, it performs a RMW.
Requests touching the same block are serialised against the RMW request.
Further optimisation of this is possible by differentiating types of
requests (concurrent reads should actually be okay here).
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 85 insertions(+), 1 deletion(-)
diff --git a/block.c b/block.c
index b9fd5c2..23ad09d 100644
--- a/block.c
+++ b/block.c
@@ -2935,6 +2935,12 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
BdrvRequestFlags flags)
{
BdrvTrackedRequest req;
+ /* TODO Lift BDRV_SECTOR_SIZE restriction in BlockDriver interface */
+ uint64_t align = MAX(BDRV_SECTOR_SIZE, bs->request_alignment);
+ uint8_t *head_buf = NULL;
+ uint8_t *tail_buf = NULL;
+ QEMUIOVector local_qiov;
+ bool use_local_qiov = false;
int ret;
if (!bs->drv) {
@@ -2952,10 +2958,88 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
bdrv_io_limits_intercept(bs, bytes << BDRV_SECTOR_BITS, true);
}
+ /*
+ * Align write if necessary by performing a read-modify-write cycle.
+ * Pad qiov with the read parts and be sure to have a tracked request not
+ * only for bdrv_aligned_pwritev, but also for the reads of the RMW cycle.
+ */
tracked_request_begin(&req, bs, offset, bytes, true);
- ret = bdrv_aligned_pwritev(bs, &req, offset, bytes, qiov, flags);
+
+ if (offset & (align - 1)) {
+ QEMUIOVector head_qiov;
+ struct iovec head_iov;
+
+ mark_request_serialising(&req, align);
+ wait_serialising_requests(&req);
+
+ head_buf = qemu_blockalign(bs, align);
+ head_iov = (struct iovec) {
+ .iov_base = head_buf,
+ .iov_len = align,
+ };
+ qemu_iovec_init_external(&head_qiov, &head_iov, 1);
+
+ ret = bdrv_aligned_preadv(bs, &req, offset & ~(align - 1), align,
+ align, &head_qiov, 0);
+ if (ret < 0) {
+ goto fail;
+ }
+
+ qemu_iovec_init(&local_qiov, qiov->niov + 2);
+ qemu_iovec_add(&local_qiov, head_buf, offset & (align - 1));
+ qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+ use_local_qiov = true;
+
+ bytes += offset & (align - 1);
+ offset = offset & ~(align - 1);
+ }
+
+ if ((offset + bytes) & (align - 1)) {
+ QEMUIOVector tail_qiov;
+ struct iovec tail_iov;
+ size_t tail_bytes;
+
+ mark_request_serialising(&req, align);
+ wait_serialising_requests(&req);
+
+ tail_buf = qemu_blockalign(bs, align);
+ tail_iov = (struct iovec) {
+ .iov_base = tail_buf,
+ .iov_len = align,
+ };
+ qemu_iovec_init_external(&tail_qiov, &tail_iov, 1);
+
+ ret = bdrv_aligned_preadv(bs, &req, (offset + bytes) & ~(align - 1), align,
+ align, &tail_qiov, 0);
+ if (ret < 0) {
+ goto fail;
+ }
+
+ if (!use_local_qiov) {
+ qemu_iovec_init(&local_qiov, qiov->niov + 1);
+ qemu_iovec_concat(&local_qiov, qiov, 0, qiov->size);
+ use_local_qiov = true;
+ }
+
+ tail_bytes = (offset + bytes) & (align - 1);
+ qemu_iovec_add(&local_qiov, tail_buf + tail_bytes, align - tail_bytes);
+
+ bytes = ROUND_UP(bytes, align);
+ }
+
+ ret = bdrv_aligned_pwritev(bs, &req, offset, bytes,
+ use_local_qiov ? &local_qiov : qiov,
+ flags);
+
+fail:
tracked_request_end(&req);
+ if (use_local_qiov) {
+ qemu_iovec_destroy(&local_qiov);
+ qemu_vfree(head_buf);
+ qemu_vfree(tail_buf);
+ }
+
return ret;
}
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 17/19] block: Change coroutine wrapper to byte granularity
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (15 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 16/19] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 18/19] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
` (3 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 48 ++++++++++++++++++++++++++----------------------
1 file changed, 26 insertions(+), 22 deletions(-)
diff --git a/block.c b/block.c
index 23ad09d..514a5c2 100644
--- a/block.c
+++ b/block.c
@@ -69,11 +69,11 @@ static int coroutine_fn bdrv_co_readv_em(BlockDriverState *bs,
static int coroutine_fn bdrv_co_writev_em(BlockDriverState *bs,
int64_t sector_num, int nb_sectors,
QEMUIOVector *iov);
-static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_preadv(BlockDriverState *bs,
+ int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags);
-static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
+static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
+ int64_t offset, unsigned int bytes, QEMUIOVector *qiov,
BdrvRequestFlags flags);
static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
int64_t sector_num,
@@ -2337,8 +2337,7 @@ static int bdrv_check_request(BlockDriverState *bs, int64_t sector_num,
typedef struct RwCo {
BlockDriverState *bs;
- int64_t sector_num;
- int nb_sectors;
+ int64_t offset;
QEMUIOVector *qiov;
bool is_write;
int ret;
@@ -2350,34 +2349,32 @@ static void coroutine_fn bdrv_rw_co_entry(void *opaque)
RwCo *rwco = opaque;
if (!rwco->is_write) {
- rwco->ret = bdrv_co_do_readv(rwco->bs, rwco->sector_num,
- rwco->nb_sectors, rwco->qiov,
- rwco->flags);
- } else {
- rwco->ret = bdrv_co_do_writev(rwco->bs, rwco->sector_num,
- rwco->nb_sectors, rwco->qiov,
+ rwco->ret = bdrv_co_do_preadv(rwco->bs, rwco->offset,
+ rwco->qiov->size, rwco->qiov,
rwco->flags);
+ } else {
+ rwco->ret = bdrv_co_do_pwritev(rwco->bs, rwco->offset,
+ rwco->qiov->size, rwco->qiov,
+ rwco->flags);
}
}
/*
* Process a vectored synchronous request using coroutines
*/
-static int bdrv_rwv_co(BlockDriverState *bs, int64_t sector_num,
- QEMUIOVector *qiov, bool is_write,
- BdrvRequestFlags flags)
+static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
+ QEMUIOVector *qiov, bool is_write,
+ BdrvRequestFlags flags)
{
Coroutine *co;
RwCo rwco = {
.bs = bs,
- .sector_num = sector_num,
- .nb_sectors = qiov->size >> BDRV_SECTOR_BITS,
+ .offset = offset,
.qiov = qiov,
.is_write = is_write,
.ret = NOT_DONE,
.flags = flags,
};
- assert((qiov->size & (BDRV_SECTOR_SIZE - 1)) == 0);
/**
* In sync call context, when the vcpu is blocked, this throttling timer
@@ -2416,7 +2413,8 @@ static int bdrv_rw_co(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
};
qemu_iovec_init_external(&qiov, &iov, 1);
- return bdrv_rwv_co(bs, sector_num, &qiov, is_write, flags);
+ return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS,
+ &qiov, is_write, flags);
}
/* return < 0 if error. See bdrv_write() for the return codes */
@@ -2454,7 +2452,7 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov)
{
- return bdrv_rwv_co(bs, sector_num, qiov, true, 0);
+ return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS, qiov, true, 0);
}
int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
@@ -4419,9 +4417,15 @@ int bdrv_flush(BlockDriverState *bs)
return rwco.ret;
}
+typedef struct DiscardCo {
+ BlockDriverState *bs;
+ int64_t sector_num;
+ int nb_sectors;
+ int ret;
+} DiscardCo;
static void coroutine_fn bdrv_discard_co_entry(void *opaque)
{
- RwCo *rwco = opaque;
+ DiscardCo *rwco = opaque;
rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors);
}
@@ -4470,7 +4474,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
{
Coroutine *co;
- RwCo rwco = {
+ DiscardCo rwco = {
.bs = bs,
.sector_num = sector_num,
.nb_sectors = nb_sectors,
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 18/19] block: Make bdrv_pread() a bdrv_prwv_co() wrapper
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (16 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 17/19] block: Change coroutine wrapper to byte granularity Kevin Wolf
@ 2013-12-06 17:22 ` Kevin Wolf
2013-12-06 17:23 ` [Qemu-devel] [RFC PATCH 19/19] block: Make bdrv_pwrite() " Kevin Wolf
` (2 subsequent siblings)
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:22 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
Instead of implementing the alignment adjustment here, use the now
existing functionality of bdrv_co_do_preadv().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 49 +++++++++++++------------------------------------
1 file changed, 13 insertions(+), 36 deletions(-)
diff --git a/block.c b/block.c
index 514a5c2..ccb4f4c 100644
--- a/block.c
+++ b/block.c
@@ -2461,49 +2461,26 @@ int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
BDRV_REQ_ZERO_WRITE);
}
-int bdrv_pread(BlockDriverState *bs, int64_t offset,
- void *buf, int count1)
+int bdrv_pread(BlockDriverState *bs, int64_t offset, void *buf, int bytes)
{
- uint8_t tmp_buf[BDRV_SECTOR_SIZE];
- int len, nb_sectors, count;
- int64_t sector_num;
+ QEMUIOVector qiov;
+ struct iovec iov = {
+ .iov_base = (void *)buf,
+ .iov_len = bytes,
+ };
int ret;
- count = count1;
- /* first read to align to sector start */
- len = (BDRV_SECTOR_SIZE - offset) & (BDRV_SECTOR_SIZE - 1);
- if (len > count)
- len = count;
- sector_num = offset >> BDRV_SECTOR_BITS;
- if (len > 0) {
- if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
- return ret;
- memcpy(buf, tmp_buf + (offset & (BDRV_SECTOR_SIZE - 1)), len);
- count -= len;
- if (count == 0)
- return count1;
- sector_num++;
- buf += len;
+ if (bytes < 0) {
+ return -EINVAL;
}
- /* read the sectors "in place" */
- nb_sectors = count >> BDRV_SECTOR_BITS;
- if (nb_sectors > 0) {
- if ((ret = bdrv_read(bs, sector_num, buf, nb_sectors)) < 0)
- return ret;
- sector_num += nb_sectors;
- len = nb_sectors << BDRV_SECTOR_BITS;
- buf += len;
- count -= len;
+ qemu_iovec_init_external(&qiov, &iov, 1);
+ ret = bdrv_prwv_co(bs, offset, &qiov, false, 0);
+ if (ret < 0) {
+ return ret;
}
- /* add data from the last sector */
- if (count > 0) {
- if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
- return ret;
- memcpy(buf, tmp_buf, count);
- }
- return count1;
+ return bytes;
}
int bdrv_pwritev(BlockDriverState *bs, int64_t offset, QEMUIOVector *qiov)
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [Qemu-devel] [RFC PATCH 19/19] block: Make bdrv_pwrite() a bdrv_prwv_co() wrapper
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (17 preceding siblings ...)
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 18/19] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
@ 2013-12-06 17:23 ` Kevin Wolf
2013-12-06 17:55 ` [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Paolo Bonzini
2013-12-09 12:51 ` Stefan Hajnoczi
20 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-06 17:23 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, pbonzini, armbru, stefanha
Instead of implementing the alignment adjustment here, use the now
existing functionality of bdrv_co_do_pwritev().
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 64 +++++++++-------------------------------------------------------
1 file changed, 9 insertions(+), 55 deletions(-)
diff --git a/block.c b/block.c
index ccb4f4c..75ea5ac 100644
--- a/block.c
+++ b/block.c
@@ -2450,11 +2450,6 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
return bdrv_rw_co(bs, sector_num, (uint8_t *)buf, nb_sectors, true, 0);
}
-int bdrv_writev(BlockDriverState *bs, int64_t sector_num, QEMUIOVector *qiov)
-{
- return bdrv_prwv_co(bs, sector_num << BDRV_SECTOR_BITS, qiov, true, 0);
-}
-
int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
{
return bdrv_rw_co(bs, sector_num, NULL, nb_sectors, true,
@@ -2485,70 +2480,29 @@ int bdrv_pread(BlockDriverState *bs, int64_t offset, void *buf, int bytes)
int bdrv_pwritev(BlockDriverState *bs, int64_t offset, QEMUIOVector *qiov)
{
- uint8_t tmp_buf[BDRV_SECTOR_SIZE];
- int len, nb_sectors, count;
- int64_t sector_num;
int ret;
- count = qiov->size;
-
- /* first write to align to sector start */
- len = (BDRV_SECTOR_SIZE - offset) & (BDRV_SECTOR_SIZE - 1);
- if (len > count)
- len = count;
- sector_num = offset >> BDRV_SECTOR_BITS;
- if (len > 0) {
- if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
- return ret;
- qemu_iovec_to_buf(qiov, 0, tmp_buf + (offset & (BDRV_SECTOR_SIZE - 1)),
- len);
- if ((ret = bdrv_write(bs, sector_num, tmp_buf, 1)) < 0)
- return ret;
- count -= len;
- if (count == 0)
- return qiov->size;
- sector_num++;
- }
-
- /* write the sectors "in place" */
- nb_sectors = count >> BDRV_SECTOR_BITS;
- if (nb_sectors > 0) {
- QEMUIOVector qiov_inplace;
-
- qemu_iovec_init(&qiov_inplace, qiov->niov);
- qemu_iovec_concat(&qiov_inplace, qiov, len,
- nb_sectors << BDRV_SECTOR_BITS);
- ret = bdrv_writev(bs, sector_num, &qiov_inplace);
- qemu_iovec_destroy(&qiov_inplace);
- if (ret < 0) {
- return ret;
- }
-
- sector_num += nb_sectors;
- len = nb_sectors << BDRV_SECTOR_BITS;
- count -= len;
+ ret = bdrv_prwv_co(bs, offset, qiov, true, 0);
+ if (ret < 0) {
+ return ret;
}
- /* add data from the last sector */
- if (count > 0) {
- if ((ret = bdrv_read(bs, sector_num, tmp_buf, 1)) < 0)
- return ret;
- qemu_iovec_to_buf(qiov, qiov->size - count, tmp_buf, count);
- if ((ret = bdrv_write(bs, sector_num, tmp_buf, 1)) < 0)
- return ret;
- }
return qiov->size;
}
int bdrv_pwrite(BlockDriverState *bs, int64_t offset,
- const void *buf, int count1)
+ const void *buf, int bytes)
{
QEMUIOVector qiov;
struct iovec iov = {
.iov_base = (void *) buf,
- .iov_len = count1,
+ .iov_len = bytes,
};
+ if (bytes < 0) {
+ return -EINVAL;
+ }
+
qemu_iovec_init_external(&qiov, &iov, 1);
return bdrv_pwritev(bs, offset, &qiov);
}
--
1.8.1.4
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 05/19] raw: Probe required direct I/O alignment
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 05/19] raw: Probe required direct I/O alignment Kevin Wolf
@ 2013-12-06 17:53 ` Paolo Bonzini
2013-12-09 12:58 ` Kevin Wolf
0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2013-12-06 17:53 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-devel, stefanha, armbru
Il 06/12/2013 18:22, Kevin Wolf ha scritto:
> @@ -1227,6 +1288,7 @@ static BlockDriver bdrv_file = {
> .bdrv_aio_writev = raw_aio_writev,
> .bdrv_aio_flush = raw_aio_flush,
> .bdrv_aio_discard = raw_aio_discard,
> + .bdrv_opt_mem_align = raw_opt_mem_align,
>
> .bdrv_truncate = raw_truncate,
> .bdrv_getlength = raw_getlength,
> @@ -1582,6 +1644,7 @@ static BlockDriver bdrv_host_device = {
> .bdrv_aio_writev = raw_aio_writev,
> .bdrv_aio_flush = raw_aio_flush,
> .bdrv_aio_discard = hdev_aio_discard,
> + .bdrv_opt_mem_align = raw_opt_mem_align,
Should this rather be a BlockLimits field?
Paolo
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (18 preceding siblings ...)
2013-12-06 17:23 ` [Qemu-devel] [RFC PATCH 19/19] block: Make bdrv_pwrite() " Kevin Wolf
@ 2013-12-06 17:55 ` Paolo Bonzini
2013-12-09 11:16 ` Kevin Wolf
2013-12-09 12:51 ` Stefan Hajnoczi
20 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2013-12-06 17:55 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-devel, stefanha, armbru
Il 06/12/2013 18:22, Kevin Wolf ha scritto:
> This patch series adds code to the block layer that allows performing
> I/O requests in smaller granularities than required by the host backend
> (most importantly, O_DIRECT restrictions). It achieves this for reads
> by rounding the request to host-side block boundary, and for writes by
> performing a read-modify-write cycle (and serialising requests
> touching the same block so that the RMW doesn't write back stale data).
>
> Originally I intended to reuse a lot of code from Paolo's previous
> patch series, however as I tried to integrate pread/pwrite, which
> already do a very similar thing (except for considering concurrency),
> and because I wanted to implement zero-copy, most of this series ended
> up being new code.
Very nice! There's not much to comment on regarding the 512-on-4k part.
Just one question. Do you think we could make our emulated disks
512b-logical/4k-physical (512E) by default? In other words, I don't
remember the outcome of our discussion on whether the standard promises
a 4k write granularity for this configuration.
Paolo
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 02/19] block: Detect unaligned length in bdrv_qiov_is_aligned()
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 02/19] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
@ 2013-12-06 19:12 ` Eric Blake
0 siblings, 0 replies; 32+ messages in thread
From: Eric Blake @ 2013-12-06 19:12 UTC (permalink / raw)
To: Kevin Wolf, qemu-devel; +Cc: pbonzini, armbru, stefanha
[-- Attachment #1: Type: text/plain, Size: 916 bytes --]
On 12/06/2013 10:22 AM, Kevin Wolf wrote:
> For an O_DIRECT request to succeed, it's not only necessary that all
> base addresses in the qiov are aligned, but also that each lengh in it
s/lengh/length/
> is aligned.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> block.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/block.c b/block.c
> index 382ea71..613201b 100644
> --- a/block.c
> +++ b/block.c
> @@ -4349,6 +4349,9 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
> if ((uintptr_t) qiov->iov[i].iov_base % bs->buffer_alignment) {
> return false;
> }
> + if (qiov->iov[i].iov_len % bs->buffer_alignment) {
> + return false;
> + }
> }
>
> return true;
>
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation
2013-12-06 17:55 ` [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Paolo Bonzini
@ 2013-12-09 11:16 ` Kevin Wolf
0 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-09 11:16 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel, stefanha, armbru
Am 06.12.2013 um 18:55 hat Paolo Bonzini geschrieben:
> Il 06/12/2013 18:22, Kevin Wolf ha scritto:
> > This patch series adds code to the block layer that allows performing
> > I/O requests in smaller granularities than required by the host backend
> > (most importantly, O_DIRECT restrictions). It achieves this for reads
> > by rounding the request to host-side block boundary, and for writes by
> > performing a read-modify-write cycle (and serialising requests
> > touching the same block so that the RMW doesn't write back stale data).
> >
> > Originally I intended to reuse a lot of code from Paolo's previous
> > patch series, however as I tried to integrate pread/pwrite, which
> > already do a very similar thing (except for considering concurrency),
> > and because I wanted to implement zero-copy, most of this series ended
> > up being new code.
>
> Very nice! There's not much to comment on regarding the 512-on-4k part.
Thanks. Then I guess it makes sense to push it before attacking
4k-on-512.
> Just one question. Do you think we could make our emulated disks
> 512b-logical/4k-physical (512E) by default? In other words, I don't
> remember the outcome of our discussion on whether the standard promises
> a 4k write granularity for this configuration.
It probably depends on our answer to the questions I raised for
4k-on-512. In generic, I think it's a good idea, but if we come to the
conclusion that it isn't safe, we can't do it, obviously.
As for the standard, which "the" standard? I guess virtio-blk doesn't
say anything on the matter, leaves us with ATA and SBC. I haven't
checked either yet, but I'll see if I can find something.
Kevin
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
` (19 preceding siblings ...)
2013-12-06 17:55 ` [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Paolo Bonzini
@ 2013-12-09 12:51 ` Stefan Hajnoczi
2013-12-09 13:02 ` Kevin Wolf
20 siblings, 1 reply; 32+ messages in thread
From: Stefan Hajnoczi @ 2013-12-09 12:51 UTC (permalink / raw)
To: Kevin Wolf; +Cc: pbonzini, qemu-devel, stefanha, armbru
On Fri, Dec 06, 2013 at 06:22:41PM +0100, Kevin Wolf wrote:
> This series does not cover 4k guests on a 512 byte host, and I'm not
> sure yet what to do with this case. Paolos series contained a patch to
> protect against "torn reads" (i.e. reads running in parallel with
> writes, which return old data for one half of a sector and new data for
> the other half) by serialising requests if the guest block size was
> greater than the host block size.
>
> One problem with this approach is that it assumes that a single host
> block size even exists and can be compared against on the top level.
> Different backing files can be stored on different storage, though, with
> different block sizes.
As long as the backing file is read-only you won't get torn reads.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 05/19] raw: Probe required direct I/O alignment
2013-12-06 17:53 ` Paolo Bonzini
@ 2013-12-09 12:58 ` Kevin Wolf
2013-12-09 13:40 ` Paolo Bonzini
0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2013-12-09 12:58 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: pl, qemu-devel, stefanha, armbru
Am 06.12.2013 um 18:53 hat Paolo Bonzini geschrieben:
> Il 06/12/2013 18:22, Kevin Wolf ha scritto:
> > @@ -1227,6 +1288,7 @@ static BlockDriver bdrv_file = {
> > .bdrv_aio_writev = raw_aio_writev,
> > .bdrv_aio_flush = raw_aio_flush,
> > .bdrv_aio_discard = raw_aio_discard,
> > + .bdrv_opt_mem_align = raw_opt_mem_align,
> >
> > .bdrv_truncate = raw_truncate,
> > .bdrv_getlength = raw_getlength,
> > @@ -1582,6 +1644,7 @@ static BlockDriver bdrv_host_device = {
> > .bdrv_aio_writev = raw_aio_writev,
> > .bdrv_aio_flush = raw_aio_flush,
> > .bdrv_aio_discard = hdev_aio_discard,
> > + .bdrv_opt_mem_align = raw_opt_mem_align,
>
> Should this rather be a BlockLimits field?
How is BlockLimits supposed with respect to inheritance of values
through the BDS tree? I tried looking at the code, but for example
bl.opt_transfer_length is only forwarded in raw, so for any other format
(or if you ever put a filter there) it simply doesn't work.
I could initialise a new BlockLimits.opt_mem_align field in
bdrv_open_common() with the value of bs->file->bl.opt_mem_align, and in
bdrv_open_backing_file() change it to MAX(bs->bl.opt_mem_align,
bs->backing_hd->bl.opt_mem_align). The block driver could then in
bdrv_open() override the former, but never the latter.
What would happen on bdrv_reopen(), specifically toggling O_DIRECT? The
values would have to change then.
Kevin
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation
2013-12-09 12:51 ` Stefan Hajnoczi
@ 2013-12-09 13:02 ` Kevin Wolf
0 siblings, 0 replies; 32+ messages in thread
From: Kevin Wolf @ 2013-12-09 13:02 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: pbonzini, qemu-devel, stefanha, armbru
Am 09.12.2013 um 13:51 hat Stefan Hajnoczi geschrieben:
> On Fri, Dec 06, 2013 at 06:22:41PM +0100, Kevin Wolf wrote:
> > This series does not cover 4k guests on a 512 byte host, and I'm not
> > sure yet what to do with this case. Paolos series contained a patch to
> > protect against "torn reads" (i.e. reads running in parallel with
> > writes, which return old data for one half of a sector and new data for
> > the other half) by serialising requests if the guest block size was
> > greater than the host block size.
> >
> > One problem with this approach is that it assumes that a single host
> > block size even exists and can be compared against on the top level.
> > Different backing files can be stored on different storage, though, with
> > different block sizes.
>
> As long as the backing file is read-only you won't get torn reads.
Right. Even with image fleecing you might get away with it, because you
don't have an expected alignment on the NBD client (or do you?)
But how about VMDK extents or Quorum backends, then?
Kevin
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 05/19] raw: Probe required direct I/O alignment
2013-12-09 12:58 ` Kevin Wolf
@ 2013-12-09 13:40 ` Paolo Bonzini
0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2013-12-09 13:40 UTC (permalink / raw)
To: Kevin Wolf; +Cc: pl, qemu-devel, stefanha, armbru
Il 09/12/2013 13:58, Kevin Wolf ha scritto:
> How is BlockLimits supposed with respect to inheritance of values
> through the BDS tree?
I think right now the accessor wrappers for the various BlockLimits
field sort that out. Of course this could be slow if the accessors end
up in the fast path (but that's not different from what your series does
already).
> I tried looking at the code, but for example
> bl.opt_transfer_length is only forwarded in raw, so for any other format
> (or if you ever put a filter there) it simply doesn't work.
That's correct.
For example, for qcow2 the optimal transfer length could be the cluster
size. Without benchmarking, I didn't complain about Peter's choice of
leaving it zero.
> I could initialise a new BlockLimits.opt_mem_align field in
> bdrv_open_common() with the value of bs->file->bl.opt_mem_align, and in
> bdrv_open_backing_file() change it to MAX(bs->bl.opt_mem_align,
> bs->backing_hd->bl.opt_mem_align). The block driver could then in
> bdrv_open() override the former, but never the latter.
>
> What would happen on bdrv_reopen(), specifically toggling O_DIRECT? The
> values would have to change then.
Yes. This also goes in favor of making wrappers handle the stacking of
limits, at least for now.
Paolo
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 03/19] block: Don't use guest sector size for qemu_blockalign()
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 03/19] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
@ 2013-12-10 3:18 ` Wenchao Xia
2013-12-10 9:42 ` Kevin Wolf
0 siblings, 1 reply; 32+ messages in thread
From: Wenchao Xia @ 2013-12-10 3:18 UTC (permalink / raw)
To: Kevin Wolf, qemu-devel; +Cc: pbonzini, armbru, stefanha
于 2013/12/7 1:22, Kevin Wolf 写道:
> bs->buffer_alignment is set by the device emulation and contains the
> logical block size of the guest device. This isn't something that the
> block layer should know, and even less something to use for determining
> the right alignment of buffers to be used for the host.
>
> The new function bdrv_opt_mem_align() allows for hooks in a BlockDriver
> so that it can tell the qemu block layer the optimal alignment to be
> used so that no bounce buffer must be used in the driver.
>
> This patch may change the buffer alignment from 4k to 512 for all
> callers that used qemu_blockalign() with the top-level image format
> BlockDriverState. The value was never propagated to other levels in the
> tree, so in particular raw-posix never required anything else than 512.
>
> While on disks with 4k sectors direct I/O requires a 4k alignment,
> memory may still be okay when aligned to 512 byte boundaries. This is
> what must have happened in practice, because otherwise this would
> already have failed earlier. Therefore I don't expect regressions even
> with this intermediate state. Later, raw-posix can implement the hook
> and expose a different memory alignment requirement.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> block.c | 32 +++++++++++++++++++++++++++++---
> include/block/block.h | 1 +
> include/block/block_int.h | 4 ++++
> 3 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/block.c b/block.c
> index 613201b..669793b 100644
> --- a/block.c
> +++ b/block.c
> @@ -213,6 +213,31 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
> qemu_co_queue_next(&bs->throttled_reqs[is_write]);
> }
>
> +size_t bdrv_opt_mem_align(BlockDriverState *bs)
> +{
> + size_t alignment;
> +
> + if (!bs || !bs->drv) {
> + /* 4k should be on the safe side */
> + return 4096;
> + }
> +
> + if (bs->drv->bdrv_opt_mem_align) {
> + return bs->drv->bdrv_opt_mem_align(bs);
> + }
> +
> + if (bs->file) {
> + alignment = bdrv_opt_mem_align(bs->file);
> + } else {
> + alignment = 512;
> + }
> +
> + if (bs->backing_hd) {
> + alignment = MAX(alignment, bdrv_opt_mem_align(bs->backing_hd));
> + }
Maybe I didn't understand the commit message correctly, does this code
intend to get MAX alignment value in a chain? For example:
base(4096)->mid(512)->top(1024) results: 4096
The condition to traver the backing files seems complex.
> + return alignment;
> +}
> +
> /* check if the path starts with "<protocol>:" */
> static int path_has_protocol(const char *path)
> {
> @@ -4335,7 +4360,7 @@ void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
>
> void *qemu_blockalign(BlockDriverState *bs, size_t size)
> {
> - return qemu_memalign((bs && bs->buffer_alignment) ? bs->buffer_alignment : 512, size);
> + return qemu_memalign(bdrv_opt_mem_align(bs), size);
> }
>
> /*
> @@ -4344,12 +4369,13 @@ void *qemu_blockalign(BlockDriverState *bs, size_t size)
> bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
> {
> int i;
> + size_t alignment = bdrv_opt_mem_align(bs);
>
> for (i = 0; i < qiov->niov; i++) {
> - if ((uintptr_t) qiov->iov[i].iov_base % bs->buffer_alignment) {
> + if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
> return false;
> }
> - if (qiov->iov[i].iov_len % bs->buffer_alignment) {
> + if (qiov->iov[i].iov_len % alignment) {
> return false;
> }
> }
> diff --git a/include/block/block.h b/include/block/block.h
> index 3560deb..d262c0e 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -383,6 +383,7 @@ void bdrv_img_create(const char *filename, const char *fmt,
> char *options, uint64_t img_size, int flags,
> Error **errp, bool quiet);
>
> +size_t bdrv_opt_mem_align(BlockDriverState *bs);
> void bdrv_set_buffer_alignment(BlockDriverState *bs, int align);
> void *qemu_blockalign(BlockDriverState *bs, size_t size);
> bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 1666066..6a84f83 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -119,6 +119,10 @@ struct BlockDriver {
> int64_t sector_num, int nb_sectors,
> BlockDriverCompletionFunc *cb, void *opaque);
>
> + /* Returns the alignment in bytes that is required so that no bounce buffer
> + * is required throughout the stack */
> + int (*bdrv_opt_mem_align)(BlockDriverState *bs);
> +
> int coroutine_fn (*bdrv_co_readv)(BlockDriverState *bs,
> int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
> int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 04/19] block: rename buffer_alignment to guest_block_size
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 04/19] block: rename buffer_alignment to guest_block_size Kevin Wolf
@ 2013-12-10 3:25 ` Wenchao Xia
0 siblings, 0 replies; 32+ messages in thread
From: Wenchao Xia @ 2013-12-10 3:25 UTC (permalink / raw)
To: Kevin Wolf, qemu-devel; +Cc: pbonzini, armbru, stefanha
Cool to have a separation of emulator/block concepts.
Reviewed-by: Wenchao Xia <xiawenc@linux.vnet.ibm.com>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 03/19] block: Don't use guest sector size for qemu_blockalign()
2013-12-10 3:18 ` Wenchao Xia
@ 2013-12-10 9:42 ` Kevin Wolf
2013-12-11 2:43 ` Wenchao Xia
0 siblings, 1 reply; 32+ messages in thread
From: Kevin Wolf @ 2013-12-10 9:42 UTC (permalink / raw)
To: Wenchao Xia; +Cc: pbonzini, qemu-devel, stefanha, armbru
Am 10.12.2013 um 04:18 hat Wenchao Xia geschrieben:
> 于 2013/12/7 1:22, Kevin Wolf 写道:
> > bs->buffer_alignment is set by the device emulation and contains the
> > logical block size of the guest device. This isn't something that the
> > block layer should know, and even less something to use for determining
> > the right alignment of buffers to be used for the host.
> >
> > The new function bdrv_opt_mem_align() allows for hooks in a BlockDriver
> > so that it can tell the qemu block layer the optimal alignment to be
> > used so that no bounce buffer must be used in the driver.
> >
> > This patch may change the buffer alignment from 4k to 512 for all
> > callers that used qemu_blockalign() with the top-level image format
> > BlockDriverState. The value was never propagated to other levels in the
> > tree, so in particular raw-posix never required anything else than 512.
> >
> > While on disks with 4k sectors direct I/O requires a 4k alignment,
> > memory may still be okay when aligned to 512 byte boundaries. This is
> > what must have happened in practice, because otherwise this would
> > already have failed earlier. Therefore I don't expect regressions even
> > with this intermediate state. Later, raw-posix can implement the hook
> > and expose a different memory alignment requirement.
> >
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> > block.c | 32 +++++++++++++++++++++++++++++---
> > include/block/block.h | 1 +
> > include/block/block_int.h | 4 ++++
> > 3 files changed, 34 insertions(+), 3 deletions(-)
> >
> > diff --git a/block.c b/block.c
> > index 613201b..669793b 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -213,6 +213,31 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
> > qemu_co_queue_next(&bs->throttled_reqs[is_write]);
> > }
> >
> > +size_t bdrv_opt_mem_align(BlockDriverState *bs)
> > +{
> > + size_t alignment;
> > +
> > + if (!bs || !bs->drv) {
> > + /* 4k should be on the safe side */
> > + return 4096;
> > + }
> > +
> > + if (bs->drv->bdrv_opt_mem_align) {
> > + return bs->drv->bdrv_opt_mem_align(bs);
> > + }
> > +
> > + if (bs->file) {
> > + alignment = bdrv_opt_mem_align(bs->file);
> > + } else {
> > + alignment = 512;
> > + }
> > +
> > + if (bs->backing_hd) {
> > + alignment = MAX(alignment, bdrv_opt_mem_align(bs->backing_hd));
> > + }
>
> Maybe I didn't understand the commit message correctly, does this code
> intend to get MAX alignment value in a chain? For example:
> base(4096)->mid(512)->top(1024) results: 4096
> The condition to traver the backing files seems complex.
Yes, you want to align any newly allocated buffer such that no matter
what direction it takes on its way through the block layer, it will
never be misaligned. This means that you need to use the highest
alignment restriction.
Kevin
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 03/19] block: Don't use guest sector size for qemu_blockalign()
2013-12-10 9:42 ` Kevin Wolf
@ 2013-12-11 2:43 ` Wenchao Xia
0 siblings, 0 replies; 32+ messages in thread
From: Wenchao Xia @ 2013-12-11 2:43 UTC (permalink / raw)
To: Kevin Wolf; +Cc: pbonzini, qemu-devel, stefanha, armbru
于 2013/12/10 17:42, Kevin Wolf 写道:
> Am 10.12.2013 um 04:18 hat Wenchao Xia geschrieben:
>> 于 2013/12/7 1:22, Kevin Wolf 写道:
>>> bs->buffer_alignment is set by the device emulation and contains the
>>> logical block size of the guest device. This isn't something that the
>>> block layer should know, and even less something to use for determining
>>> the right alignment of buffers to be used for the host.
>>>
>>> The new function bdrv_opt_mem_align() allows for hooks in a BlockDriver
>>> so that it can tell the qemu block layer the optimal alignment to be
>>> used so that no bounce buffer must be used in the driver.
>>>
>>> This patch may change the buffer alignment from 4k to 512 for all
>>> callers that used qemu_blockalign() with the top-level image format
>>> BlockDriverState. The value was never propagated to other levels in the
>>> tree, so in particular raw-posix never required anything else than 512.
>>>
>>> While on disks with 4k sectors direct I/O requires a 4k alignment,
>>> memory may still be okay when aligned to 512 byte boundaries. This is
>>> what must have happened in practice, because otherwise this would
>>> already have failed earlier. Therefore I don't expect regressions even
>>> with this intermediate state. Later, raw-posix can implement the hook
>>> and expose a different memory alignment requirement.
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> ---
>>> block.c | 32 +++++++++++++++++++++++++++++---
>>> include/block/block.h | 1 +
>>> include/block/block_int.h | 4 ++++
>>> 3 files changed, 34 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index 613201b..669793b 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -213,6 +213,31 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
>>> qemu_co_queue_next(&bs->throttled_reqs[is_write]);
>>> }
>>>
>>> +size_t bdrv_opt_mem_align(BlockDriverState *bs)
>>> +{
>>> + size_t alignment;
>>> +
>>> + if (!bs || !bs->drv) {
>>> + /* 4k should be on the safe side */
>>> + return 4096;
>>> + }
>>> +
>>> + if (bs->drv->bdrv_opt_mem_align) {
>>> + return bs->drv->bdrv_opt_mem_align(bs);
>>> + }
>>> +
>>> + if (bs->file) {
>>> + alignment = bdrv_opt_mem_align(bs->file);
>>> + } else {
>>> + alignment = 512;
>>> + }
>>> +
>>> + if (bs->backing_hd) {
>>> + alignment = MAX(alignment, bdrv_opt_mem_align(bs->backing_hd));
>>> + }
>>
>> Maybe I didn't understand the commit message correctly, does this code
>> intend to get MAX alignment value in a chain? For example:
>> base(4096)->mid(512)->top(1024) results: 4096
>> The condition to traver the backing files seems complex.
>
> Yes, you want to align any newly allocated buffer such that no matter
> what direction it takes on its way through the block layer, it will
> never be misaligned. This means that you need to use the highest
> alignment restriction.
>
> Kevin
>
The condition to travel and compare, seems complex, since when
"return" is executed, then travers would stop. for example:
base(4096)->mid(512)->top(1024), when top's
bs->drv->bdrv_opt_mem_align != NULL, the result is 1024.
So I wonder whether this is on purpose, or the code should be:
size_t bdrv_opt_mem_align(BlockDriverState *bs)
{
size_t alignment;
if (!bs || !bs->drv) {
/* 4k should be on the safe side */
alignment = 4096;
goto compare;
}
if (bs->drv->bdrv_opt_mem_align) {
alignment = bs->drv->bdrv_opt_mem_align(bs);
goto compare;
}
if (bs->file) {
alignment = bdrv_opt_mem_align(bs->file);
} else {
alignment = 512;
}
compare:
if (bs->backing_hd) {
alignment = MAX(alignment, bdrv_opt_mem_align(bs->backing_hd));
}
return alignment;
}
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2013-12-11 2:43 UTC | newest]
Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-06 17:22 [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 01/19] qemu_memalign: Allow small alignments Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 02/19] block: Detect unaligned length in bdrv_qiov_is_aligned() Kevin Wolf
2013-12-06 19:12 ` Eric Blake
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 03/19] block: Don't use guest sector size for qemu_blockalign() Kevin Wolf
2013-12-10 3:18 ` Wenchao Xia
2013-12-10 9:42 ` Kevin Wolf
2013-12-11 2:43 ` Wenchao Xia
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 04/19] block: rename buffer_alignment to guest_block_size Kevin Wolf
2013-12-10 3:25 ` Wenchao Xia
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 05/19] raw: Probe required direct I/O alignment Kevin Wolf
2013-12-06 17:53 ` Paolo Bonzini
2013-12-09 12:58 ` Kevin Wolf
2013-12-09 13:40 ` Paolo Bonzini
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 06/19] block: Introduce bdrv_aligned_preadv() Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 07/19] block: Introduce bdrv_co_do_preadv() Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 08/19] block: Introduce bdrv_aligned_pwritev() Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 09/19] block: write: Handle COR dependency after I/O throttling Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 10/19] block: Introduce bdrv_co_do_pwritev() Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 11/19] block: Switch BdrvTrackedRequest to byte granularity Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 12/19] block: Allow waiting for overlapping requests between begin/end Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 13/19] block: Make zero-after-EOF work with larger alignment Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 14/19] block: Generalise and optimise COR serialisation Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 15/19] block: Make overlap range for serialisation dynamic Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 16/19] block: Align requests in bdrv_co_do_pwritev() Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 17/19] block: Change coroutine wrapper to byte granularity Kevin Wolf
2013-12-06 17:22 ` [Qemu-devel] [RFC PATCH 18/19] block: Make bdrv_pread() a bdrv_prwv_co() wrapper Kevin Wolf
2013-12-06 17:23 ` [Qemu-devel] [RFC PATCH 19/19] block: Make bdrv_pwrite() " Kevin Wolf
2013-12-06 17:55 ` [Qemu-devel] [RFC PATCH 00/19] block: Support for 512b-on-4k emulation Paolo Bonzini
2013-12-09 11:16 ` Kevin Wolf
2013-12-09 12:51 ` Stefan Hajnoczi
2013-12-09 13:02 ` Kevin Wolf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).