* [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes
@ 2011-12-13 12:37 Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 01/17] block: do not rely on open_flags for bdrv_is_snapshot Paolo Bonzini
` (17 more replies)
0 siblings, 18 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
Running with mismatched host and guest logical block sizes is going
to become more important as 4k-sector disks become more widespread.
This is because we need a 512 byte disk to boot from.
Mismatched block sizes have two problems:
1) with cache=none or with non-raw protocols, you just cannot do 512-byte
granularity output. You need to do read-modify-write cycles like "hybrid"
512b-logical/4k-physical disks do. (Note that actually only the iSCSI
protocol supports 4k logical blocks).
2) when host block size < guest block size, guests issue 4k-aligned
I/O and expect it to be atomic. This problem cannot really be solved
completely, because power or I/O failures could leave a partially-written
block ("torn page"). However, at least you can serialize reads against
overlapping writes, which guarantees correctness as long as shutdown is
clean and there are no I/O errors.
Read-modify-write cycles are of course slower, and need to serialize
writes which makes the situation even worse. However, the performance
impact of emulating 512-byte sectors is within noise when partitions are
aligned. File system blocks are usually 4k or bigger, and OSes tend
to use 4k-aligned buffers. So when partitions are aligned no misaligned
I/O is sent and no bounce buffer is necessary either.
The situation is much different if partitions are misaligned or if the
guest is using O_DIRECT with a 512-byte aligned buffer. I benchmarked
only the former using iozone on a RHEL6 guest (2GB memory, 20GB ext4
partition with the whole 4k-sector disk assigned to the guest). Graphs
aren't really pretty, but two points are more or less discernible (also
more or less obvious):
- writes incur a larger overhead than reads by 5-10%;
- for larger file sizes the penalty is smaller, probably because
the I/O scheduler can work better (with almost no penalty for reads);
for smaller file sizes, up to 1M or even more for some scenarios,
misalignment worsened performance by 10-25%.
The series is structured as follows.
Patches 1 to 6 clean up the handling of flag bits, so that non-raw
protocols can always request read-modify-write operation (even when
cache != none).
Patches 7 to 11 distinguish host and guest block sizes in the
BlockDriverState.
Patches 12 to 15 reuse the request tracking mechanism to implement
RMW and to avoid torn pages.
Patch 16 passes down the host block size as physical block size so
that hopefully guest OSes try to align partitions.
Patch 17 adds an option to qemu-io that lets you test these scenarios
even without a 4k-sector disk.
Paolo Bonzini (17):
block: do not rely on open_flags for bdrv_is_snapshot
block: store actual flags in bs->open_flags
block: pass protocol flags up to the format
block: non-raw protocols never cache
block: remove enable_write_cache
block: move flag bits together
raw: remove the aligned_buf
block: rename buffer_alignment to guest_block_size
block: add host_block_size
raw: probe host_block_size
iscsi: save host block size
block: allow waiting only for overlapping writes
block: allow waiting at arbitrary granularity
block: protect against "torn reads" for guest_block_size > host_block_size
block: align and serialize I/O when guest_block_size < host_block_size
block: default physical block size to host block size
qemu-io: add blocksize argument to open
Makefile.objs | 4 +-
block.c | 313 ++++++++++++++++++++++++++++++++++++++++++++++-------
block.h | 17 +---
block/curl.c | 1 +
block/iscsi.c | 2 +
block/nbd.c | 1 +
block/raw-posix.c | 97 ++++++++++-------
block/raw-win32.c | 42 +++++++
block/rbd.c | 1 +
block/sheepdog.c | 1 +
block/vdi.c | 1 +
block_int.h | 25 ++---
hw/ide/core.c | 2 +-
hw/scsi-disk.c | 2 +-
hw/scsi-generic.c | 2 +-
hw/virtio-blk.c | 2 +-
qemu-io.c | 33 +++++-
trace-events | 1 +
18 files changed, 429 insertions(+), 118 deletions(-)
--
1.7.7.1
^ permalink raw reply [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 01/17] block: do not rely on open_flags for bdrv_is_snapshot
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 02/17] block: store actual flags in bs->open_flags Paolo Bonzini
` (16 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
The BDRV_O_SNAPSHOT flag is stored in open_flags but not passed to
bdrv_open. This makes the usage of bs->open_flags wrong in
bdrv_snapshot_goto.
(Instead, bdrv_commit uses the backing file's open_flags and those
flags never include any of BDRV_O_SNAPSHOT, BDRV_O_NO_BACKING
or BDRV_O_RDWR).
We will fix the open_flags soon. In the meanwhile, do not rely
on open_flags including BDRV_O_SNAPSHOT.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/block.c b/block.c
index 56c7981..a96f080 100644
--- a/block.c
+++ b/block.c
@@ -2463,7 +2463,7 @@ int bdrv_can_snapshot(BlockDriverState *bs)
int bdrv_is_snapshot(BlockDriverState *bs)
{
- return !!(bs->open_flags & BDRV_O_SNAPSHOT);
+ return bs->is_temporary;
}
BlockDriverState *bdrv_snapshots(void)
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 02/17] block: store actual flags in bs->open_flags
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 01/17] block: do not rely on open_flags for bdrv_is_snapshot Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 03/17] block: pass protocol flags up to the format Paolo Bonzini
` (15 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
The passed flags are changed slightly before passing them to bdrv_open.
Store the same flags in bs->open_flags, so that they are used correctly
in bdrv_snapshot_goto. In addition, this way we will be able to query
them and get back consistent values.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 30 +++++++++++++++---------------
1 files changed, 15 insertions(+), 15 deletions(-)
diff --git a/block.c b/block.c
index a96f080..fa11e3a 100644
--- a/block.c
+++ b/block.c
@@ -566,12 +566,26 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
trace_bdrv_open_common(bs, filename, flags, drv->format_name);
+ /*
+ * Clear flags that are internal to the block layer before opening the
+ * image.
+ */
+ open_flags = flags & ~(BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING);
+
+ /*
+ * Snapshots should be writable.
+ */
+ if (bs->is_temporary) {
+ open_flags |= BDRV_O_RDWR;
+ }
+
+
bs->file = NULL;
bs->total_sectors = 0;
bs->encrypted = 0;
bs->valid_key = 0;
bs->sg = 0;
- bs->open_flags = flags;
+ bs->open_flags = open_flags;
bs->growable = 0;
bs->buffer_alignment = 512;
@@ -591,20 +605,6 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
bs->opaque = g_malloc0(drv->instance_size);
bs->enable_write_cache = !!(flags & BDRV_O_CACHE_WB);
-
- /*
- * Clear flags that are internal to the block layer before opening the
- * image.
- */
- open_flags = flags & ~(BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING);
-
- /*
- * Snapshots should be writable.
- */
- if (bs->is_temporary) {
- open_flags |= BDRV_O_RDWR;
- }
-
bs->keep_read_only = bs->read_only = !(open_flags & BDRV_O_RDWR);
/* Open the image, either directly or using a protocol */
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 03/17] block: pass protocol flags up to the format
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 01/17] block: do not rely on open_flags for bdrv_is_snapshot Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 02/17] block: store actual flags in bs->open_flags Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-15 4:10 ` Zhi Yong Wu
2011-12-13 12:37 ` [Qemu-devel] [PATCH 04/17] block: non-raw protocols never cache Paolo Bonzini
` (14 subsequent siblings)
17 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
In the next patches, the protocols will modify bs->open_flags to signify
that they cannot support the exact requested feature set. Pass the
modified flags to the format.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/block.c b/block.c
index fa11e3a..6734e66 100644
--- a/block.c
+++ b/block.c
@@ -612,8 +612,9 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
ret = drv->bdrv_file_open(bs, filename, open_flags);
} else {
ret = bdrv_file_open(&bs->file, filename, open_flags);
+ bs->open_flags = bs->file->open_flags;
if (ret >= 0) {
- ret = drv->bdrv_open(bs, open_flags);
+ ret = drv->bdrv_open(bs, bs->file->open_flags);
}
}
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 04/17] block: non-raw protocols never cache
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (2 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 03/17] block: pass protocol flags up to the format Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 05/17] block: remove enable_write_cache Paolo Bonzini
` (13 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
Non-raw protocols never cache their data. Make this visible by setting
BDRV_O_NOCACHE in the open_flags. It will be used to handle block sizes
smaller than the backend's block size.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block/curl.c | 1 +
block/iscsi.c | 1 +
block/nbd.c | 1 +
block/rbd.c | 1 +
block/sheepdog.c | 1 +
block/vdi.c | 1 +
6 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/block/curl.c b/block/curl.c
index e9102e3..e7243dd 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -409,6 +409,7 @@ static int curl_open(BlockDriverState *bs, const char *filename, int flags)
curl_multi_setopt( s->multi, CURLMOPT_SOCKETFUNCTION, curl_sock_cb );
curl_multi_do(s);
+ bs->open_flags |= BDRV_O_NOCACHE;
return 0;
out:
diff --git a/block/iscsi.c b/block/iscsi.c
index 938c568..4e6cf7a 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -474,6 +474,7 @@ static int iscsi_open(BlockDriverState *bs, const char *filename, int flags)
return -EINVAL;
}
+ bs->open_flags |= BDRV_O_NOCACHE;
memset(iscsilun, 0, sizeof(IscsiLun));
/* Should really append the KVM name after the ':' here */
diff --git a/block/nbd.c b/block/nbd.c
index 882b2dc..d944ee7 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -176,6 +176,7 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
*/
result = nbd_establish_connection(bs);
+ bs->open_flags |= BDRV_O_NOCACHE;
qemu_co_mutex_init(&s->lock);
return result;
}
diff --git a/block/rbd.c b/block/rbd.c
index 312584a..d20955f 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -492,6 +492,7 @@ static int qemu_rbd_open(BlockDriverState *bs, const char *filename, int flags)
}
bs->read_only = (s->snap != NULL);
+ bs->open_flags |= BDRV_O_NOCACHE;
s->event_reader_pos = 0;
r = qemu_pipe(s->fds);
diff --git a/block/sheepdog.c b/block/sheepdog.c
index aa9707f..3ea2872 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -1248,6 +1248,7 @@ static int sd_open(BlockDriverState *bs, const char *filename, int flags)
s->min_dirty_data_idx = UINT32_MAX;
s->max_dirty_data_idx = 0;
+ bs->open_flags |= BDRV_O_NOCACHE;
bs->total_sectors = s->inode.vdi_size / SECTOR_SIZE;
strncpy(s->name, vdi, sizeof(s->name));
qemu_co_mutex_init(&s->lock);
diff --git a/block/vdi.c b/block/vdi.c
index 31cdfab..dfae347 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -441,6 +441,7 @@ static int vdi_open(BlockDriverState *bs, int flags)
goto fail;
}
+ bs->open_flags |= BDRV_O_NOCACHE;
bs->total_sectors = header.disk_size / SECTOR_SIZE;
s->block_size = header.block_size;
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 05/17] block: remove enable_write_cache
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (3 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 04/17] block: non-raw protocols never cache Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 06/17] block: move flag bits together Paolo Bonzini
` (12 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
This duplicates a bit in open_flags, we do not need it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 3 +--
block_int.h | 3 ---
2 files changed, 1 insertions(+), 5 deletions(-)
diff --git a/block.c b/block.c
index 6734e66..4cf441b 100644
--- a/block.c
+++ b/block.c
@@ -604,7 +604,6 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
bs->drv = drv;
bs->opaque = g_malloc0(drv->instance_size);
- bs->enable_write_cache = !!(flags & BDRV_O_CACHE_WB);
bs->keep_read_only = bs->read_only = !(open_flags & BDRV_O_RDWR);
/* Open the image, either directly or using a protocol */
@@ -2011,7 +2010,7 @@ int bdrv_is_sg(BlockDriverState *bs)
int bdrv_enable_write_cache(BlockDriverState *bs)
{
- return bs->enable_write_cache;
+ return !!(bs->open_flags & BDRV_O_CACHE_WB);
}
int bdrv_is_encrypted(BlockDriverState *bs)
diff --git a/block_int.h b/block_int.h
index 311bd2a..f846f90 100644
--- a/block_int.h
+++ b/block_int.h
@@ -244,9 +244,6 @@ struct BlockDriverState {
/* the memory alignment required for the buffers handled by this driver */
int buffer_alignment;
- /* do we need to tell the quest if we have a volatile write cache? */
- int enable_write_cache;
-
/* NOTE: the following infos are only hints for real hardware
drivers. They are not used by the block driver */
int cyls, heads, secs, translation;
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 06/17] block: move flag bits together
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (4 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 05/17] block: remove enable_write_cache Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 07/17] raw: remove the aligned_buf Paolo Bonzini
` (11 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block_int.h | 17 ++++++++---------
1 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/block_int.h b/block_int.h
index f846f90..9324866 100644
--- a/block_int.h
+++ b/block_int.h
@@ -192,15 +192,18 @@ struct BlockDriver {
struct BlockDriverState {
int64_t total_sectors; /* if we are reading a disk image, give its
size in sectors */
- int read_only; /* if true, the media is read only */
- int keep_read_only; /* if true, the media was requested to stay read only */
int open_flags; /* flags used to open the file, re-used for re-open */
- int encrypted; /* if true, the media is encrypted */
- int valid_key; /* if true, a valid encryption key has been set */
- int sg; /* if true, the device is a /dev/sg* */
int copy_on_read; /* if true, copy read backing sectors into image
note this is a reference count */
+ unsigned read_only:1; /* if true, the media is read only */
+ unsigned keep_read_only:1; /* if true, the media was requested to stay read only */
+ unsigned encrypted:1; /* if true, the media is encrypted */
+ unsigned valid_key:1; /* if true, a valid encryption key has been set */
+ unsigned sg:1; /* if true, the device is a /dev/sg* */
+ unsigned growable:1; /* if true, the disk can expand beyond total_sectors */
+ unsigned is_temporary:1; /* if true, the disk was created from a snapshot */
+
BlockDriver *drv; /* NULL means no media */
void *opaque;
@@ -213,7 +216,6 @@ struct BlockDriverState {
char backing_file[1024]; /* if non zero, the image is a diff of
this file image */
char backing_format[16]; /* if non-zero and backing_file exists */
- int is_temporary;
BlockDriverState *backing_hd;
BlockDriverState *file;
@@ -238,9 +240,6 @@ struct BlockDriverState {
uint64_t total_time_ns[BDRV_MAX_IOTYPE];
uint64_t wr_highest_sector;
- /* Whether the disk can expand beyond total_sectors */
- int growable;
-
/* the memory alignment required for the buffers handled by this driver */
int buffer_alignment;
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 07/17] raw: remove the aligned_buf
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (5 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 06/17] block: move flag bits together Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 08/17] block: rename buffer_alignment to guest_block_size Paolo Bonzini
` (10 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
This has been dead since the removal of synchronous I/O callbacks.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block/raw-posix.c | 25 +++----------------------
1 files changed, 3 insertions(+), 22 deletions(-)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 2ee5d69..007d1d3 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -129,8 +129,6 @@ typedef struct BDRVRawState {
int use_aio;
void *aio_ctx;
#endif
- uint8_t *aligned_buf;
- unsigned aligned_buf_size;
#ifdef CONFIG_XFS
bool is_xfs : 1;
#endif
@@ -216,23 +214,10 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
return ret;
}
s->fd = fd;
- s->aligned_buf = NULL;
-
- if ((bdrv_flags & BDRV_O_NOCACHE)) {
- /*
- * Allocate a buffer for read/modify/write cycles. Chose the size
- * pessimistically as we don't know the block size yet.
- */
- s->aligned_buf_size = 32 * MAX_BLOCKSIZE;
- s->aligned_buf = qemu_memalign(MAX_BLOCKSIZE, s->aligned_buf_size);
- if (s->aligned_buf == NULL) {
- goto out_close;
- }
- }
/* We're falling back to POSIX AIO in some cases so init always */
if (paio_init() < 0) {
- goto out_free_buf;
+ goto out_close;
}
#ifdef CONFIG_LINUX_AIO
@@ -245,7 +230,7 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
s->aio_ctx = laio_init();
if (!s->aio_ctx) {
- goto out_free_buf;
+ goto out_close;
}
s->use_aio = 1;
} else
@@ -264,8 +249,6 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
return 0;
-out_free_buf:
- qemu_vfree(s->aligned_buf);
out_close:
close(fd);
return -errno;
@@ -326,7 +309,7 @@ static BlockDriverAIOCB *raw_aio_submit(BlockDriverState *bs,
* boundary. Check if this is the case or tell the low-level
* driver that it needs to copy the buffer.
*/
- if (s->aligned_buf) {
+ if ((bs->open_flags & BDRV_O_NOCACHE)) {
if (!qiov_is_aligned(bs, qiov)) {
type |= QEMU_AIO_MISALIGNED;
#ifdef CONFIG_LINUX_AIO
@@ -374,8 +357,6 @@ static void raw_close(BlockDriverState *bs)
if (s->fd >= 0) {
close(s->fd);
s->fd = -1;
- if (s->aligned_buf != NULL)
- qemu_vfree(s->aligned_buf);
}
}
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 08/17] block: rename buffer_alignment to guest_block_size
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (6 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 07/17] raw: remove the aligned_buf Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 09/17] block: add host_block_size Paolo Bonzini
` (9 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
The alignment field is now set to the value that is promised by the guest,
rather than required by the host. The next patches will make QEMU aware
of the host-provided values, so make this clear.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 10 +++++-----
block.h | 2 +-
block/raw-posix.c | 2 +-
block_int.h | 4 ++--
hw/ide/core.c | 2 +-
hw/scsi-disk.c | 2 +-
hw/scsi-generic.c | 2 +-
hw/virtio-blk.c | 2 +-
8 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/block.c b/block.c
index 4cf441b..9e8ffb8 100644
--- a/block.c
+++ b/block.c
@@ -587,7 +587,7 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
bs->sg = 0;
bs->open_flags = open_flags;
bs->growable = 0;
- bs->buffer_alignment = 512;
+ bs->guest_block_size = 512;
assert(bs->copy_on_read == 0); /* bdrv_new() and bdrv_close() make it so */
if ((flags & BDRV_O_RDWR) && (flags & BDRV_O_COPY_ON_READ)) {
@@ -917,7 +917,7 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev)
bs->dev = NULL;
bs->dev_ops = NULL;
bs->dev_opaque = NULL;
- bs->buffer_alignment = 512;
+ bs->guest_block_size = 512;
}
/* TODO change to return DeviceState * when all users are qdevified */
@@ -3557,14 +3557,14 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
return NULL;
}
-void bdrv_set_buffer_alignment(BlockDriverState *bs, int align)
+void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
{
- bs->buffer_alignment = align;
+ bs->guest_block_size = align;
}
void *qemu_blockalign(BlockDriverState *bs, size_t size)
{
- return qemu_memalign((bs && bs->buffer_alignment) ? bs->buffer_alignment : 512, size);
+ return qemu_memalign((bs && bs->guest_block_size) ? bs->guest_block_size : 512, size);
}
void bdrv_set_dirty_tracking(BlockDriverState *bs, int enable)
diff --git a/block.h b/block.h
index 1790f99..e088146 100644
--- a/block.h
+++ b/block.h
@@ -303,7 +303,7 @@ int bdrv_img_create(const char *filename, const char *fmt,
const char *base_filename, const char *base_fmt,
char *options, uint64_t img_size, int flags);
-void bdrv_set_buffer_alignment(BlockDriverState *bs, int align);
+void bdrv_set_guest_block_size(BlockDriverState *bs, int align);
void *qemu_blockalign(BlockDriverState *bs, size_t size);
#define BDRV_SECTORS_PER_DIRTY_CHUNK 2048
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 007d1d3..49a8c21 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -287,7 +287,7 @@ static int qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
int i;
for (i = 0; i < qiov->niov; i++) {
- if ((uintptr_t) qiov->iov[i].iov_base % bs->buffer_alignment) {
+ if ((uintptr_t) qiov->iov[i].iov_base % bs->guest_block_size) {
return 0;
}
}
diff --git a/block_int.h b/block_int.h
index 9324866..99f29e2 100644
--- a/block_int.h
+++ b/block_int.h
@@ -240,8 +240,8 @@ struct BlockDriverState {
uint64_t total_time_ns[BDRV_MAX_IOTYPE];
uint64_t wr_highest_sector;
- /* the memory alignment required for the buffers handled by this driver */
- int buffer_alignment;
+ /* the memory alignment used by the guest for the buffers handled by this driver */
+ int guest_block_size;
/* NOTE: the following infos are only hints for real hardware
drivers. They are not used by the block driver */
diff --git a/hw/ide/core.c b/hw/ide/core.c
index 600eb28..3cceaea 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -1852,7 +1852,7 @@ int ide_init_drive(IDEState *s, BlockDriverState *bs, IDEDriveKind kind,
s->smart_selftest_count = 0;
if (kind == IDE_CD) {
bdrv_set_dev_ops(bs, &ide_cd_block_ops, s);
- bdrv_set_buffer_alignment(bs, 2048);
+ bdrv_set_guest_block_size(bs, 2048);
} else {
if (!bdrv_is_inserted(s->bs)) {
error_report("Device needs media, but drive is empty");
diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index 505accd..f9b11e4 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -1544,7 +1544,7 @@ static int scsi_initfn(SCSIDevice *dev)
if (s->removable) {
bdrv_set_dev_ops(s->qdev.conf.bs, &scsi_cd_block_ops, s);
}
- bdrv_set_buffer_alignment(s->qdev.conf.bs, s->qdev.blocksize);
+ bdrv_set_guest_block_size(s->qdev.conf.bs, s->qdev.blocksize);
bdrv_iostatus_enable(s->qdev.conf.bs);
add_boot_device_path(s->qdev.conf.bootindex, &dev->qdev, NULL);
diff --git a/hw/scsi-generic.c b/hw/scsi-generic.c
index 6f7d3db..b2feb04 100644
--- a/hw/scsi-generic.c
+++ b/hw/scsi-generic.c
@@ -184,7 +184,7 @@ static void scsi_read_complete(void * opaque, int ret)
s->blocksize = ldl_be_p(&r->buf[8]);
s->max_lba = ldq_be_p(&r->buf[0]);
}
- bdrv_set_buffer_alignment(s->conf.bs, s->blocksize);
+ bdrv_set_guest_block_size(s->conf.bs, s->blocksize);
scsi_req_data(&r->req, len);
if (!r->req.io_canceled) {
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index ef27421..d64ec72 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -606,7 +606,7 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,
register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
virtio_blk_save, virtio_blk_load, s);
bdrv_set_dev_ops(s->bs, &virtio_block_ops, s);
- bdrv_set_buffer_alignment(s->bs, conf->logical_block_size);
+ bdrv_set_guest_block_size(s->bs, conf->logical_block_size);
bdrv_iostatus_enable(s->bs);
add_boot_device_path(conf->bootindex, dev, "/disk@0,0");
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 09/17] block: add host_block_size
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (7 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 08/17] block: rename buffer_alignment to guest_block_size Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 10/17] raw: probe host_block_size Paolo Bonzini
` (8 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 4 +++-
block_int.h | 3 +++
2 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/block.c b/block.c
index 9e8ffb8..91622db 100644
--- a/block.c
+++ b/block.c
@@ -588,6 +588,7 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
bs->open_flags = open_flags;
bs->growable = 0;
bs->guest_block_size = 512;
+ bs->host_block_size = 512;
assert(bs->copy_on_read == 0); /* bdrv_new() and bdrv_close() make it so */
if ((flags & BDRV_O_RDWR) && (flags & BDRV_O_COPY_ON_READ)) {
@@ -612,6 +613,7 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
} else {
ret = bdrv_file_open(&bs->file, filename, open_flags);
bs->open_flags = bs->file->open_flags;
+ bs->host_block_size = bs->file->host_block_size;
if (ret >= 0) {
ret = drv->bdrv_open(bs, bs->file->open_flags);
}
@@ -3564,7 +3566,7 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
void *qemu_blockalign(BlockDriverState *bs, size_t size)
{
- return qemu_memalign((bs && bs->guest_block_size) ? bs->guest_block_size : 512, size);
+ return qemu_memalign(bs ? bs->host_block_size : 512, size);
}
void bdrv_set_dirty_tracking(BlockDriverState *bs, int enable)
diff --git a/block_int.h b/block_int.h
index 99f29e2..30393a0 100644
--- a/block_int.h
+++ b/block_int.h
@@ -243,6 +243,9 @@ struct BlockDriverState {
/* the memory alignment used by the guest for the buffers handled by this driver */
int guest_block_size;
+ /* the memory alignment required by the device for the buffers handled by this driver */
+ int host_block_size;
+
/* NOTE: the following infos are only hints for real hardware
drivers. They are not used by the block driver */
int cyls, heads, secs, translation;
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 10/17] raw: probe host_block_size
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (8 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 09/17] block: add host_block_size Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 11/17] iscsi: save host block size Paolo Bonzini
` (7 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
Use ioctls if possible, else see what alignment it takes for O_DIRECT
to succeed.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block/raw-posix.c | 72 ++++++++++++++++++++++++++++++++++++++++------------
block/raw-win32.c | 42 +++++++++++++++++++++++++++++++
2 files changed, 97 insertions(+), 17 deletions(-)
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 49a8c21..3537394 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -52,6 +52,7 @@
#include <sys/param.h>
#include <linux/cdrom.h>
#include <linux/fd.h>
+#include <linux/fs.h>
#endif
#if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
#include <sys/disk.h>
@@ -179,6 +180,58 @@ static int raw_normalize_devicepath(const char **filename)
}
#endif
+static void raw_probe_alignment(BlockDriverState *bs)
+{
+ BDRVRawState *s = bs->opaque;
+ char *buf;
+ unsigned int sector_size;
+
+ /* For /dev/sg devices the alignment is not really used. */
+ if (bs->sg) {
+ return;
+ }
+
+ /* For block devices, try to get the actual sector size even if we
+ * do not need it, so that it can be passed down to the guest.
+ */
+#ifdef BLKSSZGET
+ if (ioctl(s->fd, BLKSSZGET, §or_size) >= 0) {
+ bs->host_block_size = sector_size;
+ }
+#endif
+#ifdef DKIOCGETBLOCKSIZE
+ if (ioctl(s->fd, DKIOCGETBLOCKSIZE, §or_size) >= 0) {
+ bs->host_block_size = sector_size;
+ }
+#endif
+#ifdef DIOCGSECTORSIZE
+ if (ioctl(s->fd, DIOCGSECTORSIZE, §or_size) >= 0) {
+ bs->host_block_size = sector_size;
+ }
+#endif
+
+ /* If we could not get the size so far, we can only guess it if the file
+ * was opened with O_DIRECT. Using the minimal value (512) is okay.
+ * It may or may not be safe if the guest logical block size is >512;
+ * however, we will print a scary message suggesting usage of cache=none.
+ * If they hear our advice, the host block size will be detected correctly
+ * and the scary message will go away.
+ */
+ if (!(bs->open_flags & BDRV_O_NOCACHE)) {
+ return;
+ }
+
+ buf = qemu_memalign(MAX_BLOCKSIZE, MAX_BLOCKSIZE);
+ for (sector_size = 512; sector_size < MAX_BLOCKSIZE; sector_size <<= 1) {
+ /* The buffer must be aligned to sector_size, but not sector_size*2. */
+ if (pread(s->fd, buf + sector_size, sector_size, 0) >= 0) {
+ break;
+ }
+ }
+ bs->host_block_size = sector_size;
+ qemu_vfree(buf);
+}
+
static int raw_open_common(BlockDriverState *bs, const char *filename,
int bdrv_flags, int open_flags)
{
@@ -214,6 +267,7 @@ static int raw_open_common(BlockDriverState *bs, const char *filename,
return ret;
}
s->fd = fd;
+ raw_probe_alignment(bs);
/* We're falling back to POSIX AIO in some cases so init always */
if (paio_init() < 0) {
@@ -262,22 +316,6 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags)
return raw_open_common(bs, filename, flags, 0);
}
-/* XXX: use host sector size if necessary with:
-#ifdef DIOCGSECTORSIZE
- {
- unsigned int sectorsize = 512;
- if (!ioctl(fd, DIOCGSECTORSIZE, §orsize) &&
- sectorsize > bufsize)
- bufsize = sectorsize;
- }
-#endif
-#ifdef CONFIG_COCOA
- uint32_t blockSize = 512;
- if ( !ioctl( fd, DKIOCGETBLOCKSIZE, &blockSize ) && blockSize > bufsize) {
- bufsize = blockSize;
- }
-#endif
-*/
/*
* Check if all memory in this vector is sector aligned.
@@ -287,7 +325,7 @@ static int qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
int i;
for (i = 0; i < qiov->niov; i++) {
- if ((uintptr_t) qiov->iov[i].iov_base % bs->guest_block_size) {
+ if ((uintptr_t) qiov->iov[i].iov_base % bs->host_block_size) {
return 0;
}
}
diff --git a/block/raw-win32.c b/block/raw-win32.c
index e4b0b75..d8b76de 100644
--- a/block/raw-win32.c
+++ b/block/raw-win32.c
@@ -77,6 +77,35 @@ static int set_sparse(int fd)
NULL, 0, NULL, 0, &returned, NULL);
}
+static void raw_probe_alignment(BlockDriverState *bs)
+{
+ BDRVRawState *s = bs->opaque;
+ DWORD sectorsPerCluster, freeClusters, totalClusters, count;
+ DISK_GEOMETRY_EX dg;
+ BOOL status;
+
+ if (s->type == FTYPE_CD) {
+ bs->host_block_size = 2048;
+ return;
+ }
+ if (s->type == FTYPE_HARDDISK) {
+ status = DeviceIoControl(s->hfile, IOCTL_DISK_GET_DRIVE_GEOMETRY_EX,
+ NULL, 0, &dg, sizeof(dg), &count, NULL);
+ if (status != 0) {
+ bs->host_block_size = dg.Geometry.BytesPerSector;
+ return;
+ }
+ /* try GetDiskFreeSpace too */
+ }
+
+ if (s->drive_path[0]) {
+ GetDiskFreeSpace(s->drive_path, §orsPerCluster,
+ &dg.Geometry.BytesPerSector,
+ &freeClusters, &totalClusters);
+ bs->host_block_size = dg.Geometry.BytesPerSector;
+ }
+}
+
static int raw_open(BlockDriverState *bs, const char *filename, int flags)
{
BDRVRawState *s = bs->opaque;
@@ -96,6 +125,18 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags)
overlapped |= FILE_FLAG_NO_BUFFERING;
if (!(flags & BDRV_O_CACHE_WB))
overlapped |= FILE_FLAG_WRITE_THROUGH;
+
+ if (filename[0] && filename[1] == ':') {
+ snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", filename[0]);
+ } else if (filename[0] == '\\' && filename[1] == '\\') {
+ s->drive_path[0] = 0;
+ } else {
+ /* Relative path. */
+ char buf[MAX_PATH];
+ GetCurrentDirectory(MAX_PATH, buf);
+ snprintf(s->drive_path, sizeof(s->drive_path), "%c:\\", buf[0]);
+ }
+
s->hfile = CreateFile(filename, access_flags,
FILE_SHARE_READ, NULL,
OPEN_EXISTING, overlapped, NULL);
@@ -106,6 +147,7 @@ static int raw_open(BlockDriverState *bs, const char *filename, int flags)
return -EACCES;
return -1;
}
+ raw_probe_alignment(bs);
return 0;
}
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 11/17] iscsi: save host block size
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (9 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 10/17] raw: probe host_block_size Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 12/17] block: allow waiting only for overlapping writes Paolo Bonzini
` (6 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
The iSCSI backend already gets the block size from the READ CAPACITY
command it sends. Save it so that the generic block layer gets it
too.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block/iscsi.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/block/iscsi.c b/block/iscsi.c
index 4e6cf7a..8a0bce9 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -546,6 +546,7 @@ static int iscsi_open(BlockDriverState *bs, const char *filename, int flags)
if (iscsi_url != NULL) {
iscsi_destroy_url(iscsi_url);
}
+ bs->host_block_size = iscsilun->block_size;
return 0;
failed:
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 12/17] block: allow waiting only for overlapping writes
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (10 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 11/17] iscsi: save host block size Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 13/17] block: allow waiting at arbitrary granularity Paolo Bonzini
` (5 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
To implement mismatching block size, we will reuse the request tracking
mechanism that is used for copy-on-read. However, waiting for overlapping
reads is not needed to protect against "torn reads", so add a flag to
wait_for_overlapping_requests.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 13 ++++++++++---
1 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/block.c b/block.c
index 91622db..9a2ef83 100644
--- a/block.c
+++ b/block.c
@@ -1191,7 +1191,7 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
}
static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors)
+ int64_t sector_num, int nb_sectors, bool writes_only)
{
BdrvTrackedRequest *req;
int64_t cluster_sector_num;
@@ -1207,9 +1207,16 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
round_to_clusters(bs, sector_num, nb_sectors,
&cluster_sector_num, &cluster_nb_sectors);
+ if (writes_only && !(bs->open_flags & BDRV_O_RDWR)) {
+ return;
+ }
+
do {
retry = false;
QLIST_FOREACH(req, &bs->tracked_requests, list) {
+ if (writes_only && !req->is_write) {
+ continue;
+ }
if (tracked_request_overlaps(req, cluster_sector_num,
cluster_nb_sectors)) {
/* Hitting this means there was a reentrant request, for
@@ -1576,7 +1583,7 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
}
if (bs->copy_on_read) {
- wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors, false);
}
tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
@@ -1636,7 +1643,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
}
if (bs->copy_on_read) {
- wait_for_overlapping_requests(bs, sector_num, nb_sectors);
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors, false);
}
tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 13/17] block: allow waiting at arbitrary granularity
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (11 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 12/17] block: allow waiting only for overlapping writes Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 14/17] block: protect against "torn reads" for guest_block_size > host_block_size Paolo Bonzini
` (4 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
When emulating small logical block sizes, the only overlaps
that matter are at host block size granularity, not cluster.
Make wait_for_overlapping_requests more flexible in this
respect, too.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 43 ++++++++++++++++++++++++++++---------------
1 files changed, 28 insertions(+), 15 deletions(-)
diff --git a/block.c b/block.c
index 9a2ef83..07b9cf4 100644
--- a/block.c
+++ b/block.c
@@ -1159,24 +1159,33 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
/**
* Round a region to cluster boundaries
*/
-static void round_to_clusters(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors,
- int64_t *cluster_sector_num,
- int *cluster_nb_sectors)
+static void round_sectors(int64_t alignment,
+ int64_t sector_num, int nb_sectors,
+ int64_t *cluster_sector_num,
+ int *cluster_nb_sectors)
{
- BlockDriverInfo bdi;
-
- if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
+ if (alignment == BDRV_SECTOR_SIZE) {
*cluster_sector_num = sector_num;
*cluster_nb_sectors = nb_sectors;
} else {
- int64_t c = bdi.cluster_size / BDRV_SECTOR_SIZE;
+ int64_t c = alignment / BDRV_SECTOR_SIZE;
*cluster_sector_num = QEMU_ALIGN_DOWN(sector_num, c);
*cluster_nb_sectors = QEMU_ALIGN_UP(sector_num - *cluster_sector_num +
nb_sectors, c);
}
}
+static int64_t get_cluster_size(BlockDriverState *bs)
+{
+ BlockDriverInfo bdi;
+
+ if (bdrv_get_info(bs, &bdi) < 0 || bdi.cluster_size == 0) {
+ return BDRV_SECTOR_SIZE;
+ } else {
+ return bdi.cluster_size / BDRV_SECTOR_SIZE;
+ }
+}
+
static bool tracked_request_overlaps(BdrvTrackedRequest *req,
int64_t sector_num, int nb_sectors) {
/* aaaa bbbb */
@@ -1191,7 +1200,8 @@ static bool tracked_request_overlaps(BdrvTrackedRequest *req,
}
static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
- int64_t sector_num, int nb_sectors, bool writes_only)
+ int64_t sector_num, int nb_sectors,
+ int64_t granularity, bool writes_only)
{
BdrvTrackedRequest *req;
int64_t cluster_sector_num;
@@ -1204,8 +1214,8 @@ static void coroutine_fn wait_for_overlapping_requests(BlockDriverState *bs,
* CoR read and write operations are atomic and guest writes cannot
* interleave between them.
*/
- round_to_clusters(bs, sector_num, nb_sectors,
- &cluster_sector_num, &cluster_nb_sectors);
+ round_sectors(granularity, sector_num, nb_sectors,
+ &cluster_sector_num, &cluster_nb_sectors);
if (writes_only && !(bs->open_flags & BDRV_O_RDWR)) {
return;
@@ -1525,8 +1535,9 @@ static int coroutine_fn bdrv_co_copy_on_readv(BlockDriverState *bs,
/* Cover entire cluster so no additional backing file I/O is required when
* allocating cluster in the image file.
*/
- round_to_clusters(bs, sector_num, nb_sectors,
- &cluster_sector_num, &cluster_nb_sectors);
+ round_sectors(get_cluster_size(bs),
+ sector_num, nb_sectors,
+ &cluster_sector_num, &cluster_nb_sectors);
trace_bdrv_co_copy_on_readv(bs, sector_num, nb_sectors,
cluster_sector_num, cluster_nb_sectors);
@@ -1583,7 +1594,8 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
}
if (bs->copy_on_read) {
- wait_for_overlapping_requests(bs, sector_num, nb_sectors, false);
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors,
+ get_cluster_size(bs), false);
}
tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
@@ -1643,7 +1655,8 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
}
if (bs->copy_on_read) {
- wait_for_overlapping_requests(bs, sector_num, nb_sectors, false);
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors,
+ get_cluster_size(bs), false);
}
tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 14/17] block: protect against "torn reads" for guest_block_size > host_block_size
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (12 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 13/17] block: allow waiting at arbitrary granularity Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 15/17] block: align and serialize I/O when guest_block_size < host_block_size Paolo Bonzini
` (3 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
When the guest sees a higher alignment than the host, writes may be
done in multiple steps. So, reads have to be serialized against
overlapping writes, so that the writes look atomic to the guest.
This is true even when O_DIRECT is not in use.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/block.c b/block.c
index 07b9cf4..9e35c85 100644
--- a/block.c
+++ b/block.c
@@ -1598,6 +1598,16 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
get_cluster_size(bs), false);
}
+ /* When the guest sees a higher alignment than the host, writes may be
+ * done in multiple steps. So, reads have to be serialized against
+ * overlapping writes, so that the writes look atomic to the guest,
+ * even when O_DIRECT is not in use.
+ */
+ if (bs->guest_block_size > bs->host_block_size) {
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors,
+ bs->guest_block_size, true);
+ }
+
tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
if (bs->copy_on_read) {
@@ -3582,6 +3592,18 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
{
bs->guest_block_size = align;
+ if ((bs->open_flags & BDRV_O_RDWR) &&
+ bs->host_block_size < bs->guest_block_size) {
+ error_report("Host block size is %d, guest block size is %d. Due to partially\n"
+ "written sectors, power failures may cause data corruption.%s",
+ bs->host_block_size, bs->guest_block_size,
+
+ /* The host block size might not be detected correctly if
+ * the guest is not using O_DIRECT. */
+ (bs->open_flags & BDRV_O_NOCACHE) ? "" :
+ "\nIf you think this message is wrong, start the guest with cache=none"
+ "\nand see if it disappears.");
+ }
}
void *qemu_blockalign(BlockDriverState *bs, size_t size)
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 15/17] block: align and serialize I/O when guest_block_size < host_block_size
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (13 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 14/17] block: protect against "torn reads" for guest_block_size > host_block_size Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 16/17] block: default physical block size to host block size Paolo Bonzini
` (2 subsequent siblings)
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
When the guest sees a lower alignment than the host, and O_DIRECT
is in place, I/O might not span an integer number of host sectors.
If this is the case, add a few sectors at the beginning and the end.
When reading, copy interesting data from there to the guest buffer.
When writing, merge data from there with data from the guest buffer.
Since writes potentially become read-modify-write sequences, they have
to be serialized against other operations. Reads are still atomic
(they only need some postprocessing), so they need not be
serialized.
The math is a bit tricky, but luckily all the primitives are already in
cutils.c and iov.c.
On top of this, the guest's buffers might be misaligned with respect
to the host block size. However, that's caught by raw-posix.c and
posix-aio-compat.c, so we need not care about it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
Makefile.objs | 4 +-
block.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
trace-events | 1 +
3 files changed, 175 insertions(+), 4 deletions(-)
diff --git a/Makefile.objs b/Makefile.objs
index 3a699ee..2033c96 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -23,7 +23,7 @@ coroutine-obj-$(CONFIG_WIN32) += coroutine-win32.o
#######################################################################
# block-obj-y is code used by both qemu system emulation and qemu-img
-block-obj-y = cutils.o cache-utils.o qemu-option.o module.o async.o
+block-obj-y = cutils.o cache-utils.o qemu-option.o module.o async.o iov.o
block-obj-y += nbd.o block.o aio.o aes.o qemu-config.o qemu-progress.o qemu-sockets.o
block-obj-y += $(coroutine-obj-y) $(qobject-obj-y) $(version-obj-y)
block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
@@ -157,7 +157,7 @@ endif
common-obj-y += $(addprefix ui/, $(ui-obj-y))
common-obj-$(CONFIG_VNC) += $(addprefix ui/, $(vnc-obj-y))
-common-obj-y += iov.o acl.o
+common-obj-y += acl.o
common-obj-$(CONFIG_POSIX) += compatfd.o
common-obj-y += notify.o event_notifier.o
common-obj-y += qemu-timer.o qemu-timer-common.o
diff --git a/block.c b/block.c
index 9e35c85..44f5d83 100644
--- a/block.c
+++ b/block.c
@@ -28,6 +28,7 @@
#include "block_int.h"
#include "module.h"
#include "qjson.h"
+#include "iov.h"
#include "qemu-coroutine.h"
#include "qmp-commands.h"
#include "qemu-timer.h"
@@ -1175,6 +1176,21 @@ static void round_sectors(int64_t alignment,
}
}
+static bool req_is_aligned(int64_t alignment,
+ int64_t sector_num, int nb_sectors)
+{
+ if (alignment != BDRV_SECTOR_SIZE) {
+ int64_t c = alignment / BDRV_SECTOR_SIZE;
+ if ((sector_num & (c - 1)) != 0) {
+ return false;
+ }
+ if ((nb_sectors & (c - 1)) != 0) {
+ return false;
+ }
+ }
+ return true;
+}
+
static int64_t get_cluster_size(BlockDriverState *bs)
{
BlockDriverInfo bdi;
@@ -1571,6 +1587,135 @@ err:
return ret;
}
+static int coroutine_fn bdrv_rw_co_align(BlockDriverState *bs,
+ int64_t sector_num, int nb_sectors, QEMUIOVector *qiov, bool is_write)
+{
+ BlockDriver *drv = bs->drv;
+ QEMUIOVector aligned_qiov;
+
+ int alignment = bs->host_block_size / BDRV_SECTOR_SIZE;
+ int64_t io_sector_num;
+ int io_nb_sectors;
+ int head_extra, tail_sectors;
+ int head_sectors = 0;
+ int tail_offset = 0;
+ int ret;
+ uint8_t *head = NULL;
+ uint8_t *tail = NULL;
+
+ trace_bdrv_rw_co_align(bs, sector_num, nb_sectors, is_write);
+ round_sectors(bs->host_block_size, sector_num, nb_sectors,
+ &io_sector_num, &io_nb_sectors);
+
+ /*
+ * Here is a drawing of some involved quantities. Double lines
+ * mark points that are aligned to a host block.
+ *
+ * io_sector_num + io_nb_sectors ---.
+ * .-- io_sector_num \
+ * / sector_num + nb_sectors ---. \
+ * /______________ ________________ _____ ______________\____\
+ * || | || || | ||
+ * || head_extra | head_sectors || ... || tail_sectors | ... ||
+ * ||______________|________________||_____||______________|_____||
+ * \ \ \
+ * `--- sector_num \ `--- sector_num + tail_offset
+ * \
+ * `--- io_sector_num + alignment
+ */
+ head_extra = sector_num & (alignment - 1);
+ tail_sectors = (sector_num + nb_sectors) & (alignment - 1);
+
+ qemu_iovec_init(&aligned_qiov, qiov->niov + 2);
+
+ /* If the initial sector is not aligned, we need a "head" block. If a
+ * single block will satisfy the request, we put it in the "head" too.
+ * This way, the tail always starts on an aligned sector.
+ */
+ if (head_extra != 0 || io_nb_sectors == alignment) {
+ head_sectors = MIN(nb_sectors, alignment - head_extra);
+ head = qemu_blockalign(bs, bs->host_block_size);
+ qemu_iovec_add(&aligned_qiov, head, bs->host_block_size);
+ }
+
+ /* If the aligned request is just 1 block, we will read it in head,
+ * so nothing else to do.
+ */
+ if (io_nb_sectors > alignment) {
+ /* Otherwise, see where the last full block ends... */
+ tail_offset = nb_sectors - tail_sectors;
+ assert(((sector_num + tail_offset) & (alignment - 1)) == 0);
+
+ /* ... add everything after the head and up to it. */
+ qemu_iovec_copy(&aligned_qiov, qiov,
+ head_sectors * BDRV_SECTOR_SIZE,
+ (tail_offset - head_sectors) * BDRV_SECTOR_SIZE);
+
+ /* If the final sector is not aligned, add an extra block at
+ * the end.
+ */
+ if (tail_sectors != 0) {
+ tail = qemu_blockalign(bs, bs->host_block_size);
+ qemu_iovec_add(&aligned_qiov, tail, bs->host_block_size);
+ }
+ }
+
+ assert(head || tail);
+ if (is_write) {
+ QEMUIOVector rmw_qiov;
+ struct iovec rmw_iov;
+
+ /* Merge the user data with data read from the first block... */
+ if (head != NULL) {
+ rmw_iov.iov_base = head;
+ rmw_iov.iov_len = bs->host_block_size;
+ qemu_iovec_init_external(&rmw_qiov, &rmw_iov, 1);
+ ret = drv->bdrv_co_readv(bs, io_sector_num, alignment, &rmw_qiov);
+ if (ret < 0) {
+ return ret;
+ }
+ iov_to_buf(qiov->iov, qiov->size,
+ (head + head_extra * BDRV_SECTOR_SIZE), 0,
+ head_sectors * BDRV_SECTOR_SIZE);
+ }
+
+ /* ... and from the last block. The two reads might be merged in
+ * some cases, but it is rare enough that we do not care. */
+ if (tail != NULL) {
+ rmw_iov.iov_base = tail;
+ rmw_iov.iov_len = bs->host_block_size;
+ qemu_iovec_init_external(&rmw_qiov, &rmw_iov, 1);
+ ret = drv->bdrv_co_readv(bs, sector_num + tail_offset, alignment, &rmw_qiov);
+ if (ret < 0) {
+ return ret;
+ }
+ iov_to_buf(qiov->iov, qiov->size,
+ tail, tail_offset * BDRV_SECTOR_SIZE,
+ tail_sectors * BDRV_SECTOR_SIZE);
+ }
+ ret = drv->bdrv_co_writev(bs, io_sector_num, io_nb_sectors, &aligned_qiov);
+ } else {
+ ret = drv->bdrv_co_readv(bs, io_sector_num, io_nb_sectors, &aligned_qiov);
+ /* Copy the misaligned data that the user wants. */
+ if (head != NULL) {
+ iov_from_buf(qiov->iov, qiov->size,
+ (head + head_extra * BDRV_SECTOR_SIZE), 0,
+ head_sectors * BDRV_SECTOR_SIZE);
+ }
+ if (tail != NULL) {
+ iov_from_buf(qiov->iov, qiov->size,
+ tail, tail_offset * BDRV_SECTOR_SIZE,
+ tail_sectors * BDRV_SECTOR_SIZE);
+ }
+ }
+
+ qemu_iovec_destroy(&aligned_qiov);
+ g_free(head);
+ g_free(tail);
+ return ret;
+}
+
+
/*
* Handle a read request in coroutine context
*/
@@ -1624,7 +1769,11 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
}
}
- ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+ if (!req_is_aligned(bs->host_block_size, sector_num, nb_sectors)) {
+ ret = bdrv_rw_co_align(bs, sector_num, nb_sectors, qiov, false);
+ } else {
+ ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+ }
out:
tracked_request_end(&req);
@@ -1669,9 +1818,24 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
get_cluster_size(bs), false);
}
+ /* When the guest sees a lower alignment than the host, and O_DIRECT
+ * is in place, assume that the write will become a read-modify-write
+ * sequence. These have to be serialized against each other, so that
+ * they look atomic to the guest.
+ */
+ if ((bs->open_flags & BDRV_O_NOCACHE) &&
+ bs->guest_block_size < bs->host_block_size) {
+ wait_for_overlapping_requests(bs, sector_num, nb_sectors,
+ bs->host_block_size, true);
+ }
+
tracked_request_begin(&req, bs, sector_num, nb_sectors, true);
- ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
+ if (!req_is_aligned(bs->host_block_size, sector_num, nb_sectors)) {
+ ret = bdrv_rw_co_align(bs, sector_num, nb_sectors, qiov, true);
+ } else {
+ ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
+ }
if (bs->dirty_bitmap) {
set_dirty_bitmap(bs, sector_num, nb_sectors, 1);
@@ -3592,6 +3756,12 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
{
bs->guest_block_size = align;
+ if ((bs->open_flags & BDRV_O_NOCACHE) &&
+ bs->guest_block_size < bs->host_block_size) {
+ error_report("Host block size is %d, guest block size is %d. Direct\n"
+ "access to the host device may cause performance degradation.",
+ bs->host_block_size, bs->guest_block_size);
+ }
if ((bs->open_flags & BDRV_O_RDWR) &&
bs->host_block_size < bs->guest_block_size) {
error_report("Host block size is %d, guest block size is %d. Due to partially\n"
diff --git a/trace-events b/trace-events
index 514849a..d33191d 100644
--- a/trace-events
+++ b/trace-events
@@ -68,6 +68,7 @@ bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"P
bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num %"PRId64" nb_sectors %d"
bdrv_co_io_em(void *bs, int64_t sector_num, int nb_sectors, int is_write, void *acb) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d acb %p"
bdrv_co_copy_on_readv(void *bs, int64_t sector_num, int nb_sectors, int64_t cluster_sector_num, int cluster_nb_sectors) "bs %p sector_num %"PRId64" nb_sectors %d cluster_sector_num %"PRId64" cluster_nb_sectors %d"
+bdrv_rw_co_align(void *bs, int64_t sector_num, int nb_sectors, bool is_write) "bs %p sector_num %"PRId64" nb_sectors %d is_write %d"
# hw/virtio-blk.c
virtio_blk_req_complete(void *req, int status) "req %p status %d"
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 16/17] block: default physical block size to host block size
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (14 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 15/17] block: align and serialize I/O when guest_block_size < host_block_size Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 17/17] qemu-io: add blocksize argument to open Paolo Bonzini
2011-12-14 11:13 ` [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Kevin Wolf
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
Hopefully, this will help operating systems and they will make less
misaligned I/O operations. Or on Linux, at least fdisk will tell the
user that they should try and align partitions.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
block.c | 17 +++++++++++++++++
block.h | 15 ++-------------
2 files changed, 19 insertions(+), 13 deletions(-)
diff --git a/block.c b/block.c
index 44f5d83..1bfadaf 100644
--- a/block.c
+++ b/block.c
@@ -4041,3 +4041,20 @@ out:
return ret;
}
+
+unsigned int get_physical_block_exp(BlockConf *conf)
+{
+ unsigned int exp = 0, size;
+
+ if (conf->bs && !conf->physical_block_size) {
+ size = conf->bs->host_block_size;
+ } else {
+ size = conf->physical_block_size;
+ }
+
+ for (; size > conf->logical_block_size; size >>= 1) {
+ exp++;
+ }
+
+ return exp;
+}
diff --git a/block.h b/block.h
index e088146..67594d4 100644
--- a/block.h
+++ b/block.h
@@ -401,25 +401,14 @@ typedef struct BlockConf {
uint32_t discard_granularity;
} BlockConf;
-static inline unsigned int get_physical_block_exp(BlockConf *conf)
-{
- unsigned int exp = 0, size;
-
- for (size = conf->physical_block_size;
- size > conf->logical_block_size;
- size >>= 1) {
- exp++;
- }
-
- return exp;
-}
+unsigned int get_physical_block_exp(BlockConf *conf);
#define DEFINE_BLOCK_PROPERTIES(_state, _conf) \
DEFINE_PROP_DRIVE("drive", _state, _conf.bs), \
DEFINE_PROP_UINT16("logical_block_size", _state, \
_conf.logical_block_size, 512), \
DEFINE_PROP_UINT16("physical_block_size", _state, \
- _conf.physical_block_size, 512), \
+ _conf.physical_block_size, 0), \
DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0), \
DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0), \
DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1), \
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [Qemu-devel] [PATCH 17/17] qemu-io: add blocksize argument to open
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (15 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 16/17] block: default physical block size to host block size Paolo Bonzini
@ 2011-12-13 12:37 ` Paolo Bonzini
2011-12-14 11:13 ` [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Kevin Wolf
17 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-13 12:37 UTC (permalink / raw)
To: qemu-devel
Make it possible to test the alignment code using qemu-io.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
qemu-io.c | 33 ++++++++++++++++++++++++++++-----
1 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/qemu-io.c b/qemu-io.c
index ffa62fb..f866eb3 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -1594,7 +1594,7 @@ static const cmdinfo_t close_cmd = {
.oneline = "close the current open file",
};
-static int openfile(char *name, int flags, int growable)
+static int openfile(char *name, int flags, int growable, int blocksize)
{
if (bs) {
fprintf(stderr, "file open already, try 'help close'\n");
@@ -1615,6 +1615,17 @@ static int openfile(char *name, int flags, int growable)
bs = NULL;
return 1;
}
+ if (blocksize) {
+ if (blocksize < bs->host_block_size && (flags & BDRV_O_NOCACHE)) {
+ fprintf(stderr, "%s: block size cannot be smaller than the actual size\n", progname);
+ } else {
+ if (blocksize > 512) {
+ /* Always go through the RMW logic. */
+ bs->open_flags |= BDRV_O_NOCACHE;
+ }
+ bs->host_block_size = blocksize;
+ }
+ }
}
return 0;
@@ -1633,7 +1644,8 @@ static void open_help(void)
" -r, -- open file read-only\n"
" -s, -- use snapshot file\n"
" -n, -- disable host cache\n"
-" -g, -- allow file to grow (only applies to protocols)"
+" -g, -- allow file to grow (only applies to protocols)\n"
+" -bSIZE, -- behave as if the host block size was SIZE\n"
"\n");
}
@@ -1656,9 +1668,11 @@ static int open_f(int argc, char **argv)
int flags = 0;
int readonly = 0;
int growable = 0;
+ long blocksize = 0;
+ char *endptr = NULL;
int c;
- while ((c = getopt(argc, argv, "snrg")) != EOF) {
+ while ((c = getopt(argc, argv, "snrgb:")) != EOF) {
switch (c) {
case 's':
flags |= BDRV_O_SNAPSHOT;
@@ -1672,6 +1686,15 @@ static int open_f(int argc, char **argv)
case 'g':
growable = 1;
break;
+ case 'b':
+ blocksize = strtol(optarg, &endptr, 0);
+ if (blocksize < 512 || blocksize > 65536 ||
+ (blocksize & (blocksize - 1)) || *endptr != '\0') {
+ printf("The block size must be a power of two "
+ "between 512 and 65536.\n");
+ return -1;
+ }
+ break;
default:
return command_usage(&open_cmd);
}
@@ -1685,7 +1708,7 @@ static int open_f(int argc, char **argv)
return command_usage(&open_cmd);
}
- return openfile(argv[optind], flags, growable);
+ return openfile(argv[optind], flags, growable, blocksize);
}
static int init_args_command(int index)
@@ -1825,7 +1848,7 @@ int main(int argc, char **argv)
}
if ((argc - optind) == 1) {
- openfile(argv[optind], flags, growable);
+ openfile(argv[optind], flags, growable, 0);
}
command_loop();
--
1.7.7.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
` (16 preceding siblings ...)
2011-12-13 12:37 ` [Qemu-devel] [PATCH 17/17] qemu-io: add blocksize argument to open Paolo Bonzini
@ 2011-12-14 11:13 ` Kevin Wolf
2011-12-14 11:47 ` Paolo Bonzini
17 siblings, 1 reply; 25+ messages in thread
From: Kevin Wolf @ 2011-12-14 11:13 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Christoph Hellwig, qemu-devel, Stefan Hajnoczi
Am 13.12.2011 13:37, schrieb Paolo Bonzini:
> Running with mismatched host and guest logical block sizes is going
> to become more important as 4k-sector disks become more widespread.
> This is because we need a 512 byte disk to boot from.
>
> Mismatched block sizes have two problems:
>
> 1) with cache=none or with non-raw protocols, you just cannot do 512-byte
> granularity output. You need to do read-modify-write cycles like "hybrid"
> 512b-logical/4k-physical disks do. (Note that actually only the iSCSI
> protocol supports 4k logical blocks).
>
> 2) when host block size < guest block size, guests issue 4k-aligned
> I/O and expect it to be atomic. This problem cannot really be solved
> completely, because power or I/O failures could leave a partially-written
> block ("torn page"). However, at least you can serialize reads against
> overlapping writes, which guarantees correctness as long as shutdown is
> clean and there are no I/O errors.
As we discussed before, the really interesting point here is defaults,
and whatever you choose to do is wrong in some respect.
So it looks like you chose to make the virtual device default to the
host block size. If you want to migrate between two hosts with different
sector sizes, you need to specify the block size of the virtual device
explicitly. I think this is reasonable so far.
It also means that depending on which disk your image resides, the guest
sees different hardware by default. It may well happen that an image
that ran just fine on a host with 512 byte sectors refuses to work on a
host with 4k sectors. This is something that I don't really like to see.
(I wonder if it would make sense to optionally save a block size in the
qcow2 image header that would be used by default?)
Other defaults would imply doing (potentially unsafe) emulation, so I
think what you chose might be the least bad option. Still worth having a
discussion about it on the list.
Kevin
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes
2011-12-14 11:13 ` [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Kevin Wolf
@ 2011-12-14 11:47 ` Paolo Bonzini
2011-12-14 12:05 ` Kevin Wolf
0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-14 11:47 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Christoph Hellwig, qemu-devel, Stefan Hajnoczi
On 12/14/2011 12:13 PM, Kevin Wolf wrote:
> As we discussed before, the really interesting point here is defaults,
> and whatever you choose to do is wrong in some respect.
>
> So it looks like you chose to make the virtual device default to the
> host block size.
... wait wait, I default to 512. :)
Here is the rationale. 512-over-4k may be slow, but is safe (but it is
not slow if you align partitions properly). 4k-over-512 is unsafe. So,
defaulting to 512 seemed the right thing after all.
Paolo
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes
2011-12-14 11:47 ` Paolo Bonzini
@ 2011-12-14 12:05 ` Kevin Wolf
2011-12-14 12:40 ` Paolo Bonzini
0 siblings, 1 reply; 25+ messages in thread
From: Kevin Wolf @ 2011-12-14 12:05 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Christoph Hellwig, qemu-devel, Stefan Hajnoczi
Am 14.12.2011 12:47, schrieb Paolo Bonzini:
> On 12/14/2011 12:13 PM, Kevin Wolf wrote:
>> As we discussed before, the really interesting point here is defaults,
>> and whatever you choose to do is wrong in some respect.
>>
>> So it looks like you chose to make the virtual device default to the
>> host block size.
>
> ... wait wait, I default to 512. :)
>
> Here is the rationale. 512-over-4k may be slow, but is safe (but it is
> not slow if you align partitions properly). 4k-over-512 is unsafe. So,
> defaulting to 512 seemed the right thing after all.
Which means bounce buffers by default on 4k hosts. Is this going to
become our next cache=writethrough? At some point 4k disks will be in
wide use, but we'll still be stuck with a slow default of 512.
No matter what we decide here, I think it might really be a good idea to
save the block size in the image and use that as the default if nothing
else is specified on the command line.
Kevin
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes
2011-12-14 12:05 ` Kevin Wolf
@ 2011-12-14 12:40 ` Paolo Bonzini
2011-12-21 16:55 ` Christoph Hellwig
0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-14 12:40 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Christoph Hellwig, qemu-devel, Stefan Hajnoczi
On 12/14/2011 01:05 PM, Kevin Wolf wrote:
> Am 14.12.2011 12:47, schrieb Paolo Bonzini:
>> On 12/14/2011 12:13 PM, Kevin Wolf wrote:
>>> As we discussed before, the really interesting point here is defaults,
>>> and whatever you choose to do is wrong in some respect.
>>>
>>> So it looks like you chose to make the virtual device default to the
>>> host block size.
>>
>> ... wait wait, I default to 512. :)
>>
>> Here is the rationale. 512-over-4k may be slow, but is safe (but it is
>> not slow if you align partitions properly). 4k-over-512 is unsafe. So,
>> defaulting to 512 seemed the right thing after all.
>
> Which means bounce buffers by default on 4k hosts.
In practice it doesn't (quite surprisingly). The patches do the following:
- if the initial and ending sector is aligned, submit directly to paio.
Otherwise, only bounce the extra host sectors (up to 2) required to
align the operation: the bulk of the request will reuse the guest's data
buffer.
- if the buffer is not 4k-aligned, paio will linearize the request with
a bounce buffer. This will almost always happen if the initial sector
is misaligned, but not if only the ending sector is misaligned.
If the partitions are aligned, the OS will always issue aligned
requests, because file system blocks are already 4k. And kernel buffers
will usually be page-aligned rather than block-aligned, so paio will
also let the request through. You'll see perhaps half a dozen
misaligned requests in the whole guest lifetime, for example to read the
partition table. So for aligned partitions, the performance difference
on iozone was well within statistical noise.
> Is this going to
> become our next cache=writethrough? At some point 4k disks will be in
> wide use, but we'll still be stuck with a slow default of 512.
Unless you switch to EFI, the boot disk has to remain anyway on 512-byte
blocks.
> No matter what we decide here, I think it might really be a good idea to
> save the block size in the image and use that as the default if nothing
> else is specified on the command line.
Yeah, that's sensible to do (though it can be a follow-up).
Paolo
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Qemu-devel] [PATCH 03/17] block: pass protocol flags up to the format
2011-12-13 12:37 ` [Qemu-devel] [PATCH 03/17] block: pass protocol flags up to the format Paolo Bonzini
@ 2011-12-15 4:10 ` Zhi Yong Wu
0 siblings, 0 replies; 25+ messages in thread
From: Zhi Yong Wu @ 2011-12-15 4:10 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel
On Tue, Dec 13, 2011 at 8:37 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> In the next patches, the protocols will modify bs->open_flags to signify
> that they cannot support the exact requested feature set. Pass the
> modified flags to the format.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> block.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/block.c b/block.c
> index fa11e3a..6734e66 100644
> --- a/block.c
> +++ b/block.c
> @@ -612,8 +612,9 @@ static int bdrv_open_common(BlockDriverState *bs, const char *filename,
> ret = drv->bdrv_file_open(bs, filename, open_flags);
> } else {
> ret = bdrv_file_open(&bs->file, filename, open_flags);
> + bs->open_flags = bs->file->open_flags;
> if (ret >= 0) {
> - ret = drv->bdrv_open(bs, open_flags);
> + ret = drv->bdrv_open(bs, bs->file->open_flags);
s/bs->file->open_flags/bs->open_flags/, it is more reasonable to use
bs->open_flags here.
> }
> }
>
> --
> 1.7.7.1
>
>
>
--
Regards,
Zhi Yong Wu
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes
2011-12-14 12:40 ` Paolo Bonzini
@ 2011-12-21 16:55 ` Christoph Hellwig
2011-12-21 17:00 ` Paolo Bonzini
0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2011-12-21 16:55 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Kevin Wolf, Christoph Hellwig, qemu-devel, Stefan Hajnoczi
On Wed, Dec 14, 2011 at 01:40:22PM +0100, Paolo Bonzini wrote:
> If the partitions are aligned, the OS will always issue aligned requests,
> because file system blocks are already 4k.
It won't nessecarily. For example XFS will do a fair amount of sub-blocksize
I/O for metadata and the log. Note that if you tell XFS that the
minimal_io_size is 4k it will at least align the log write to it, which
is what really matters.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes
2011-12-21 16:55 ` Christoph Hellwig
@ 2011-12-21 17:00 ` Paolo Bonzini
0 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2011-12-21 17:00 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi
On 12/21/2011 05:55 PM, Christoph Hellwig wrote:
> On Wed, Dec 14, 2011 at 01:40:22PM +0100, Paolo Bonzini wrote:
>> If the partitions are aligned, the OS will always issue aligned requests,
>> because file system blocks are already 4k.
>
> It won't nessecarily. For example XFS will do a fair amount of sub-blocksize
> I/O for metadata and the log. Note that if you tell XFS that the
> minimal_io_size is 4k it will at least align the log write to it, which
> is what really matters.
The physical_block_size doesn't matter?
Paolo
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2011-12-21 17:00 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-13 12:37 [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 01/17] block: do not rely on open_flags for bdrv_is_snapshot Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 02/17] block: store actual flags in bs->open_flags Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 03/17] block: pass protocol flags up to the format Paolo Bonzini
2011-12-15 4:10 ` Zhi Yong Wu
2011-12-13 12:37 ` [Qemu-devel] [PATCH 04/17] block: non-raw protocols never cache Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 05/17] block: remove enable_write_cache Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 06/17] block: move flag bits together Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 07/17] raw: remove the aligned_buf Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 08/17] block: rename buffer_alignment to guest_block_size Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 09/17] block: add host_block_size Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 10/17] raw: probe host_block_size Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 11/17] iscsi: save host block size Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 12/17] block: allow waiting only for overlapping writes Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 13/17] block: allow waiting at arbitrary granularity Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 14/17] block: protect against "torn reads" for guest_block_size > host_block_size Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 15/17] block: align and serialize I/O when guest_block_size < host_block_size Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 16/17] block: default physical block size to host block size Paolo Bonzini
2011-12-13 12:37 ` [Qemu-devel] [PATCH 17/17] qemu-io: add blocksize argument to open Paolo Bonzini
2011-12-14 11:13 ` [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes Kevin Wolf
2011-12-14 11:47 ` Paolo Bonzini
2011-12-14 12:05 ` Kevin Wolf
2011-12-14 12:40 ` Paolo Bonzini
2011-12-21 16:55 ` Christoph Hellwig
2011-12-21 17:00 ` Paolo Bonzini
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).