From: Paolo Bonzini <pbonzini@redhat.com>
To: qemu-devel@nongnu.org
Subject: [Qemu-devel] [PATCH 00/17] Support mismatched host and guest logical block sizes
Date: Tue, 13 Dec 2011 13:37:03 +0100
Message-ID: <1323779840-4235-1-git-send-email-pbonzini@redhat.com>
Running with mismatched host and guest logical block sizes is going
to become more important as 4k-sector disks become more widespread.
This is because a disk with 512-byte logical sectors is still needed to boot from.
Mismatched block sizes cause two problems:
1) with cache=none or with non-raw protocols, you simply cannot write with
512-byte granularity. You need to do read-modify-write cycles, like "hybrid"
512b-logical/4k-physical disks do. (Note that, among the protocols, only
iSCSI actually supports 4k logical blocks.)
2) when host block size < guest block size, guests issue 4k-aligned
I/O and expect it to be atomic. This problem cannot really be solved
completely, because power or I/O failures could leave a partially-written
block ("torn page"). However, at least you can serialize reads against
overlapping writes, which guarantees correctness as long as shutdown is
clean and there are no I/O errors.
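To make the read-modify-write cycle from 1) concrete, here is a minimal sketch against an in-memory "disk" that only accepts whole-block I/O. It is not code from this series: HostDisk, host_read_block, host_write_block and rmw_write are hypothetical names chosen for illustration, and the sketch ignores the serialization that a real implementation needs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical backing store that only accepts whole host blocks. */
typedef struct {
    uint8_t *data;
    size_t host_block_size;   /* e.g. 4096 */
    size_t nb_blocks;
} HostDisk;

static void host_read_block(const HostDisk *d, size_t blk, uint8_t *buf)
{
    assert(blk < d->nb_blocks);
    memcpy(buf, d->data + blk * d->host_block_size, d->host_block_size);
}

static void host_write_block(HostDisk *d, size_t blk, const uint8_t *buf)
{
    assert(blk < d->nb_blocks);
    memcpy(d->data + blk * d->host_block_size, buf, d->host_block_size);
}

/*
 * Write len bytes at byte offset `offset`, where neither value has to be
 * a multiple of the host block size.  Partially covered host blocks are
 * read into a bounce buffer, patched, and written back as a whole.
 */
static int rmw_write(HostDisk *d, size_t offset, const uint8_t *src, size_t len)
{
    size_t bs = d->host_block_size;
    uint8_t *bounce;

    assert(offset + len <= d->nb_blocks * bs);
    bounce = malloc(bs);
    if (!bounce) {
        return -1;
    }
    while (len > 0) {
        size_t blk = offset / bs;
        size_t in_blk = offset % bs;
        size_t n = bs - in_blk;

        if (n > len) {
            n = len;
        }
        if (n == bs) {
            /* Whole block covered: no read is needed. */
            host_write_block(d, blk, src);
        } else {
            host_read_block(d, blk, bounce);     /* read   */
            memcpy(bounce + in_blk, src, n);     /* modify */
            host_write_block(d, blk, bounce);    /* write  */
        }
        offset += n;
        src += n;
        len -= n;
    }
    free(bounce);
    return 0;
}
```

A 512-byte guest write at offset 512 on a 4k-block host goes through the bounce-buffer path and leaves the rest of the host block untouched. If two such writes to the same host block run concurrently, one can overwrite the other's update, which is why the writes have to be serialized.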
Read-modify-write cycles are of course slower, and they need to serialize
writes, which makes the situation even worse. However, the performance
impact of emulating 512-byte sectors is within the noise when partitions are
aligned. File system blocks are usually 4k or bigger, and OSes tend
to use 4k-aligned buffers. So when partitions are aligned, no misaligned
I/O is sent and no bounce buffer is necessary either.
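The alignment argument can be checked with a little arithmetic. The sketch below is only an illustration (request_is_aligned and fs_block_offset are hypothetical helpers, not functions from the block layer). It assumes 512-byte guest sectors and compares a partition starting at sector 2048 with one starting at sector 63, the classic misaligned layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x is a nonzero power of two. */
static bool is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/*
 * A request can go straight to the host only if both its offset and its
 * length are multiples of the host block size.
 */
static bool request_is_aligned(uint64_t offset, uint64_t len,
                               uint64_t host_block_size)
{
    assert(is_pow2(host_block_size));
    return ((offset | len) & (host_block_size - 1)) == 0;
}

/*
 * Byte offset on the disk of file system block `fs_block`, for a
 * partition starting at `part_start_sector` (in 512-byte guest sectors).
 */
static uint64_t fs_block_offset(uint64_t part_start_sector,
                                uint64_t fs_block, uint64_t fs_block_size)
{
    return part_start_sector * 512 + fs_block * fs_block_size;
}
```

With a partition at sector 2048 (2048 * 512 = 1048576, a multiple of 4096) every 4k file system block lands on a 4k host block boundary. With a partition at sector 63 (63 * 512 = 32256, which is 3584 modulo 4096) every 4k file system block straddles two host blocks, so each write needs read-modify-write at both ends.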
The situation is very different if partitions are misaligned or if the
guest is using O_DIRECT with a 512-byte aligned buffer. I benchmarked
only the former using iozone on a RHEL6 guest (2GB memory, 20GB ext4
partition with the whole 4k-sector disk assigned to the guest). Graphs
aren't really pretty, but two points are more or less discernible (also
more or less obvious):
- writes incur a larger overhead than reads by 5-10%;
- for larger file sizes the penalty is smaller, probably because
the I/O scheduler can work better (with almost no penalty for reads);
for smaller file sizes, up to 1M or even more for some scenarios,
misalignment worsened performance by 10-25%.
The series is structured as follows.
Patches 1 to 6 clean up the handling of flag bits, so that non-raw
protocols can always request read-modify-write operation (even when
cache != none).
Patches 7 to 11 distinguish host and guest block sizes in the
BlockDriverState.
Patches 12 to 15 reuse the request tracking mechanism to implement
RMW and to avoid torn pages.
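As a rough illustration of waiting only for overlapping writes at a chosen granularity, the sketch below rounds both the new request and each in-flight write out to the granularity and then tests for overlap. This is an assumption about the general shape of such a check, not the code in these patches; TrackedReq, round_to_granularity and must_wait_for are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of an in-flight request (byte units). */
typedef struct {
    uint64_t offset;
    uint64_t len;
    bool is_write;
} TrackedReq;

/* Expand [offset, offset + len) outwards to multiples of gran. */
static void round_to_granularity(uint64_t offset, uint64_t len, uint64_t gran,
                                 uint64_t *start, uint64_t *end)
{
    assert(gran > 0 && len > 0);
    *start = offset - offset % gran;
    *end = (offset + len + gran - 1) / gran * gran;
}

/*
 * Return true if the new request touches a granularity-sized block that
 * an in-flight write also touches.  In-flight reads are ignored.
 */
static bool must_wait_for(const TrackedReq *inflight, size_t n,
                          uint64_t offset, uint64_t len, uint64_t gran)
{
    uint64_t s, e;

    round_to_granularity(offset, len, gran, &s, &e);
    for (size_t i = 0; i < n; i++) {
        uint64_t ws, we;

        if (!inflight[i].is_write) {
            continue;
        }
        round_to_granularity(inflight[i].offset, inflight[i].len, gran,
                             &ws, &we);
        if (s < we && ws < e) {
            return true;
        }
    }
    return false;
}
```

With a granularity of 4096, a 512-byte write at offset 512 and a 512-byte request at offset 2048 collide, because both fall in the block [0, 4096). With a granularity of 512 they do not, and the new request can proceed immediately.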
Patch 16 passes down the host block size as physical block size so
that hopefully guest OSes try to align partitions.
Patch 17 adds an option to qemu-io that lets you test these scenarios
even without a 4k-sector disk.
Paolo Bonzini (17):
  block: do not rely on open_flags for bdrv_is_snapshot
  block: store actual flags in bs->open_flags
  block: pass protocol flags up to the format
  block: non-raw protocols never cache
  block: remove enable_write_cache
  block: move flag bits together
  raw: remove the aligned_buf
  block: rename buffer_alignment to guest_block_size
  block: add host_block_size
  raw: probe host_block_size
  iscsi: save host block size
  block: allow waiting only for overlapping writes
  block: allow waiting at arbitrary granularity
  block: protect against "torn reads" for guest_block_size > host_block_size
  block: align and serialize I/O when guest_block_size < host_block_size
  block: default physical block size to host block size
  qemu-io: add blocksize argument to open
Makefile.objs | 4 +-
block.c | 313 ++++++++++++++++++++++++++++++++++++++++++++++-------
block.h | 17 +---
block/curl.c | 1 +
block/iscsi.c | 2 +
block/nbd.c | 1 +
block/raw-posix.c | 97 ++++++++++-------
block/raw-win32.c | 42 +++++++
block/rbd.c | 1 +
block/sheepdog.c | 1 +
block/vdi.c | 1 +
block_int.h | 25 ++---
hw/ide/core.c | 2 +-
hw/scsi-disk.c | 2 +-
hw/scsi-generic.c | 2 +-
hw/virtio-blk.c | 2 +-
qemu-io.c | 33 +++++-
trace-events | 1 +
18 files changed, 429 insertions(+), 118 deletions(-)
--
1.7.7.1