From: Halil Pasic <pasic@linux.ibm.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
"open list:AIO" <linux-aio@kvack.org>,
qemu-block@nongnu.org, qemu-devel@nongnu.org,
Max Reitz <mreitz@redhat.com>, Halil Pasic <pasic@linux.ibm.com>,
Christian Borntraeger <borntraeger@de.ibm.com>,
Benjamin LaHaise <bcrl@kvack.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Jan Hoeppner <Jan.Hoeppner@de.ibm.com>,
Stefan Haberland <sth@linux.ibm.com>
Subject: Re: [PULL 18/28] file-posix: try BLKSECTGET on block devices too, do not round to power of 2
Date: Wed, 22 Sep 2021 21:51:43 +0200 [thread overview]
Message-ID: <20210922215143.7f289016.pasic@linux.ibm.com> (raw)
In-Reply-To: <20210906162420.5af35eb9.pasic@linux.ibm.com>
On Mon, 6 Sep 2021 16:24:20 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:
> On Fri, 25 Jun 2021 16:18:12 +0200
> Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> > bs->sg is only true for character devices, but block devices can also
> > be used with scsi-block and scsi-generic. Unfortunately BLKSECTGET
> > returns bytes in an int for /dev/sgN devices, and sectors in a short
> > for block devices, so account for that in the code.
> >
> > The maximum transfer also need not be a power of 2 (for example I have
> > seen disks with 1280 KiB maximum transfer) so there's no need to pass
> > the result through pow2floor.
> >
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>
> We have found that this patch leads to in guest I/O errors when DASD
> is used as a source device. I.e. libvirt domain xml wise something like:
>
> <disk type='block' device='disk'>
> <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
> <source dev='/dev/disk/by-id/ccw-XXXXXXX'/>
> <backingStore/>
> <target dev='vdb' bus='virtio'/>
> <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0008'/>
> </disk>
>
> I don't think it is the fault of this patch: it LGTM. But it correlates
> 100%, furthermore the problem seems to be related to the value of
> bl.max_iov which now comes from sysfs.
>
> We are still investigating what is actually wrong. Just wanted to give
> everybody a heads-up that this does seem to cause a nasty regression on
> s390x, even if the code itself is perfect.
>
We have figured out what is going on here. The problem seems to be
specific to linux aio, which seems to limit the size of the iovec to
1024 (UIO_MAXIOV).
Because of this requests get rejected with -EINVAL by the io_submit()
syscall. Here comes a real world example:
(gdb) p *laiocb
$5 = {co = 0x3ff700072c0, ctx = 0x3fe60002650, iocb = {data = 0x0, aio_rw_flags = 0, key = 0,
aio_lio_opcode = 8, aio_reqprio = 0, aio_fildes = 38, u = {c = {buf = 0x3ff70055bc0,
nbytes = 1274, offset = 19977953280, __pad3 = 0, flags = 1, resfd = 43}, v = {
vec = 0x3ff70055bc0, nr = 0, offset = 19977953280}, poll = {__pad1 = 1023,
events = 1879399360}, saddr = {addr = 0x3ff70055bc0, len = 0}}}, ret = -22,
nbytes = 20586496, qiov = 0x3ff700382f8, is_read = false, next = {sqe_next = 0x0}}
On the host kernel side, the -EINVAL comes from this line:
https://elixir.bootlin.com/linux/v5.15-rc2/source/lib/iov_iter.c#L1863
in iovec_from_user() roughly via: __do_sys_io_submit()->
io_submit_one() -> aio_write() -> aio_setup_rw() -> __import_iovec().
Based on the offline discussion with the DASD maintainers, and on the
linux in source tree documentation (Documentation/block/queue-sysfs.rst
und Documentation/block/biodoc.rst), I believe that the DASD driver is
not wrong when advertising the value 65535 for queue/max_segments.
I believe QEMU jumps to the wrong conclusion in blk_get_max_iov() or
in virtio_blk_submit_multireq(), I can't really tell because I'm not
sure what the semantic of blk_get_max_iov() is. But if, I had to, I would
bet that blk_get_max_iov() returns the wrong value, when linux aio is
used. I'm not sure what is the exact relationship between max_segments
and max_iov.
One idea on how to fix this would be, to cap max_iov to UIO_MAXIOV
(unconditionally, or when linux aio is used). But I have to admit, I
don't have clarity. I couldn't even find documentation that states
this limitation of linux aio (I didn't look for it too hard though), so
I can not even be sure this is a QEMU problem.
That is why I decided to write this up first, and start a discussion on
who is playing foul with the most relevant people posted. I intend to try
myself fixing the problem once my understanding of it reaches a
sufficient level.
Thanks in advance for your contribution!
Regards,
Halil
next prev parent reply other threads:[~2021-09-22 19:53 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-25 14:17 [PULL v2 00/28] Misc (including block file-posix) for 2021-06-23 Paolo Bonzini
2021-06-25 14:17 ` [PULL 01/28] target/i386: kvm: add support for TSC scaling Paolo Bonzini
2021-06-25 14:17 ` [PULL 02/28] meson: drop unused CONFIG_GCRYPT_HMAC Paolo Bonzini
2021-06-25 14:17 ` [PULL 03/28] configure: drop unused variables for xts Paolo Bonzini
2021-06-25 14:17 ` [PULL 04/28] meson: remove preadv from summary Paolo Bonzini
2021-06-25 14:17 ` [PULL 05/28] tests: remove QCRYPTO_HAVE_TLS_TEST_SUPPORT Paolo Bonzini
2021-06-25 14:18 ` [PULL 06/28] configure, meson: convert crypto detection to meson Paolo Bonzini
2021-06-25 14:18 ` [PULL 07/28] configure, meson: convert libtasn1 " Paolo Bonzini
2021-06-25 14:18 ` [PULL 08/28] configure, meson: convert pam " Paolo Bonzini
2021-06-25 14:18 ` [PULL 09/28] configure, meson: convert libusb " Paolo Bonzini
2021-06-25 14:18 ` [PULL 10/28] configure, meson: convert libcacard " Paolo Bonzini
2021-06-25 14:18 ` [PULL 11/28] configure, meson: convert libusbredir " Paolo Bonzini
2021-06-25 14:18 ` [PULL 12/28] KVM: Fix dirty ring mmap incorrect size due to renaming accident Paolo Bonzini
2021-06-25 14:18 ` [PULL 13/28] file-posix: fix max_iov for /dev/sg devices Paolo Bonzini
2021-06-25 14:18 ` [PULL 14/28] scsi-generic: pass max_segments via max_iov field in BlockLimits Paolo Bonzini
2021-06-25 14:18 ` [PULL 15/28] osdep: provide ROUND_DOWN macro Paolo Bonzini
2021-06-29 4:12 ` Philippe Mathieu-Daudé
2021-06-25 14:18 ` [PULL 16/28] block-backend: align max_transfer to request alignment Paolo Bonzini
2021-06-25 14:18 ` [PULL 17/28] block: add max_hw_transfer to BlockLimits Paolo Bonzini
2021-06-25 14:18 ` [PULL 18/28] file-posix: try BLKSECTGET on block devices too, do not round to power of 2 Paolo Bonzini
2021-09-06 14:24 ` Halil Pasic
2021-09-22 19:51 ` Halil Pasic [this message]
2021-09-23 9:18 ` Recent qemu patch results in aio failures with host DASD disks resulting in guest I/O errors Christian Borntraeger
2021-09-23 10:57 ` [PULL 18/28] file-posix: try BLKSECTGET on block devices too, do not round to power of 2 Paolo Bonzini
2021-09-23 12:13 ` Halil Pasic
2021-09-23 13:02 ` Paolo Bonzini
2021-06-25 14:18 ` [PULL 19/28] block: feature detection for host block support Paolo Bonzini
2021-06-25 14:18 ` [PULL 20/28] block: check for sys/disk.h Paolo Bonzini
2021-06-25 14:18 ` [PULL 21/28] block: try BSD disk size ioctls one after another Paolo Bonzini
2021-06-25 14:18 ` [PULL 22/28] block: detect DKIOCGETBLOCKCOUNT/SIZE before use Paolo Bonzini
2021-06-25 14:18 ` [PULL 23/28] file-posix: handle EINTR during ioctl Paolo Bonzini
2021-06-25 14:18 ` [PULL 24/28] machine: move dies from X86MachineState to CpuTopology Paolo Bonzini
2021-06-25 14:18 ` [PULL 25/28] machine: move common smp_parse code to caller Paolo Bonzini
2021-06-25 14:18 ` [PULL 26/28] machine: add error propagation to mc->smp_parse Paolo Bonzini
2021-06-25 14:18 ` [PULL 27/28] machine: pass QAPI struct " Paolo Bonzini
2021-06-25 14:18 ` [PULL 28/28] machine: reject -smp dies!=1 for non-PC machines Paolo Bonzini
2021-06-29 8:37 ` [PULL v2 00/28] Misc (including block file-posix) for 2021-06-23 Peter Maydell
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210922215143.7f289016.pasic@linux.ibm.com \
--to=pasic@linux.ibm.com \
--cc=Jan.Hoeppner@de.ibm.com \
--cc=bcrl@kvack.org \
--cc=borntraeger@de.ibm.com \
--cc=kwolf@redhat.com \
--cc=linux-aio@kvack.org \
--cc=mreitz@redhat.com \
--cc=pbonzini@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=sth@linux.ibm.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.