From: Maxim Levitsky <mlevitsk@redhat.com>
To: Keith Busch <kbusch@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
Christoph Hellwig <hch@lst.de>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
qemu-devel@nongnu.org, kvm@vger.kernel.org,
Kevin Wolf <kwolf@redhat.com>,
Michael Roth <mdroth@linux.vnet.ibm.com>
Subject: Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures
Date: Mon, 03 Oct 2022 10:06:48 +0300 [thread overview]
Message-ID: <2a9fe4f9759b9971e76f719f4c1295eed41ed50c.camel@redhat.com> (raw)
In-Reply-To: <YzmYojlHKZ79mseE@kbusch-mbp>
On Sun, 2022-10-02 at 07:56 -0600, Keith Busch wrote:
> On Sun, Oct 02, 2022 at 11:59:42AM +0300, Maxim Levitsky wrote:
> > On Thu, 2022-09-29 at 19:35 +0200, Paolo Bonzini wrote:
> > > On 9/29/22 18:39, Christoph Hellwig wrote:
> > > > On Thu, Sep 29, 2022 at 10:37:22AM -0600, Keith Busch wrote:
> > > > > > I am aware, and I've submitted the fix to qemu here:
> > > > > >
> > > > > > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
> > > > >
> > > > > I don't think so. Memory alignment and length granularity are two completely
> > > > > different concepts. If anything, the kernel's ABI had been that the length
> > > > > requirement was also required for the memory alignment, not the other way
> > > > > around. That usage will continue working with this kernel patch.
> >
> > Yes, this is how I also understand it - for example for O_DIRECT on a file which
> > resides on 4K block device, you have to use page aligned buffers.
> >
> > But here after the patch, 512 aligned buffer starts working as well - If I
> > understand you correctly the ABI didn't guarantee that such usage would fail,
> > but rather that it might fail.
>
> The kernel patch will allow buffer alignment to work with whatever the hardware
> reports it can support. It could even as low as byte aligned if that's the
> hardware can use that.
>
> The patch aligns direct-io with the same criteria blk_rq_map_user() has always
> used to know if the user space buffer is compatible with the hardware's dma
> requirements. Prior to this patch, the direct-io memory alignment was an
> artificial software constraint, and that constraint creates a lot of
> unnecessary memory pressure.
>
> As has always been the case, each segment needs to be a logical block length
> granularity. QEMU assumed a buffer's page offset also defined the logical block
> size instead of using the actual logical block size that it had previously
> discovered directly.
>
> > If I understand that correctly, after the patch in question,
> > qemu is able to use just 512 bytes aligned buffer to read a single 4K block from the disk,
> > which supposed to fail but wasn't guarnteed to fail.
> >
> > Later qemu it submits iovec which also reads a 4K block but in two parts,
> > and if I understand that correctly, each part (iov) is considered
> > to be a separate IO operation, and thus each has to be in my case 4K in size,
> > and its memory buffer *should* also be 4K aligned.
> >
> > (but it can work with smaller alignement as well).
>
> Right. The iov length needs to match the logical block size. The iov's memory
> offset needs to align to the queue's dma_alignment attribute. The memory
> alignment may be smaller than a block size.
>
> > Assuming that I understand all of this correctly, I agree with Paolo that this is qemu
> > bug, but I do fear that it can cause quite some problems for users,
> > especially for users that use outdated qemu version.
> >
> > It might be too much to ask, but maybe add a Kconfig option to keep legacy behavier
> > for those that need it?
>
> Kconfig doesn't sound right.
>
> The block layer exports all the attributes user space needs to know about for
> direct io.
>
> iov length: /sys/block/<block-dev>/queue/logical_block_size
> iov mem align: /sys/block/<block-dev>/queue/dma_alignment
>
> If you really want to change the behavior, I think maybe we could make the
> dma_alignment attribute writeable (or perhaps add a new attribute specifically
> for dio_alignment) so the user can request something larger.
>
All makes sense now.
New attribute could make sense I guess, and can be set by an udev rule or something.
Anyway I won't worry about this for now, and if there are issues I'll see how we could work
around them.
Thanks for everything,
Best regards,
Maxim Levitsky
next prev parent reply other threads:[~2022-10-03 7:14 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-29 15:41 Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures Maxim Levitsky
2022-09-29 15:48 ` Keith Busch
2022-09-29 16:16 ` Maxim Levitsky
2022-09-29 16:37 ` Keith Busch
2022-09-29 16:39 ` Christoph Hellwig
2022-09-29 17:35 ` Paolo Bonzini
2022-10-02 8:59 ` Maxim Levitsky
2022-10-02 13:56 ` Keith Busch
2022-10-03 7:06 ` Maxim Levitsky [this message]
2022-09-30 11:52 ` Thorsten Leemhuis
2022-11-04 11:59 ` Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures #forregzbot Thorsten Leemhuis
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2a9fe4f9759b9971e76f719f4c1295eed41ed50c.camel@redhat.com \
--to=mlevitsk@redhat.com \
--cc=hch@lst.de \
--cc=kbusch@kernel.org \
--cc=kvm@vger.kernel.org \
--cc=kwolf@redhat.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mdroth@linux.vnet.ibm.com \
--cc=pbonzini@redhat.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).