* Significant brokenness in DIO loopback path
@ 2022-03-10 1:44 Kent Overstreet
2022-03-10 3:16 ` Ming Lei
0 siblings, 1 reply; 3+ messages in thread
From: Kent Overstreet @ 2022-03-10 1:44 UTC (permalink / raw)
To: linux-fsdevel, linux-kernel, Pavel Begunkov
Cc: Christoph Hellwig, Ming Lei, Jens Axboe, Al Viro
So I'm testing bcachefs with the loopback driver in dio mode, and noticing
_significant_ brokenness in the bio_iov_iter_get_pages() path and elsewhere.
1) We don't check that we're not asking for more pages than we're in the
original bio
Noticed this because of another bug:
2) the loopback driver appears to never look at the underlying filesystem's
block size, meaning if the filesystem advertises a block size of 4k the loopback
device's blocksize will still be 512, and we'll end up issuing IOs the DIO path
shouldn't allow due to alignment.
3) iov_iter_bvec_advance() looks like utter nonsense. We're synthesizing a fake
bvec_iter and never using or even looking at one from the original bio, looking
at the construction in iov_iter_bvec().
This is broken; you're assuming you're never going to see bios with partially
completed bvec_iters, or things are going to explode.
Try putting a md raid0 on top of two loopback devices with a sub page block
size, things are just going to explode.
iov_iter_bvec() needs to be changed to take a bio, not a bvec array, and
iov_iter_bvec_advance() should probably just call bio_advance() - and
bio_iov_bvec_set() needs to be changed to just copy bi_iter from the original
bio into the dest bio. You guys made this way more complicated than it needed to
be.
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Significant brokenness in DIO loopback path
2022-03-10 1:44 Significant brokenness in DIO loopback path Kent Overstreet
@ 2022-03-10 3:16 ` Ming Lei
2022-03-10 5:17 ` Kent Overstreet
0 siblings, 1 reply; 3+ messages in thread
From: Ming Lei @ 2022-03-10 3:16 UTC (permalink / raw)
To: Kent Overstreet
Cc: linux-fsdevel, linux-kernel, Pavel Begunkov, Christoph Hellwig,
Jens Axboe, Al Viro
On Wed, Mar 09, 2022 at 08:44:00PM -0500, Kent Overstreet wrote:
> So I'm testing bcachefs with the loopback driver in dio mode, and noticing
> _significant_ brokenness in the bio_iov_iter_get_pages() path and elsewhere.
>
> 1) We don't check that we're not asking for more pages than we're in the
> original bio
>
> Noticed this because of another bug:
>
> 2) the loopback driver appears to never look at the underlying filesystem's
> block size, meaning if the filesystem advertises a block size of 4k the loopback
> device's blocksize will still be 512, and we'll end up issuing IOs the DIO path
> shouldn't allow due to alignment.
I tried to fallback to buffered IO for unaligned dio, it was rejected.
https://lore.kernel.org/linux-block/20211025094437.2837701-1-ming.lei@redhat.com/
Also the ahead of time check may not work as expected because of ioctl
order, I guess that is why you see loop 512 bs even the underlying advertises
big 4k size.
Also loop 512 bs is often useful since the upper FS image may need that.
>
> 3) iov_iter_bvec_advance() looks like utter nonsense. We're synthesizing a fake
> bvec_iter and never using or even looking at one from the original bio, looking
> at the construction in iov_iter_bvec().
>
> This is broken; you're assuming you're never going to see bios with partially
> completed bvec_iters, or things are going to explode.
>
> Try putting a md raid0 on top of two loopback devices with a sub page block
> size, things are just going to explode.
>
> iov_iter_bvec() needs to be changed to take a bio, not a bvec array, and
> iov_iter_bvec_advance() should probably just call bio_advance() - and
> bio_iov_bvec_set() needs to be changed to just copy bi_iter from the original
> bio into the dest bio. You guys made this way more complicated than it needed to
> be.
Can you share the function in loop.c you are talking? Is it lo_rw_aio()? What is
the exact issue in current way?
If the request has > 1 bio, one bvec array is made for call_write_iter.
Thanks
Ming
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Significant brokenness in DIO loopback path
2022-03-10 3:16 ` Ming Lei
@ 2022-03-10 5:17 ` Kent Overstreet
0 siblings, 0 replies; 3+ messages in thread
From: Kent Overstreet @ 2022-03-10 5:17 UTC (permalink / raw)
To: Ming Lei
Cc: linux-fsdevel, linux-kernel, Pavel Begunkov, Christoph Hellwig,
Jens Axboe, Al Viro
On Thu, Mar 10, 2022 at 11:16:35AM +0800, Ming Lei wrote:
> On Wed, Mar 09, 2022 at 08:44:00PM -0500, Kent Overstreet wrote:
> > So I'm testing bcachefs with the loopback driver in dio mode, and noticing
> > _significant_ brokenness in the bio_iov_iter_get_pages() path and elsewhere.
> >
> > 1) We don't check that we're not asking for more pages than we're in the
> > original bio
> >
> > Noticed this because of another bug:
> >
> > 2) the loopback driver appears to never look at the underlying filesystem's
> > block size, meaning if the filesystem advertises a block size of 4k the loopback
> > device's blocksize will still be 512, and we'll end up issuing IOs the DIO path
> > shouldn't allow due to alignment.
>
> I tried to fallback to buffered IO for unaligned dio, it was rejected.
>
> https://lore.kernel.org/linux-block/20211025094437.2837701-1-ming.lei@redhat.com/
I agree with Christoph, mixing buffered and DIO is a bad idea and we should be
rejecting unaligned IO. However, filesystems reject unaligned DIO, I don't think
loopback necessarily needs to be checking for that.
> Also the ahead of time check may not work as expected because of ioctl
> order, I guess that is why you see loop 512 bs even the underlying advertises
> big 4k siz
Ugh... artifact of the old workflow where we create a loopback device, then
attach it to a file. Would be much better if we always created a loopback device
and attached it to a file as part of the same operation.
> Also loop 512 bs is often useful since the upper FS image may need that.
Not a valid rationale here. If 512 aligned DIO doesn't work, then it doesn't
work, it's not loopback's job to fake it.
>
> >
> > 3) iov_iter_bvec_advance() looks like utter nonsense. We're synthesizing a fake
> > bvec_iter and never using or even looking at one from the original bio, looking
> > at the construction in iov_iter_bvec().
> >
> > This is broken; you're assuming you're never going to see bios with partially
> > completed bvec_iters, or things are going to explode.
> >
> > Try putting a md raid0 on top of two loopback devices with a sub page block
> > size, things are just going to explode.
> >
> > iov_iter_bvec() needs to be changed to take a bio, not a bvec array, and
> > iov_iter_bvec_advance() should probably just call bio_advance() - and
> > bio_iov_bvec_set() needs to be changed to just copy bi_iter from the original
> > bio into the dest bio. You guys made this way more complicated than it needed to
> > be.
>
> Can you share the function in loop.c you are talking? Is it lo_rw_aio()? What is
> the exact issue in current way?
So looking at lo_rw_aio() it does look correct, just perverse.
Something's still buggy though,
Why are we using request queues here at all? Request merging was useful in the
rotating disk era, and especially back when filesystems were block based and not
extent based and sent a lot of mergeable requests - that's not the case so much
anymore.
Might it not be simpler to just skip all that and have dio-mode loopback be bio
based?
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2022-03-10 5:17 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-03-10 1:44 Significant brokenness in DIO loopback path Kent Overstreet
2022-03-10 3:16 ` Ming Lei
2022-03-10 5:17 ` Kent Overstreet
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).