* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
[not found] ` <03b01c699c9fab64736d04891f1e835aef06c886.camel@redhat.com>
@ 2020-11-12 11:19 ` Jan Kara
2020-11-12 12:00 ` Jan Kara
2020-11-12 15:38 ` Maxim Levitsky
0 siblings, 2 replies; 6+ messages in thread
From: Jan Kara @ 2020-11-12 11:19 UTC (permalink / raw)
To: Maxim Levitsky
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Jan Kara, Paolo Bonzini,
Max Reitz, Darrick J . Wong, qemu-block, linux-block,
Christoph Hellwig, Jens Axboe
[added some relevant people and lists to CC]
On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > clone of "starship_production"
>
> The git-publish destroyed the cover letter:
>
> For the reference this is for bz #1872633
>
> The issue is that current kernel code that implements 'fallocate'
> on kernel block devices roughly works like that:
>
> 1. Flush the page cache on the range that is about to be discarded.
> 2. Issue the discard and wait for it to finish.
> (as far as I can see the discard doesn't go through the
> page cache).
>
> 3. Check if the page cache is dirty for this range,
> if it is dirty (meaning that someone wrote to it meanwhile)
> return -EBUSY.
>
> This means that if qemu (or qemu-img) issues a write, and then
> discard to the area that shares a page, -EBUSY can be returned by
> the kernel.
Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
EBUSY which seems wrong to me. IMO we should handle this gracefully in the
kernel so we need to fix this.
> On the other hand, for example, the ext4 implementation of discard
> doesn't seem to be affected. It does take a lock on the inode to avoid
> concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
Well, filesystem hole punching is somewhat different beast than block device
discard (at least implementation wise).
> Doing fsync and retrying is seems to resolve this issue, but it might be
> a too big hammer. Just retrying doesn't work, indicating that maybe the
> code that flushes the page cache in (1) doesn't do this correctly ?
>
> It also can be racy unless special means are done to block IO from happening
> from qemu during this fsync.
>
> This patch series contains two patches:
>
> First patch just lets the file-posix ignore the -EBUSY errors, which is
> technically enough to fail back to plain write in this case, but seems wrong.
>
> And the second patch adds an optimization to qemu-img to avoid such a
> fragmented write/discard in the first place.
>
> Both patches make the reproducer work for this particular bugzilla,
> but I don't think they are enough.
>
> What do you think?
So if the EBUSY error happens because something happened to the page cache
outside of discarded range (like you describe above), that is a kernel bug
than needs to get fixed. EBUSY should really mean - someone wrote to the
discarded range while discard was running and userspace app has to deal
with that depending on what it aims to do...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
2020-11-12 11:19 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Jan Kara
@ 2020-11-12 12:00 ` Jan Kara
2020-11-12 22:08 ` Darrick J. Wong
2020-11-12 15:38 ` Maxim Levitsky
1 sibling, 1 reply; 6+ messages in thread
From: Jan Kara @ 2020-11-12 12:00 UTC (permalink / raw)
To: Maxim Levitsky
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Jan Kara, Paolo Bonzini,
Max Reitz, Darrick J . Wong, qemu-block, linux-block,
Christoph Hellwig, Jens Axboe
On Thu 12-11-20 12:19:51, Jan Kara wrote:
> [added some relevant people and lists to CC]
>
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > clone of "starship_production"
> >
> > The git-publish destroyed the cover letter:
> >
> > For the reference this is for bz #1872633
> >
> > The issue is that current kernel code that implements 'fallocate'
> > on kernel block devices roughly works like that:
> >
> > 1. Flush the page cache on the range that is about to be discarded.
> > 2. Issue the discard and wait for it to finish.
> > (as far as I can see the discard doesn't go through the
> > page cache).
> >
> > 3. Check if the page cache is dirty for this range,
> > if it is dirty (meaning that someone wrote to it meanwhile)
> > return -EBUSY.
> >
> > This means that if qemu (or qemu-img) issues a write, and then
> > discard to the area that shares a page, -EBUSY can be returned by
> > the kernel.
>
> Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> kernel so we need to fix this.
>
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
>
> Well, filesystem hole punching is somewhat different beast than block device
> discard (at least implementation wise).
>
> > Doing fsync and retrying is seems to resolve this issue, but it might be
> > a too big hammer. Just retrying doesn't work, indicating that maybe the
> > code that flushes the page cache in (1) doesn't do this correctly ?
> >
> > It also can be racy unless special means are done to block IO from happening
> > from qemu during this fsync.
> >
> > This patch series contains two patches:
> >
> > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > technically enough to fail back to plain write in this case, but seems wrong.
> >
> > And the second patch adds an optimization to qemu-img to avoid such a
> > fragmented write/discard in the first place.
> >
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> >
> > What do you think?
>
> So if the EBUSY error happens because something happened to the page cache
> outside of discarded range (like you describe above), that is a kernel bug
> than needs to get fixed. EBUSY should really mean - someone wrote to the
> discarded range while discard was running and userspace app has to deal
> with that depending on what it aims to do...
So I was looking what it would take to fix this inside the kernel. The
problem is that invalidate_inode_pages2_range() is working on page
granularity and it is non-trivial to extend it to work on byte granularity
since we don't support something like "try to reclaim part of a page". But
I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
instead of truncate_inode_pages_range() again? I mean the EBUSY detection
cannot be reliable anyway and userspace has no way of knowing whether a
write happened before discard or after it so just discarding data is fine
from this point of view. Darrick?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
2020-11-12 11:19 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Jan Kara
2020-11-12 12:00 ` Jan Kara
@ 2020-11-12 15:38 ` Maxim Levitsky
2020-11-13 10:07 ` Jan Kara
1 sibling, 1 reply; 6+ messages in thread
From: Maxim Levitsky @ 2020-11-12 15:38 UTC (permalink / raw)
To: Jan Kara
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz,
Darrick J . Wong, qemu-block, linux-block, Christoph Hellwig,
Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 3345 bytes --]
On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> [added some relevant people and lists to CC]
>
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > clone of "starship_production"
> >
> > The git-publish destroyed the cover letter:
> >
> > For the reference this is for bz #1872633
> >
> > The issue is that current kernel code that implements 'fallocate'
> > on kernel block devices roughly works like that:
> >
> > 1. Flush the page cache on the range that is about to be discarded.
> > 2. Issue the discard and wait for it to finish.
> > (as far as I can see the discard doesn't go through the
> > page cache).
> >
> > 3. Check if the page cache is dirty for this range,
> > if it is dirty (meaning that someone wrote to it meanwhile)
> > return -EBUSY.
> >
> > This means that if qemu (or qemu-img) issues a write, and then
> > discard to the area that shares a page, -EBUSY can be returned by
> > the kernel.
>
> Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> kernel so we need to fix this.
>
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
>
> Well, filesystem hole punching is somewhat different beast than block device
> discard (at least implementation wise).
>
> > Doing fsync and retrying is seems to resolve this issue, but it might be
> > a too big hammer. Just retrying doesn't work, indicating that maybe the
> > code that flushes the page cache in (1) doesn't do this correctly ?
> >
> > It also can be racy unless special means are done to block IO from happening
> > from qemu during this fsync.
> >
> > This patch series contains two patches:
> >
> > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > technically enough to fail back to plain write in this case, but seems wrong.
> >
> > And the second patch adds an optimization to qemu-img to avoid such a
> > fragmented write/discard in the first place.
> >
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> >
> > What do you think?
>
> So if the EBUSY error happens because something happened to the page cache
> outside of discarded range (like you describe above), that is a kernel bug
> than needs to get fixed. EBUSY should really mean - someone wrote to the
> discarded range while discard was running and userspace app has to deal
> with that depending on what it aims to do...
I double checked this, those are the writes/discards according to my debug
prints (I print start and then start+len-1 for each request)
I have attached the patch for this for reference.
ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
fallocate 00007fe00000 00007fffefff
WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
write 00007ffff000 00007ffffdff
ZERO: 0x00007ffffe00
0000801fefff (len:0x1ff200)
fallocate 00007ffffe00 0000801fefff
FALLOCATE failed with error 16
qemu-img: error while writing at byte 2147483136: Device or resource busy
Best regards,
Maxim Levitsky
>
> Honza
[-- Attachment #2: hacks.diff --]
[-- Type: text/x-patch, Size: 3079 bytes --]
commit ce897250babe3527f451c1c54c86b62659a2c29e
Author: Maxim Levitsky <mlevitsk@redhat.com>
Date: Thu Oct 15 17:02:58 2020 +0300
hacks
diff --git a/block/file-posix.c b/block/file-posix.c
index c63926d592..91ca690505 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1440,6 +1440,12 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf)
while (offset < aiocb->aio_nbytes) {
if (aiocb->aio_type & QEMU_AIO_WRITE) {
+
+ long unsigned int size = aiocb->aio_nbytes - offset;
+ long unsigned int off = aiocb->aio_offset + offset;
+
+ printf(" write %012lx %012lx\n", off, off+size-1);
+
len = pwrite(aiocb->aio_fildes,
(const char *)buf + offset,
aiocb->aio_nbytes - offset,
@@ -1581,10 +1587,15 @@ static int translate_err(int err)
static int do_fallocate(int fd, int mode, off_t offset, off_t len)
{
do {
+
+ printf(" fallocate %012lx %012lx\n", offset, offset+len-1);
+
if (fallocate(fd, mode, offset, len) == 0) {
return 0;
}
} while (errno == EINTR);
+
+ printf("FALLOCATE failed with error %d\n", errno);
return translate_err(-errno);
}
#endif
diff --git a/qemu-img.c b/qemu-img.c
index c2c56fc797..64d3b84728 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1803,6 +1803,7 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
}
s->sector_next_status = sector_num + n;
+ printf("NEXT_STATUS: %lx\n", s->sector_next_status << 9);
}
n = MIN(n, s->sector_next_status - sector_num);
@@ -1878,6 +1879,10 @@ static int coroutine_fn convert_co_read(ImgConvertState *s, int64_t sector_num,
return 0;
}
+static void test_print(const char* op, unsigned long start, unsigned long len)
+{
+ printf("%s: 0x%012lx %012lx (len:0x%lx)\n", op, start, start + len-1, len);
+}
static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
int nb_sectors, uint8_t *buf,
@@ -1911,6 +1916,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
(s->compressed &&
!buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)))
{
+ test_print("WRITE", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS);
+
ret = blk_co_pwrite(s->target, sector_num << BDRV_SECTOR_BITS,
n << BDRV_SECTOR_BITS, buf, flags);
if (ret < 0) {
@@ -1925,6 +1932,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
assert(!s->target_has_backing);
break;
}
+ test_print("ZERO", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS);
+
ret = blk_co_pwrite_zeroes(s->target,
sector_num << BDRV_SECTOR_BITS,
n << BDRV_SECTOR_BITS,
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
2020-11-12 12:00 ` Jan Kara
@ 2020-11-12 22:08 ` Darrick J. Wong
2020-12-07 17:23 ` Maxim Levitsky
0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2020-11-12 22:08 UTC (permalink / raw)
To: Jan Kara
Cc: Maxim Levitsky, qemu-devel, Kevin Wolf, Peter Lieven,
Paolo Bonzini, Max Reitz, qemu-block, linux-block,
Christoph Hellwig, Jens Axboe
On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> >
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > >
> > > The git-publish destroyed the cover letter:
> > >
> > > For the reference this is for bz #1872633
> > >
> > > The issue is that current kernel code that implements 'fallocate'
> > > on kernel block devices roughly works like that:
> > >
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > > (as far as I can see the discard doesn't go through the
> > > page cache).
> > >
> > > 3. Check if the page cache is dirty for this range,
> > > if it is dirty (meaning that someone wrote to it meanwhile)
> > > return -EBUSY.
> > >
> > > This means that if qemu (or qemu-img) issues a write, and then
> > > discard to the area that shares a page, -EBUSY can be returned by
> > > the kernel.
> >
> > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > kernel so we need to fix this.
> >
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> >
> > Well, filesystem hole punching is somewhat different beast than block device
> > discard (at least implementation wise).
> >
> > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > a too big hammer. Just retrying doesn't work, indicating that maybe the
> > > code that flushes the page cache in (1) doesn't do this correctly ?
> > >
> > > It also can be racy unless special means are done to block IO from happening
> > > from qemu during this fsync.
> > >
> > > This patch series contains two patches:
> > >
> > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > technically enough to fail back to plain write in this case, but seems wrong.
> > >
> > > And the second patch adds an optimization to qemu-img to avoid such a
> > > fragmented write/discard in the first place.
> > >
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > >
> > > What do you think?
> >
> > So if the EBUSY error happens because something happened to the page cache
> > outside of discarded range (like you describe above), that is a kernel bug
> > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > discarded range while discard was running and userspace app has to deal
> > with that depending on what it aims to do...
>
> So I was looking what it would take to fix this inside the kernel. The
> problem is that invalidate_inode_pages2_range() is working on page
> granularity and it is non-trivial to extend it to work on byte granularity
> since we don't support something like "try to reclaim part of a page". But
> I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
> instead of truncate_inode_pages_range() again? I mean the EBUSY detection
> cannot be reliable anyway and userspace has no way of knowing whether a
> write happened before discard or after it so just discarding data is fine
> from this point of view. Darrick?
Hmmm, I think I overlooked the fact that we can do buffered writes into
a block device's pagecache without taking any of the usual locks that
have to be held for filesystem files. This is essentially a race
between a not-page-aligned fallocate and a buffered write to a different
sector that is mapped by a page that caches part of the fallocate range.
So yes, Jan is right that we need to use truncate_bdev_range instead of
invalidate_inode_pages2_range because the former will zero the sub-page
ranges on either end of the fallocate request instead of returning
-EBUSY because someone dirtied a part of a page that wasn't involved in
the fallocate operation.
I /probably/ just copy-pasta'd that invalidation call from directio
without thinking hard enough about it, sorry about that. :(
--D
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
2020-11-12 15:38 ` Maxim Levitsky
@ 2020-11-13 10:07 ` Jan Kara
0 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2020-11-13 10:07 UTC (permalink / raw)
To: Maxim Levitsky
Cc: Jan Kara, qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini,
Max Reitz, Darrick J . Wong, qemu-block, linux-block,
Christoph Hellwig, Jens Axboe
On Thu 12-11-20 17:38:36, Maxim Levitsky wrote:
> On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> >
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > >
> > > The git-publish destroyed the cover letter:
> > >
> > > For the reference this is for bz #1872633
> > >
> > > The issue is that current kernel code that implements 'fallocate'
> > > on kernel block devices roughly works like that:
> > >
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > > (as far as I can see the discard doesn't go through the
> > > page cache).
> > >
> > > 3. Check if the page cache is dirty for this range,
> > > if it is dirty (meaning that someone wrote to it meanwhile)
> > > return -EBUSY.
> > >
> > > This means that if qemu (or qemu-img) issues a write, and then
> > > discard to the area that shares a page, -EBUSY can be returned by
> > > the kernel.
> >
> > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > kernel so we need to fix this.
> >
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> >
> > Well, filesystem hole punching is somewhat different beast than block device
> > discard (at least implementation wise).
> >
> > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > a too big hammer. Just retrying doesn't work, indicating that maybe the
> > > code that flushes the page cache in (1) doesn't do this correctly ?
> > >
> > > It also can be racy unless special means are done to block IO from happening
> > > from qemu during this fsync.
> > >
> > > This patch series contains two patches:
> > >
> > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > technically enough to fail back to plain write in this case, but seems wrong.
> > >
> > > And the second patch adds an optimization to qemu-img to avoid such a
> > > fragmented write/discard in the first place.
> > >
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > >
> > > What do you think?
> >
> > So if the EBUSY error happens because something happened to the page cache
> > outside of discarded range (like you describe above), that is a kernel bug
> > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > discarded range while discard was running and userspace app has to deal
> > with that depending on what it aims to do...
> I double checked this, those are the writes/discards according to my debug
> prints (I print start and then start+len-1 for each request)
> I have attached the patch for this for reference.
>
> ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
> fallocate 00007fe00000 00007fffefff
Yeah, the end at 7ffff000 is indeed not 4k aligned...
> WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
> write 00007ffff000 00007ffffdff
.. and this write is following discarded area in the same page
(7ffff000..7ffffdff).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
2020-11-12 22:08 ` Darrick J. Wong
@ 2020-12-07 17:23 ` Maxim Levitsky
0 siblings, 0 replies; 6+ messages in thread
From: Maxim Levitsky @ 2020-12-07 17:23 UTC (permalink / raw)
To: Darrick J. Wong, Jan Kara
Cc: qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz,
qemu-block, linux-block, Christoph Hellwig, Jens Axboe
On Thu, 2020-11-12 at 14:08 -0800, Darrick J. Wong wrote:
> On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> > On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > > [added some relevant people and lists to CC]
> > >
> > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > > clone of "starship_production"
> > > >
> > > > The git-publish destroyed the cover letter:
> > > >
> > > > For the reference this is for bz #1872633
> > > >
> > > > The issue is that current kernel code that implements 'fallocate'
> > > > on kernel block devices roughly works like that:
> > > >
> > > > 1. Flush the page cache on the range that is about to be discarded.
> > > > 2. Issue the discard and wait for it to finish.
> > > > (as far as I can see the discard doesn't go through the
> > > > page cache).
> > > >
> > > > 3. Check if the page cache is dirty for this range,
> > > > if it is dirty (meaning that someone wrote to it meanwhile)
> > > > return -EBUSY.
> > > >
> > > > This means that if qemu (or qemu-img) issues a write, and then
> > > > discard to the area that shares a page, -EBUSY can be returned by
> > > > the kernel.
> > >
> > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > > kernel so we need to fix this.
> > >
> > > > On the other hand, for example, the ext4 implementation of discard
> > > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > >
> > > Well, filesystem hole punching is somewhat different beast than block device
> > > discard (at least implementation wise).
> > >
> > > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > > a too big hammer. Just retrying doesn't work, indicating that maybe the
> > > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > >
> > > > It also can be racy unless special means are done to block IO from happening
> > > > from qemu during this fsync.
> > > >
> > > > This patch series contains two patches:
> > > >
> > > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > > technically enough to fail back to plain write in this case, but seems wrong.
> > > >
> > > > And the second patch adds an optimization to qemu-img to avoid such a
> > > > fragmented write/discard in the first place.
> > > >
> > > > Both patches make the reproducer work for this particular bugzilla,
> > > > but I don't think they are enough.
> > > >
> > > > What do you think?
> > >
> > > So if the EBUSY error happens because something happened to the page cache
> > > outside of discarded range (like you describe above), that is a kernel bug
> > > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > > discarded range while discard was running and userspace app has to deal
> > > with that depending on what it aims to do...
> >
> > So I was looking what it would take to fix this inside the kernel. The
> > problem is that invalidate_inode_pages2_range() is working on page
> > granularity and it is non-trivial to extend it to work on byte granularity
> > since we don't support something like "try to reclaim part of a page". But
> > I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
> > instead of truncate_inode_pages_range() again? I mean the EBUSY detection
> > cannot be reliable anyway and userspace has no way of knowing whether a
> > write happened before discard or after it so just discarding data is fine
> > from this point of view. Darrick?
>
> Hmmm, I think I overlooked the fact that we can do buffered writes into
> a block device's pagecache without taking any of the usual locks that
> have to be held for filesystem files. This is essentially a race
> between a not-page-aligned fallocate and a buffered write to a different
> sector that is mapped by a page that caches part of the fallocate range.
>
> So yes, Jan is right that we need to use truncate_bdev_range instead of
> invalidate_inode_pages2_range because the former will zero the sub-page
> ranges on either end of the fallocate request instead of returning
> -EBUSY because someone dirtied a part of a page that wasn't involved in
> the fallocate operation.
>
> I /probably/ just copy-pasta'd that invalidation call from directio
> without thinking hard enough about it, sorry about that. :(
Any update on this?
Today I took a look at both truncate_bdev_range,
and at invalidate_inode_pages2_range.
I see that the truncate_bdev_range can't really fail other
than check for a mounted
filesystem.
It calls truncate_inode_pages_range which writes back all
dirty pages and waits for writeback to finish.
So it won't detect dirty pages as invalidate_inode_pages2_range does
Also AFAIK the
kernel page cache indeed
only tracks the dirtiness of a whole page (e.g a 512 byte write,
will cause the whole page to be written back, unless the
write was done using O_DIRECT)
So if a page, part of
which is discarded, becomes dirty we can't really
know if someone wrote to the discarded region, or only near it.
I vote to keep the call to invalidate_inode_pages2_range but
exclude the non page
aligned areas from it.
This way we will still do a best effort detection of this case,
while not causing any false positives.
If you agree, I'll send a patch for this.
Best regards,
Maxim Levitsky
>
> --D
>
> > Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2020-12-07 17:25 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20201111153913.41840-1-mlevitsk@redhat.com>
[not found] ` <03b01c699c9fab64736d04891f1e835aef06c886.camel@redhat.com>
2020-11-12 11:19 ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Jan Kara
2020-11-12 12:00 ` Jan Kara
2020-11-12 22:08 ` Darrick J. Wong
2020-12-07 17:23 ` Maxim Levitsky
2020-11-12 15:38 ` Maxim Levitsky
2020-11-13 10:07 ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).