Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
       [not found] ` <03b01c699c9fab64736d04891f1e835aef06c886.camel@redhat.com>
@ 2020-11-12 11:19   ` Jan Kara
  2020-11-12 12:00     ` Jan Kara
  2020-11-12 15:38     ` Maxim Levitsky
  0 siblings, 2 replies; 6+ messages in thread
From: Jan Kara @ 2020-11-12 11:19 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: qemu-devel, Kevin Wolf, Peter Lieven, Jan Kara, Paolo Bonzini,
	Max Reitz, Darrick J . Wong, qemu-block, linux-block,
	Christoph Hellwig, Jens Axboe

[added some relevant people and lists to CC]

On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > clone of "starship_production"
> 
> The git-publish destroyed the cover letter:
> 
> For the reference this is for bz #1872633
> 
> The issue is that current kernel code that implements 'fallocate'
> on kernel block devices roughly works like that:
> 
> 1. Flush the page cache on the range that is about to be discarded.
> 2. Issue the discard and wait for it to finish.
>    (as far as I can see the discard doesn't go through the
>    page cache).
> 
> 3. Check if the page cache is dirty for this range,
>    if it is dirty (meaning that someone wrote to it meanwhile)
>    return -EBUSY.
> 
> This means that if qemu (or qemu-img) issues a write, and then
> discard to the area that shares a page, -EBUSY can be returned by
> the kernel.

Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
EBUSY which seems wrong to me. IMO we should handle this gracefully in the
kernel so we need to fix this.

> On the other hand, for example, the ext4 implementation of discard
> doesn't seem to be affected. It does take a lock on the inode to avoid
> concurrent IO and flushes O_DIRECT writers prior to doing discard thought.

Well, filesystem hole punching is somewhat different beast than block device
discard (at least implementation wise).

> Doing fsync and retrying is seems to resolve this issue, but it might be
> a too big hammer.  Just retrying doesn't work, indicating that maybe the
> code that flushes the page cache in (1) doesn't do this correctly ?
> 
> It also can be racy unless special means are done to block IO from happening
> from qemu during this fsync.
> 
> This patch series contains two patches:
> 
> First patch just lets the file-posix ignore the -EBUSY errors, which is
> technically enough to fail back to plain write in this case, but seems wrong.
> 
> And the second patch adds an optimization to qemu-img to avoid such a
> fragmented write/discard in the first place.
> 
> Both patches make the reproducer work for this particular bugzilla,
> but I don't think they are enough.
> 
> What do you think?

So if the EBUSY error happens because something happened to the page cache
outside of discarded range (like you describe above), that is a kernel bug
than needs to get fixed. EBUSY should really mean - someone wrote to the
discarded range while discard was running and userspace app has to deal
with that depending on what it aims to do...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 11:19   ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Jan Kara
@ 2020-11-12 12:00     ` Jan Kara
  2020-11-12 22:08       ` Darrick J. Wong
  2020-11-12 15:38     ` Maxim Levitsky
  1 sibling, 1 reply; 6+ messages in thread
From: Jan Kara @ 2020-11-12 12:00 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: qemu-devel, Kevin Wolf, Peter Lieven, Jan Kara, Paolo Bonzini,
	Max Reitz, Darrick J . Wong, qemu-block, linux-block,
	Christoph Hellwig, Jens Axboe

On Thu 12-11-20 12:19:51, Jan Kara wrote:
> [added some relevant people and lists to CC]
> 
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > clone of "starship_production"
> > 
> > The git-publish destroyed the cover letter:
> > 
> > For the reference this is for bz #1872633
> > 
> > The issue is that current kernel code that implements 'fallocate'
> > on kernel block devices roughly works like that:
> > 
> > 1. Flush the page cache on the range that is about to be discarded.
> > 2. Issue the discard and wait for it to finish.
> >    (as far as I can see the discard doesn't go through the
> >    page cache).
> > 
> > 3. Check if the page cache is dirty for this range,
> >    if it is dirty (meaning that someone wrote to it meanwhile)
> >    return -EBUSY.
> > 
> > This means that if qemu (or qemu-img) issues a write, and then
> > discard to the area that shares a page, -EBUSY can be returned by
> > the kernel.
> 
> Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> kernel so we need to fix this.
> 
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> 
> Well, filesystem hole punching is somewhat different beast than block device
> discard (at least implementation wise).
> 
> > Doing fsync and retrying is seems to resolve this issue, but it might be
> > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > code that flushes the page cache in (1) doesn't do this correctly ?
> > 
> > It also can be racy unless special means are done to block IO from happening
> > from qemu during this fsync.
> > 
> > This patch series contains two patches:
> > 
> > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > technically enough to fail back to plain write in this case, but seems wrong.
> > 
> > And the second patch adds an optimization to qemu-img to avoid such a
> > fragmented write/discard in the first place.
> > 
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> > 
> > What do you think?
> 
> So if the EBUSY error happens because something happened to the page cache
> outside of discarded range (like you describe above), that is a kernel bug
> than needs to get fixed. EBUSY should really mean - someone wrote to the
> discarded range while discard was running and userspace app has to deal
> with that depending on what it aims to do...

So I was looking what it would take to fix this inside the kernel. The
problem is that invalidate_inode_pages2_range() is working on page
granularity and it is non-trivial to extend it to work on byte granularity
since we don't support something like "try to reclaim part of a page". But
I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
instead of truncate_inode_pages_range() again? I mean the EBUSY detection
cannot be reliable anyway and userspace has no way of knowing whether a
write happened before discard or after it so just discarding data is fine
from this point of view. Darrick?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 11:19   ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Jan Kara
  2020-11-12 12:00     ` Jan Kara
@ 2020-11-12 15:38     ` Maxim Levitsky
  2020-11-13 10:07       ` Jan Kara
  1 sibling, 1 reply; 6+ messages in thread
From: Maxim Levitsky @ 2020-11-12 15:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz,
	Darrick J . Wong, qemu-block, linux-block, Christoph Hellwig,
	Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 3345 bytes --]

On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> [added some relevant people and lists to CC]
> 
> On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > clone of "starship_production"
> > 
> > The git-publish destroyed the cover letter:
> > 
> > For the reference this is for bz #1872633
> > 
> > The issue is that current kernel code that implements 'fallocate'
> > on kernel block devices roughly works like that:
> > 
> > 1. Flush the page cache on the range that is about to be discarded.
> > 2. Issue the discard and wait for it to finish.
> >    (as far as I can see the discard doesn't go through the
> >    page cache).
> > 
> > 3. Check if the page cache is dirty for this range,
> >    if it is dirty (meaning that someone wrote to it meanwhile)
> >    return -EBUSY.
> > 
> > This means that if qemu (or qemu-img) issues a write, and then
> > discard to the area that shares a page, -EBUSY can be returned by
> > the kernel.
> 
> Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> kernel so we need to fix this.
> 
> > On the other hand, for example, the ext4 implementation of discard
> > doesn't seem to be affected. It does take a lock on the inode to avoid
> > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> 
> Well, filesystem hole punching is somewhat different beast than block device
> discard (at least implementation wise).
> 
> > Doing fsync and retrying is seems to resolve this issue, but it might be
> > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > code that flushes the page cache in (1) doesn't do this correctly ?
> > 
> > It also can be racy unless special means are done to block IO from happening
> > from qemu during this fsync.
> > 
> > This patch series contains two patches:
> > 
> > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > technically enough to fail back to plain write in this case, but seems wrong.
> > 
> > And the second patch adds an optimization to qemu-img to avoid such a
> > fragmented write/discard in the first place.
> > 
> > Both patches make the reproducer work for this particular bugzilla,
> > but I don't think they are enough.
> > 
> > What do you think?
> 
> So if the EBUSY error happens because something happened to the page cache
> outside of discarded range (like you describe above), that is a kernel bug
> than needs to get fixed. EBUSY should really mean - someone wrote to the
> discarded range while discard was running and userspace app has to deal
> with that depending on what it aims to do...
I double checked this, those are the writes/discards according to my debug
prints (I print start and then start+len-1 for each request)
I have attached the patch for this for reference.

ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
       fallocate 00007fe00000 00007fffefff
WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
       write 00007ffff000 00007ffffdff
ZERO: 0x00007ffffe00
0000801fefff (len:0x1ff200)
       fallocate 00007ffffe00 0000801fefff
FALLOCATE failed with error 16
qemu-img: error while writing at byte 2147483136: Device or resource busy


Best regards,
     Maxim Levitsky

> 
> 								Honza


[-- Attachment #2: hacks.diff --]
[-- Type: text/x-patch, Size: 3079 bytes --]

commit ce897250babe3527f451c1c54c86b62659a2c29e
Author: Maxim Levitsky <mlevitsk@redhat.com>
Date:   Thu Oct 15 17:02:58 2020 +0300

    hacks

diff --git a/block/file-posix.c b/block/file-posix.c
index c63926d592..91ca690505 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1440,6 +1440,12 @@ static ssize_t handle_aiocb_rw_linear(RawPosixAIOData *aiocb, char *buf)
 
     while (offset < aiocb->aio_nbytes) {
         if (aiocb->aio_type & QEMU_AIO_WRITE) {
+
+            long unsigned int size = aiocb->aio_nbytes - offset;
+            long unsigned int off = aiocb->aio_offset + offset;
+
+            printf("       write %012lx %012lx\n", off, off+size-1);
+
             len = pwrite(aiocb->aio_fildes,
                          (const char *)buf + offset,
                          aiocb->aio_nbytes - offset,
@@ -1581,10 +1587,15 @@ static int translate_err(int err)
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
     do {
+
+        printf("       fallocate %012lx %012lx\n", offset, offset+len-1);
+
         if (fallocate(fd, mode, offset, len) == 0) {
             return 0;
         }
     } while (errno == EINTR);
+
+    printf("FALLOCATE failed with error %d\n", errno);
     return translate_err(-errno);
 }
 #endif
diff --git a/qemu-img.c b/qemu-img.c
index c2c56fc797..64d3b84728 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1803,6 +1803,7 @@ static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
         }
 
         s->sector_next_status = sector_num + n;
+        printf("NEXT_STATUS: %lx\n", s->sector_next_status << 9);
     }
 
     n = MIN(n, s->sector_next_status - sector_num);
@@ -1878,6 +1879,10 @@ static int coroutine_fn convert_co_read(ImgConvertState *s, int64_t sector_num,
     return 0;
 }
 
+static void test_print(const char* op, unsigned long start, unsigned long len)
+{
+    printf("%s: 0x%012lx %012lx (len:0x%lx)\n", op, start, start + len-1, len);
+}
 
 static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
                                          int nb_sectors, uint8_t *buf,
@@ -1911,6 +1916,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
                 (s->compressed &&
                  !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)))
             {
+                test_print("WRITE", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS);
+
                 ret = blk_co_pwrite(s->target, sector_num << BDRV_SECTOR_BITS,
                                     n << BDRV_SECTOR_BITS, buf, flags);
                 if (ret < 0) {
@@ -1925,6 +1932,8 @@ static int coroutine_fn convert_co_write(ImgConvertState *s, int64_t sector_num,
                 assert(!s->target_has_backing);
                 break;
             }
+            test_print("ZERO", sector_num << BDRV_SECTOR_BITS, n << BDRV_SECTOR_BITS);
+
             ret = blk_co_pwrite_zeroes(s->target,
                                        sector_num << BDRV_SECTOR_BITS,
                                        n << BDRV_SECTOR_BITS,

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 12:00     ` Jan Kara
@ 2020-11-12 22:08       ` Darrick J. Wong
  2020-12-07 17:23         ` Maxim Levitsky
  0 siblings, 1 reply; 6+ messages in thread
From: Darrick J. Wong @ 2020-11-12 22:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Maxim Levitsky, qemu-devel, Kevin Wolf, Peter Lieven,
	Paolo Bonzini, Max Reitz, qemu-block, linux-block,
	Christoph Hellwig, Jens Axboe

On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> > 
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > > 
> > > The git-publish destroyed the cover letter:
> > > 
> > > For the reference this is for bz #1872633
> > > 
> > > The issue is that current kernel code that implements 'fallocate'
> > > on kernel block devices roughly works like that:
> > > 
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > >    (as far as I can see the discard doesn't go through the
> > >    page cache).
> > > 
> > > 3. Check if the page cache is dirty for this range,
> > >    if it is dirty (meaning that someone wrote to it meanwhile)
> > >    return -EBUSY.
> > > 
> > > This means that if qemu (or qemu-img) issues a write, and then
> > > discard to the area that shares a page, -EBUSY can be returned by
> > > the kernel.
> > 
> > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > kernel so we need to fix this.
> > 
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > 
> > Well, filesystem hole punching is somewhat different beast than block device
> > discard (at least implementation wise).
> > 
> > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > 
> > > It also can be racy unless special means are done to block IO from happening
> > > from qemu during this fsync.
> > > 
> > > This patch series contains two patches:
> > > 
> > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > technically enough to fail back to plain write in this case, but seems wrong.
> > > 
> > > And the second patch adds an optimization to qemu-img to avoid such a
> > > fragmented write/discard in the first place.
> > > 
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > > 
> > > What do you think?
> > 
> > So if the EBUSY error happens because something happened to the page cache
> > outside of discarded range (like you describe above), that is a kernel bug
> > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > discarded range while discard was running and userspace app has to deal
> > with that depending on what it aims to do...
> 
> So I was looking what it would take to fix this inside the kernel. The
> problem is that invalidate_inode_pages2_range() is working on page
> granularity and it is non-trivial to extend it to work on byte granularity
> since we don't support something like "try to reclaim part of a page". But
> I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
> instead of truncate_inode_pages_range() again? I mean the EBUSY detection
> cannot be reliable anyway and userspace has no way of knowing whether a
> write happened before discard or after it so just discarding data is fine
> from this point of view. Darrick?

Hmmm, I think I overlooked the fact that we can do buffered writes into
a block device's pagecache without taking any of the usual locks that
have to be held for filesystem files.  This is essentially a race
between a not-page-aligned fallocate and a buffered write to a different
sector that is mapped by a page that caches part of the fallocate range.

So yes, Jan is right that we need to use truncate_bdev_range instead of
invalidate_inode_pages2_range because the former will zero the sub-page
ranges on either end of the fallocate request instead of returning
-EBUSY because someone dirtied a part of a page that wasn't involved in
the fallocate operation.

I /probably/ just copy-pasta'd that invalidation call from directio
without thinking hard enough about it, sorry about that. :(

--D

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 15:38     ` Maxim Levitsky
@ 2020-11-13 10:07       ` Jan Kara
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2020-11-13 10:07 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Jan Kara, qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini,
	Max Reitz, Darrick J . Wong, qemu-block, linux-block,
	Christoph Hellwig, Jens Axboe

On Thu 12-11-20 17:38:36, Maxim Levitsky wrote:
> On Thu, 2020-11-12 at 12:19 +0100, Jan Kara wrote:
> > [added some relevant people and lists to CC]
> > 
> > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > clone of "starship_production"
> > > 
> > > The git-publish destroyed the cover letter:
> > > 
> > > For the reference this is for bz #1872633
> > > 
> > > The issue is that current kernel code that implements 'fallocate'
> > > on kernel block devices roughly works like that:
> > > 
> > > 1. Flush the page cache on the range that is about to be discarded.
> > > 2. Issue the discard and wait for it to finish.
> > >    (as far as I can see the discard doesn't go through the
> > >    page cache).
> > > 
> > > 3. Check if the page cache is dirty for this range,
> > >    if it is dirty (meaning that someone wrote to it meanwhile)
> > >    return -EBUSY.
> > > 
> > > This means that if qemu (or qemu-img) issues a write, and then
> > > discard to the area that shares a page, -EBUSY can be returned by
> > > the kernel.
> > 
> > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > kernel so we need to fix this.
> > 
> > > On the other hand, for example, the ext4 implementation of discard
> > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > 
> > Well, filesystem hole punching is somewhat different beast than block device
> > discard (at least implementation wise).
> > 
> > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > 
> > > It also can be racy unless special means are done to block IO from happening
> > > from qemu during this fsync.
> > > 
> > > This patch series contains two patches:
> > > 
> > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > technically enough to fail back to plain write in this case, but seems wrong.
> > > 
> > > And the second patch adds an optimization to qemu-img to avoid such a
> > > fragmented write/discard in the first place.
> > > 
> > > Both patches make the reproducer work for this particular bugzilla,
> > > but I don't think they are enough.
> > > 
> > > What do you think?
> > 
> > So if the EBUSY error happens because something happened to the page cache
> > outside of discarded range (like you describe above), that is a kernel bug
> > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > discarded range while discard was running and userspace app has to deal
> > with that depending on what it aims to do...
> I double checked this, those are the writes/discards according to my debug
> prints (I print start and then start+len-1 for each request)
> I have attached the patch for this for reference.
> 
> ZERO: 0x00007fe00000 00007fffefff (len:0x1ff000)
>        fallocate 00007fe00000 00007fffefff

Yeah, the end at 7ffff000 is indeed not 4k aligned...

> WRITE: 0x00007ffff000 00007ffffdff (len:0xe00)
>        write 00007ffff000 00007ffffdff

.. and this write is following discarded area in the same page
(7ffff000..7ffffdff).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT
  2020-11-12 22:08       ` Darrick J. Wong
@ 2020-12-07 17:23         ` Maxim Levitsky
  0 siblings, 0 replies; 6+ messages in thread
From: Maxim Levitsky @ 2020-12-07 17:23 UTC (permalink / raw)
  To: Darrick J. Wong, Jan Kara
  Cc: qemu-devel, Kevin Wolf, Peter Lieven, Paolo Bonzini, Max Reitz,
	qemu-block, linux-block, Christoph Hellwig, Jens Axboe

On Thu, 2020-11-12 at 14:08 -0800, Darrick J. Wong wrote:
> On Thu, Nov 12, 2020 at 01:00:56PM +0100, Jan Kara wrote:
> > On Thu 12-11-20 12:19:51, Jan Kara wrote:
> > > [added some relevant people and lists to CC]
> > > 
> > > On Wed 11-11-20 17:44:05, Maxim Levitsky wrote:
> > > > On Wed, 2020-11-11 at 17:39 +0200, Maxim Levitsky wrote:
> > > > > clone of "starship_production"
> > > > 
> > > > The git-publish destroyed the cover letter:
> > > > 
> > > > For the reference this is for bz #1872633
> > > > 
> > > > The issue is that current kernel code that implements 'fallocate'
> > > > on kernel block devices roughly works like that:
> > > > 
> > > > 1. Flush the page cache on the range that is about to be discarded.
> > > > 2. Issue the discard and wait for it to finish.
> > > >    (as far as I can see the discard doesn't go through the
> > > >    page cache).
> > > > 
> > > > 3. Check if the page cache is dirty for this range,
> > > >    if it is dirty (meaning that someone wrote to it meanwhile)
> > > >    return -EBUSY.
> > > > 
> > > > This means that if qemu (or qemu-img) issues a write, and then
> > > > discard to the area that shares a page, -EBUSY can be returned by
> > > > the kernel.
> > > 
> > > Indeed, if you don't submit PAGE_SIZE aligned discards, you can get back
> > > EBUSY which seems wrong to me. IMO we should handle this gracefully in the
> > > kernel so we need to fix this.
> > > 
> > > > On the other hand, for example, the ext4 implementation of discard
> > > > doesn't seem to be affected. It does take a lock on the inode to avoid
> > > > concurrent IO and flushes O_DIRECT writers prior to doing discard thought.
> > > 
> > > Well, filesystem hole punching is somewhat different beast than block device
> > > discard (at least implementation wise).
> > > 
> > > > Doing fsync and retrying is seems to resolve this issue, but it might be
> > > > a too big hammer.  Just retrying doesn't work, indicating that maybe the
> > > > code that flushes the page cache in (1) doesn't do this correctly ?
> > > > 
> > > > It also can be racy unless special means are done to block IO from happening
> > > > from qemu during this fsync.
> > > > 
> > > > This patch series contains two patches:
> > > > 
> > > > First patch just lets the file-posix ignore the -EBUSY errors, which is
> > > > technically enough to fail back to plain write in this case, but seems wrong.
> > > > 
> > > > And the second patch adds an optimization to qemu-img to avoid such a
> > > > fragmented write/discard in the first place.
> > > > 
> > > > Both patches make the reproducer work for this particular bugzilla,
> > > > but I don't think they are enough.
> > > > 
> > > > What do you think?
> > > 
> > > So if the EBUSY error happens because something happened to the page cache
> > > outside of discarded range (like you describe above), that is a kernel bug
> > > than needs to get fixed. EBUSY should really mean - someone wrote to the
> > > discarded range while discard was running and userspace app has to deal
> > > with that depending on what it aims to do...
> > 
> > So I was looking what it would take to fix this inside the kernel. The
> > problem is that invalidate_inode_pages2_range() is working on page
> > granularity and it is non-trivial to extend it to work on byte granularity
> > since we don't support something like "try to reclaim part of a page". But
> > I'm also somewhat wondering why we use invalidate_inode_pages2_range() here
> > instead of truncate_inode_pages_range() again? I mean the EBUSY detection
> > cannot be reliable anyway and userspace has no way of knowing whether a
> > write happened before discard or after it so just discarding data is fine
> > from this point of view. Darrick?
> 
> Hmmm, I think I overlooked the fact that we can do buffered writes into
> a block device's pagecache without taking any of the usual locks that
> have to be held for filesystem files.  This is essentially a race
> between a not-page-aligned fallocate and a buffered write to a different
> sector that is mapped by a page that caches part of the fallocate range.
> 
> So yes, Jan is right that we need to use truncate_bdev_range instead of
> invalidate_inode_pages2_range because the former will zero the sub-page
> ranges on either end of the fallocate request instead of returning
> -EBUSY because someone dirtied a part of a page that wasn't involved in
> the fallocate operation.
> 
> I /probably/ just copy-pasta'd that invalidation call from directio
> without thinking hard enough about it, sorry about that. :(

Any update on this?
 
Today I took a look at both truncate_bdev_range,
and at invalidate_inode_pages2_range.
 
 
I see that the truncate_bdev_range can't really fail other 
than check for a mounted
filesystem.
 
It calls truncate_inode_pages_range which writes back all
dirty pages and waits for writeback to finish.
So it won't detect dirty pages as invalidate_inode_pages2_range does
 
 
Also AFAIK the
kernel page cache indeed 
only tracks the dirtiness of a whole page (e.g a 512 byte write, 
will cause the whole page to be written back, unless the 
write was done using O_DIRECT)
 
So if a page, part of
which is discarded, becomes dirty we can't really
know if someone wrote to the discarded region, or only near it.
 
I vote to keep the call to invalidate_inode_pages2_range but
exclude the non page
aligned areas from it.
 
This way we will still do a best effort detection of this case,
while not causing any false positives.
 
If you agree, I'll send a patch for this.


Best regards,
	Maxim Levitsky

> 
> --D
> 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-12-07 17:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20201111153913.41840-1-mlevitsk@redhat.com>
     [not found] ` <03b01c699c9fab64736d04891f1e835aef06c886.camel@redhat.com>
2020-11-12 11:19   ` [PATCH 0/2] RFC: Issue with discards on raw block device without O_DIRECT Jan Kara
2020-11-12 12:00     ` Jan Kara
2020-11-12 22:08       ` Darrick J. Wong
2020-12-07 17:23         ` Maxim Levitsky
2020-11-12 15:38     ` Maxim Levitsky
2020-11-13 10:07       ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).