public inbox for linux-btrfs@vger.kernel.org
* LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
@ 2023-02-15 20:04 Chris Murphy
  2023-02-15 20:16 ` Chris Murphy
  2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis)
  0 siblings, 2 replies; 26+ messages in thread
From: Chris Murphy @ 2023-02-15 20:04 UTC (permalink / raw)
  To: Btrfs BTRFS

Downstream bug report, reproducer test file, and gdb session transcript
https://bugzilla.redhat.com/show_bug.cgi?id=2169947

I speculated that it might be similar to the issue we have with VMs when O_DIRECT is used, but it seems that's not the case here.



--
Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-15 20:04 LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 Chris Murphy
@ 2023-02-15 20:16 ` Chris Murphy
  2023-02-15 21:41   ` Filipe Manana
  2023-02-15 23:21   ` Boris Burkov
  2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis)
  1 sibling, 2 replies; 26+ messages in thread
From: Chris Murphy @ 2023-02-15 20:16 UTC (permalink / raw)
  To: Btrfs BTRFS



On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> Downstream bug report, reproducer test file, and gdb session transcript
> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
>
> I speculated that maybe it's similar to the issue we have with VM's 
> when O_DIRECT is used, but it seems that's not the case here.

I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.

Kernel 6.2-rc7 in my case; in the bug report, the 6.1, 6.0, and 5.17 kernel series also reproduce the problem.


-- 
Chris Murphy


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-15 20:16 ` Chris Murphy
@ 2023-02-15 21:41   ` Filipe Manana
  2023-02-15 23:21   ` Boris Burkov
  1 sibling, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2023-02-15 21:41 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, Feb 15, 2023 at 8:30 PM Chris Murphy <chris@colorremedies.com> wrote:
>
>
>
> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > Downstream bug report, reproducer test file, and gdb session transcript
> > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> >
> > I speculated that maybe it's similar to the issue we have with VM's
> > when O_DIRECT is used, but it seems that's not the case here.
>
> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
>
> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.

I'm able to reproduce it too.

I'll have a look at it, thanks.

>
>
> --
> Chris Murphy


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-15 20:16 ` Chris Murphy
  2023-02-15 21:41   ` Filipe Manana
@ 2023-02-15 23:21   ` Boris Burkov
  2023-02-16  0:34     ` Boris Burkov
  2023-02-16 10:05     ` Qu Wenruo
  1 sibling, 2 replies; 26+ messages in thread
From: Boris Burkov @ 2023-02-15 23:21 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> 
> 
> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > Downstream bug report, reproducer test file, and gdb session transcript
> > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> >
> > I speculated that maybe it's similar to the issue we have with VM's 
> > when O_DIRECT is used, but it seems that's not the case here.
> 
> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> 
> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> 

I was also able to reproduce it on the current misc-next branch. However,
when I hacked the kernel to always fall back from DIO to buffered IO, it
no longer reproduced. So that's one hint.

The next observation comes from comparing the happy vs unhappy file
extents on disk:
happy: https://pastebin.com/k4EPFKhc
unhappy: https://pastebin.com/hNSBR0yv

The broken DIO case is missing file extents between bytes 8192 and 65536
which corresponds to the observed zeros.

Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
instrumented that codepath. I observed a single successful write to 8192
bytes, then a second write which first does a partial write from 8192 to
65536 and then faults in the rest of the iov_iter and finishes the
write.

I'm now trying to figure out how these partial writes might lead us to
not create all the EXTENT_DATA items for the file extents.

Boris

> 
> -- 
> Chris Murphy


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-15 23:21   ` Boris Burkov
@ 2023-02-16  0:34     ` Boris Burkov
  2023-02-16  1:46       ` Boris Burkov
  2023-02-16 11:57       ` Filipe Manana
  2023-02-16 10:05     ` Qu Wenruo
  1 sibling, 2 replies; 26+ messages in thread
From: Boris Burkov @ 2023-02-16  0:34 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> > 
> > 
> > On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > > Downstream bug report, reproducer test file, and gdb session transcript
> > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > >
> > > I speculated that maybe it's similar to the issue we have with VM's 
> > > when O_DIRECT is used, but it seems that's not the case here.
> > 
> > I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> > 
> > kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> > 
> 
> I was also able to reproduce on the current misc-next. However, when I
> hacked the kernel to always fall back from DIO to buffered IO, it no
> longer reproduced. So that's one hint..
> 
> The next observation comes from comparing the happy vs unhappy file
> extents on disk:
> happy: https://pastebin.com/k4EPFKhc
> unhappy: https://pastebin.com/hNSBR0yv
> 
> The broken DIO case is missing file extents between bytes 8192 and 65536
> which corresponds to the observed zeros.
> 
> Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> instrumented that codepath. I observed a single successful write to 8192
> bytes, then a second write which first does a partial write from 8192 to
> 65536 and then faults in the rest of the iov_iter and finishes the
> write.
> 
> I'm now trying to figure out how these partial writes might lead us to
> not create all the EXTENT_DATA items for the file extents.

I believe the issue is indeed caused by faults reading the mapped region
during direct io. Roughly what is happening is:

- we start the dio write (offset 8192 len 1826816)
- __iomap_dio_rw calls iomap_iter which calls btrfs_dio_iomap_begin which
  creates an ordered extent for the full write.
- iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
  bytes and breaks out early, but submits the partial bio.
- the partial bio completes and calls the various endio callbacks,
  resulting in a call to btrfs_mark_ordered_io_finished.
- btrfs_mark_ordered_io_finished looks up the ordered extent and finds
  the full ordered extent, but the write that finished is partial, so
  the check for entry->bytes_left fails, and we don't call
  finish_ordered_fn and thus don't create a file extent item for this
  endio.
- the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
  (8192 + 57344) but we fully exit and re-enter __iomap_dio_rw, which
  creates a new ordered extent for off 65536 len 1769472 and that
  ordered extent proceeds as above but successfully, and we get the
  second file extent.

I'm not yet sure how to fix this, but have a couple ideas/questions:
1. Is there any way we can split off a partial ordered extent and finish
it when we get the partial write done?
2. Can we detect that there is an unfinished ordered extent that
overlaps with our new one on the second write of the partial write
logic?

I'll play around and see if I can hack together a fix..

> 
> Boris
> 
> > 
> > -- 
> > Chris Murphy


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16  0:34     ` Boris Burkov
@ 2023-02-16  1:46       ` Boris Burkov
  2023-02-16  5:58         ` Christoph Hellwig
  2023-02-16 11:57       ` Filipe Manana
  1 sibling, 1 reply; 26+ messages in thread
From: Boris Burkov @ 2023-02-16  1:46 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Wed, Feb 15, 2023 at 04:34:13PM -0800, Boris Burkov wrote:
> On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> > > 
> > > 
> > > On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > > > Downstream bug report, reproducer test file, and gdb session transcript
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > > >
> > > > I speculated that maybe it's similar to the issue we have with VM's 
> > > > when O_DIRECT is used, but it seems that's not the case here.
> > > 
> > > I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> > > 
> > > kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> > > 
> > 
> > I was also able to reproduce on the current misc-next. However, when I
> > hacked the kernel to always fall back from DIO to buffered IO, it no
> > longer reproduced. So that's one hint..
> > 
> > The next observation comes from comparing the happy vs unhappy file
> > extents on disk:
> > happy: https://pastebin.com/k4EPFKhc
> > unhappy: https://pastebin.com/hNSBR0yv
> > 
> > The broken DIO case is missing file extents between bytes 8192 and 65536
> > which corresponds to the observed zeros.
> > 
> > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> > instrumented that codepath. I observed a single successful write to 8192
> > bytes, then a second write which first does a partial write from 8192 to
> > 65536 and then faults in the rest of the iov_iter and finishes the
> > write.
> > 
> > I'm now trying to figure out how these partial writes might lead us to
> > not create all the EXTENT_DATA items for the file extents.
> 
> I believe the issue is indeed caused by faults reading the mapped region
> during direct io. Roughly what is happening is:
> 
> - we start the dio write (offset 8192 len 1826816)
> - __iomap_dio_rw calls iomap_iter which calls btrfs_dio_iomap_begin which
>   creates an ordered extent for the full write.
> - iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
>   bytes and breaks out early, but submits the partial bio.
> - the partial bio completes and calls the various endio callbacks,
>   resulting in a call to btrfs_mark_ordered_io_finished.
> - btrfs_mark_ordered_io_finished looks up the ordered extent and finds
>   the full ordered extent, but the write that finished is partial, so
>   the check for entry->bytes_left fails, and we don't call
>   finish_ordered_fn and thus don't create a file extent item for this
>   endio.
> - the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
>   (8192 + 57344) but we fully exit and re-enter __iomap_dio_rw, which
>   creates a new ordered extent for off 65536 len 1769472 and that
>   ordered extent proceeds as above but successfully, and we get the
>   second file extent.
> 
> I'm not yet sure how to fix this, but have a couple ideas/questions:
> 1. Is there anyway we can split off a partial ordered extent and finish
> it when we get the partial write done?
> 2. Can we detect that there is an unfinished ordered extent that
> overlaps with our new one on the second write of the partial write
> logic?
> 
> I'll play around and see if I can hack together a fix..

The following patch causes the problem to stop reproducing by splitting
the large ordered extent in the case of a short write and leaving it
alone otherwise. I haven't thoroughly tested it, or even thought it
through that well yet (e.g. I have no clue where that extract function
comes from!), but it's a start. I have to sign off for the evening, so
I will leave my investigation here for now.

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index d8b90f95b157..016b1a77af71 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -647,6 +647,11 @@ static bool btrfs_submit_chunk(struct bio *bio, int mirror_num)
        }

        if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
+
+               ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
+               if (ret)
+                       goto fail_put_bio;
+
                if (use_append) {
                        bio->bi_opf &= ~REQ_OP_WRITE;
                        bio->bi_opf |= REQ_OP_ZONE_APPEND;

> 
> > 
> > Boris
> > 
> > > 
> > > -- 
> > > Chris Murphy


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16  1:46       ` Boris Burkov
@ 2023-02-16  5:58         ` Christoph Hellwig
  2023-02-16  9:30           ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-16  5:58 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Chris Murphy, Btrfs BTRFS

On Wed, Feb 15, 2023 at 05:46:48PM -0800, Boris Burkov wrote:
> The following patch causes the problem to stop reproducing by splitting
> the large ordered extent in the case of a short write and leaving it
> alone otherwise. I haven't thoroughly tested it, or even thought it
> through that well yet (e.g. I have no clue where that extract function
> comes from!), but it's a start. I have to sign off for the evening, so
> I will leave my investigation here for now.

The extract version is used for writes to zoned devices, which need
to record a physical location on a per-write basis.  So what you've done
definitely works, as we already do it a little later for a subset of
writes.

I also think it fundamentally is the right thing to do and was planning
to do something similar for completely different reasons a little down
the road.

The downside right now is that you pay the price of an extra ordered
extent lookup for every submitted bio, and that an extra queue_work is
required for the completion.  In my WIP work the former will eventually
go away, and I've not found any benchmark impact from the latter.


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16  5:58         ` Christoph Hellwig
@ 2023-02-16  9:30           ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-16  9:30 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Chris Murphy, Btrfs BTRFS

On Wed, Feb 15, 2023 at 09:58:54PM -0800, Christoph Hellwig wrote:
> The extract version is used for writes to zoned devices, which need
> to record a physical location on a per-write basis.  So what you've done
> definitely works, as we already do it a little later for a subset of
> writes.
> 
> I also think it fundamentally is the right thing to do and was planning
> to do something similar for completely different reasons a little down
> the road.
> 
> The downside right now is that you pay the price of an extra ordered
> extent lookup for every submitted bio, and that an extra queue_work is
> required for the completion.  In my WIP work the former will eventually
> go away, and I've not found any benchmark impact from the latter.

Thinking out loud a bit more:

 - if we go down this route, calc_bio_boundaries also needs to set
   len_to_oe_boundary for non-zoned file systems to ensure bios don't
   span ordered extents.  They usually don't in my testing, and
   I have that change in my WIP tree for other reasons.
 - btrfs_extract_ordered_extent as-is is actually overkill for
   non-zoned devices.  We'll need a version of it that doesn't
   call split_zoned_em, as that is only needed for the zoned case.


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-15 23:21   ` Boris Burkov
  2023-02-16  0:34     ` Boris Burkov
@ 2023-02-16 10:05     ` Qu Wenruo
  2023-02-16 12:01       ` Filipe Manana
  1 sibling, 1 reply; 26+ messages in thread
From: Qu Wenruo @ 2023-02-16 10:05 UTC (permalink / raw)
  To: Boris Burkov, Chris Murphy; +Cc: Btrfs BTRFS



On 2023/2/16 07:21, Boris Burkov wrote:
> On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
>>
>>
>> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
>>> Downstream bug report, reproducer test file, and gdb session transcript
>>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
>>>
>>> I speculated that maybe it's similar to the issue we have with VM's
>>> when O_DIRECT is used, but it seems that's not the case here.
>>
>> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
>>
>> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
>>
> 
> I was also able to reproduce on the current misc-next. However, when I
> hacked the kernel to always fall back from DIO to buffered IO, it no
> longer reproduced. So that's one hint..
> 
> The next observation comes from comparing the happy vs unhappy file
> extents on disk:
> happy: https://pastebin.com/k4EPFKhc
> unhappy: https://pastebin.com/hNSBR0yv
> 
> The broken DIO case is missing file extents between bytes 8192 and 65536
> which corresponds to the observed zeros.
> 
> Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> instrumented that codepath. I observed a single successful write to 8192
> bytes, then a second write which first does a partial write from 8192 to
> 65536 and then faults in the rest of the iov_iter and finishes the
> write.

A little off-topic, as I'm not familiar with Direct IO at all.

Does that fault (I guess it means a page fault?) mean the Direct IO
source (the memory of the user space program) cannot be accessed right
now, e.g. because it was swapped out?

If that's the case, any idea why we cannot wait for the page fault to
fill in the user space memory?
I guess there's some deadlock, but I'm not sure.

Thanks,
Qu
> 
> I'm now trying to figure out how these partial writes might lead us to
> not create all the EXTENT_DATA items for the file extents.
> 
> Boris
> 
>>
>> -- 
>> Chris Murphy


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16  0:34     ` Boris Burkov
  2023-02-16  1:46       ` Boris Burkov
@ 2023-02-16 11:57       ` Filipe Manana
  2023-02-16 17:14         ` Boris Burkov
  1 sibling, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2023-02-16 11:57 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 12:42 AM Boris Burkov <boris@bur.io> wrote:
>
> On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> > >
> > >
> > > On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > > > Downstream bug report, reproducer test file, and gdb session transcript
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > > >
> > > > I speculated that maybe it's similar to the issue we have with VM's
> > > > when O_DIRECT is used, but it seems that's not the case here.
> > >
> > > I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> > >
> > > kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> > >
> >
> > I was also able to reproduce on the current misc-next. However, when I
> > hacked the kernel to always fall back from DIO to buffered IO, it no
> > longer reproduced. So that's one hint..
> >
> > The next observation comes from comparing the happy vs unhappy file
> > extents on disk:
> > happy: https://pastebin.com/k4EPFKhc
> > unhappy: https://pastebin.com/hNSBR0yv
> >
> > The broken DIO case is missing file extents between bytes 8192 and 65536
> > which corresponds to the observed zeros.
> >
> > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> > instrumented that codepath. I observed a single successful write to 8192
> > bytes, then a second write which first does a partial write from 8192 to
> > 65536 and then faults in the rest of the iov_iter and finishes the
> > write.
> >
> > I'm now trying to figure out how these partial writes might lead us to
> > not create all the EXTENT_DATA items for the file extents.
>
> I believe the issue is indeed caused by faults reading the mapped region
> during direct io. Roughly what is happening is:
>
> - we start the dio write (offset 8192 len 1826816)
> - __iomap_dio_rw calls iomap_iter which calls btrfs_dio_iomap_begin which
>   creates an ordered extent for the full write.
> - iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
>   bytes and breaks out early, but submits the partial bio.
> - the partial bio completes and calls the various endio callbacks,
>   resulting in a call to btrfs_mark_ordered_io_finished.
> - btrfs_mark_ordered_io_finished looks up the ordered extent and finds
>   the full ordered extent, but the write that finished is partial, so
>   the check for entry->bytes_left fails, and we don't call
>   finish_ordered_fn and thus don't create a file extent item for this
>   endio.
> - the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
>   (8192 + 57344) but we fully exit and re-enter __iomap_dio_rw, which
>   creates a new ordered extent for off 65536 len 1769472 and that
>   ordered extent proceeds as above but successfully, and we get the
>   second file extent.

So one thing doesn't add up here.

The first write attempt creates the ordered extent for the range
[8192, 1835008],
we then submit a bio only for the range [8192, 65535], due to the page
fault at 64K.

That means that the ordered extent can not be completed, as you said
before, as ->bytes_left will be > 0 even after the submitted bio
completes.

However, the second time we enter btrfs_dio_iomap_begin(), for the
range [65536, 1835008], when we call lock_extent_direct(), we should
find the previous ordered extent, for range [8192, 1835008], which
intersects this second range - and that should make us wait for the
previous ordered extent until it completes, by calling
btrfs_start_ordered_extent() at lock_extent_direct(). This is clearly
not happening, as it would cause a deadlock/hang.

So something is missing in your analysis, and that's important to
figure out the best solution to the problem.

Thanks.




>
> I'm not yet sure how to fix this, but have a couple ideas/questions:
> 1. Is there anyway we can split off a partial ordered extent and finish
> it when we get the partial write done?
> 2. Can we detect that there is an unfinished ordered extent that
> overlaps with our new one on the second write of the partial write
> logic?
>
> I'll play around and see if I can hack together a fix..
>
> >
> > Boris
> >
> > >
> > > --
> > > Chris Murphy


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16 10:05     ` Qu Wenruo
@ 2023-02-16 12:01       ` Filipe Manana
  2023-02-17  0:15         ` Qu Wenruo
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2023-02-16 12:01 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 10:23 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2023/2/16 07:21, Boris Burkov wrote:
> > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> >>
> >>
> >> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> >>> Downstream bug report, reproducer test file, and gdb session transcript
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> >>>
> >>> I speculated that maybe it's similar to the issue we have with VM's
> >>> when O_DIRECT is used, but it seems that's not the case here.
> >>
> >> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> >>
> >> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> >>
> >
> > I was also able to reproduce on the current misc-next. However, when I
> > hacked the kernel to always fall back from DIO to buffered IO, it no
> > longer reproduced. So that's one hint..
> >
> > The next observation comes from comparing the happy vs unhappy file
> > extents on disk:
> > happy: https://pastebin.com/k4EPFKhc
> > unhappy: https://pastebin.com/hNSBR0yv
> >
> > The broken DIO case is missing file extents between bytes 8192 and 65536
> > which corresponds to the observed zeros.
> >
> > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> > instrumented that codepath. I observed a single successful write to 8192
> > bytes, then a second write which first does a partial write from 8192 to
> > 65536 and then faults in the rest of the iov_iter and finishes the
> > write.
>
> A little off-topic, as I'm not familiar with Direct IO at all.
>
> Does that fault (I guess it means a page fault?) mean the Direct IO
> source (the memory of the user space program) cannot be accessed right
> now, e.g. because it was swapped out?
>
> If that's the case, any idea why we cannot wait for the page fault to
> fill in the user space memory?

Sure, you can call fault_in_iov_iter_readable() for the whole iov_iter
before starting the actual direct IO, and that will work without any
problems most of the time.

However:

1) After that, and before starting the write, some or all of the pages
may have been evicted from memory, e.g. due to memory pressure;

2) If the iov_iter refers to a very large buffer, it's inefficient to
fault in all the pages.

> I guess there's some deadlock, but I'm not sure.
>
> Thanks,
> Qu
> >
> > I'm now trying to figure out how these partial writes might lead us to
> > not create all the EXTENT_DATA items for the file extents.
> >
> > Boris
> >
> >>
> >> --
> >> Chris Murphy


* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16 11:57       ` Filipe Manana
@ 2023-02-16 17:14         ` Boris Burkov
  2023-02-16 18:00           ` Filipe Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Boris Burkov @ 2023-02-16 17:14 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 11:57:37AM +0000, Filipe Manana wrote:
> On Thu, Feb 16, 2023 at 12:42 AM Boris Burkov <boris@bur.io> wrote:
> >
> > On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> > > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> > > >
> > > >
> > > > On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > > > > Downstream bug report, reproducer test file, and gdb session transcript
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > > > >
> > > > > I speculated that maybe it's similar to the issue we have with VM's
> > > > > when O_DIRECT is used, but it seems that's not the case here.
> > > >
> > > > I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> > > >
> > > > kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> > > >
> > >
> > > I was also able to reproduce on the current misc-next. However, when I
> > > hacked the kernel to always fall back from DIO to buffered IO, it no
> > > longer reproduced. So that's one hint..
> > >
> > > The next observation comes from comparing the happy vs unhappy file
> > > extents on disk:
> > > happy: https://pastebin.com/k4EPFKhc
> > > unhappy: https://pastebin.com/hNSBR0yv
> > >
> > > The broken DIO case is missing file extents between bytes 8192 and 65536
> > > which corresponds to the observed zeros.
> > >
> > > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> > > instrumented that codepath. I observed a single successful write to 8192
> > > bytes, then a second write which first does a partial write from 8192 to
> > > 65536 and then faults in the rest of the iov_iter and finishes the
> > > write.
> > >
> > > I'm now trying to figure out how these partial writes might lead us to
> > > not create all the EXTENT_DATA items for the file extents.
> >
> > I believe the issue is indeed caused by faults reading the mapped region
> > during direct io. Roughly what is happening is:
> >
> > - we start the dio write (offset 8192 len 1826816)
> > - __iomap_dio_rw calls iomap_iter which calls btrfs_dio_iomap_begin which
> >   creates an ordered extent for the full write.
> > - iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
> >   bytes and breaks out early, but submits the partial bio.
> > - the partial bio completes and calls the various endio callbacks,
> >   resulting in a call to btrfs_mark_ordered_io_finished.
> > - btrfs_mark_ordered_io_finished looks up the ordered extent and finds
> >   the full ordered extent, but the write that finished is partial, so
> >   the check for entry->bytes_left fails, and we don't call
> >   finish_ordered_fn and thus don't create a file extent item for this
> >   endio.
> > - the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
> >   (8192 + 57344) but we fully exit and re-enter __iomap_dio_rw, which
> >   creates a new ordered extent for off 65536 len 1769472 and that
> >   ordered extent proceeds as above but successfully, and we get the
> >   second file extent.
> 
> So one thing doesn't add up here.
> 
> The first write attempt creates the ordered extent for the range
> [8192, 1835008],
> we then submit a bio only for the range [8192, 65535], due to the page
> fault at 64K.
> 
> That means that the ordered extent can not be completed, as you said
> before, as ->bytes_left will be > 0 even after the submitted bio
> completes.
> 
> However, the second time we enter btrfs_dio_iomap_begin(), for the
> range [65536, 1835008], when we call lock_extent_direct(), we should
> find the previous ordered extent, for range [8192, 1835008], which
> intersects this second range - and that should make us wait for the
> previous
> ordered extent until it completes by calling
> btrfs_start_ordered_extent() at lock_extent_direct(). This is clearly
> not happening, as it would cause
> a deadlock/hang.
> 
> So something is missing in your analysis, and that's important to
> figure out the best solution to the problem.
> 
> Thanks.
> 

Good point. I was confused about the extra ordered extents floating
around as well. I believe I have an explanation now. I think the
following annotated call chain should explain it decently, but the tl;dr
is that iomap_iter calls btrfs_dio_iomap_end, which marks the ordered
extent finished and clears the extent bits, but hits an error so it does
not do much other endio work (the error comes from
btrfs_mark_ordered_io_finished getting called without uptodate set).

btrfs_direct_write
  __iomap_dio_rw
    iomap_iter
      btrfs_dio_iomap_begin # creates the oe 8192 1835008
    iomap_dio_iter # submits the partial bio 8192 57344 (CB1)
      btrfs_dio_iomap_end
        btrfs_mark_ordered_io_finished 65536 1769472
        # ^ finds and munges the oe, decides bytes_left is 0 and queues
        # the full finish work (CB2)
    # partial write return
  __iomap_dio_rw # btrfs_direct_write tries again
    iomap_dio_iter
      btrfs_dio_iomap_begin # makes the next OE
    iomap_dio_iter # submits the remaining write 65536 1769472 (CB3)

btrfs_dio_end_io 8192 57344 (CB1)
# finds the full oe and bails, as discussed above

btrfs_finish_ordered_io 65536 1769472 (CB2)
# sees BTRFS_ORDERED_IOERR, exits, and clears the extent

btrfs_dio_end_io 65536 1769472 (CB3)
# succeeds on the new oe for that range and queues full finish work
# which writes an extent

In my opinion, the part of this that feels like it is not behaving
properly is the interaction between the iomap_iter logic of
__iomap_dio_rw and the implementation of btrfs_dio_iomap_begin/end.
Something about the begin/end calls the generic loop makes is leading
us to believe that the large extent was "finished", but with an error set. If
that is an appropriate way to handle the partial write, then maybe the
splitting I proposed earlier works as a fix. Otherwise we may need to
handle this case more explicitly in iomap_iter/btrfs_dio_iomap_end.
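To make that outcome concrete, here is a hedged user-space model of the endio behavior described above (the struct and function names are illustrative, not kernel API): only an ordered extent that completes without the error bit produces a file extent item, so the first OE, marked not-uptodate by the iomap_end path, leaves a hole that reads back as zeros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model, not kernel code: an ordered extent either
 * produces a file extent item on clean completion, or bails when the
 * error flag (standing in for BTRFS_ORDERED_IOERR, set by the
 * !uptodate finish) is present. */
struct oe_model {
    uint64_t start, len;
    bool ioerr;          /* stands in for BTRFS_ORDERED_IOERR */
    bool extent_written; /* a file extent item was created */
};

static void model_finish_ordered_io(struct oe_model *e)
{
    if (!e->ioerr)
        e->extent_written = true;
    /* on error: state is just cleared, no file extent item -> hole */
}
```

Running this with the two ordered extents from the trace above shows the first one never writing its extent item, matching the missing file extents between 8192 and 65536.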

> 
> 
> 
> >
> > I'm not yet sure how to fix this, but have a couple ideas/questions:
> > 1. Is there any way we can split off a partial ordered extent and finish
> > it when we get the partial write done?
> > 2. Can we detect that there is an unfinished ordered extent that
> > overlaps with our new one on the second write of the partial write
> > logic?
> >
> > I'll play around and see if I can hack together a fix..
> >
> > >
> > > Boris
> > >
> > > >
> > > > --
> > > > Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16 17:14         ` Boris Burkov
@ 2023-02-16 18:00           ` Filipe Manana
  2023-02-16 18:49             ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2023-02-16 18:00 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 5:15 PM Boris Burkov <boris@bur.io> wrote:
>
> On Thu, Feb 16, 2023 at 11:57:37AM +0000, Filipe Manana wrote:
> > On Thu, Feb 16, 2023 at 12:42 AM Boris Burkov <boris@bur.io> wrote:
> > >
> > > On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> > > > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> > > > >
> > > > >
> > > > > On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > > > > > Downstream bug report, reproducer test file, and gdb session transcript
> > > > > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > > > > >
> > > > > > I speculated that maybe it's similar to the issue we have with VM's
> > > > > > when O_DIRECT is used, but it seems that's not the case here.
> > > > >
> > > > > I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> > > > >
> > > > > kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> > > > >
> > > >
> > > > I was also able to reproduce on the current misc-next. However, when I
> > > > hacked the kernel to always fall back from DIO to buffered IO, it no
> > > > longer reproduced. So that's one hint..
> > > >
> > > > The next observation comes from comparing the happy vs unhappy file
> > > > extents on disk:
> > > > happy: https://pastebin.com/k4EPFKhc
> > > > unhappy: https://pastebin.com/hNSBR0yv
> > > >
> > > > The broken DIO case is missing file extents between bytes 8192 and 65536
> > > > which corresponds to the observed zeros.
> > > >
> > > > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> > > > instrumented that codepath. I observed a single successful write to 8192
> > > > bytes, then a second write which first does a partial write from 8192 to
> > > > 65536 and then faults in the rest of the iov_iter and finishes the
> > > > write.
> > > >
> > > > I'm now trying to figure out how these partial writes might lead us to
> > > > not create all the EXTENT_DATA items for the file extents.
> > >
> > > I believe the issue is indeed caused by faults reading the mapped region
> > > during direct io. Roughly what is happening is:
> > >
> > > - we start the dio write (offset 8192 len 1826816)
> > > - __iomap_dio_rw calls iomap_iter which calls btrfs_dio_iomap_begin which
> > >   creates an ordered extent for the full write.
> > > - iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
> > >   bytes and breaks out early, but submits the partial bio.
> > > - the partial bio completes and calls the various endio callbacks,
> > >   resulting in a call to btrfs_mark_ordered_io_finished.
> > > - btrfs_mark_ordered_io_finished looks up the ordered extent and finds
> > >   the full ordered extent, but the write that finished is partial, so
> > >   the check for entry->bytes_left fails, and we don't call
> > >   finish_ordered_fn and thus don't create a file extent item for this
> > >   endio.
> > > - the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
> > >   (8192 + 57344) but we fully exit and re-enter __iomap_dio_rw, which
> > >   creates a new ordered extent for off 65536 len 1769472 and that
> > >   ordered extent proceeds as above but successfully, and we get the
> > >   second file extent.
> >
> > So one thing doesn't add up here.
> >
> > The first write attempt creates the ordered extent for the range
> > [8192, 1835008],
> > we then submit a bio only for the range [8192, 65535], due to the page
> > fault at 64K.
> >
> > That means that the ordered extent can not be completed, as you said
> > before, as ->bytes_left will be > 0 even after the submitted bio
> > completes.
> >
> > However, the second time we enter btrfs_dio_iomap_begin(), for the
> > range [65536, 1835008], when we call lock_extent_direct(), we should
> > find the previous ordered extent, for range [8192, 1835008], which
> > intersects this second range - and that should make us wait for the
> > previous
> > ordered extent until it completes by calling
> > btrfs_start_ordered_extent() at lock_extent_direct(). This is clearly
> > not happening, as it would cause
> > a deadlock/hang.
> >
> > So something is missing in your analysis, and that's important to
> > figure out the best solution to the problem.
> >
> > Thanks.
> >
>
> Good point. I was confused about the extra ordered extents floating
> around as well. I believe I do have an explanation now. I think the
> following annotated callchain should explain it decently, but the tl;dr
> is that iomap_iter calls btrfs_dio_iomap_end which marks the ordered
> extent finished and clears the extent bits, but hits an error so it does
> not do much other endio work. (the error comes from
> mark_ordered_io_finished getting called without uptodate set)
>
> btrfs_direct_write
>   __iomap_dio_rw
>     iomap_iter
>       btrfs_dio_iomap_begin # creates the oe 8192 1835008
>     iomap_dio_iter # submits the partial bio 8192 57344 (CB1)
>       btrfs_dio_iomap_end
>         btrfs_mark_ordered_io_finished 65536 1769472
>         # ^ finds and munges the oe, decides bytes_left is 0 and queues
>         # the full finish work (CB2)
>     # partial write return
>   __iomap_dio_rw # btrfs_direct_write tries again
>     iomap_dio_iter
>       btrfs_dio_iomap_begin # makes the next OE
>     iomap_dio_iter # submits the remaining write 65536 1769472 (CB3)
>
> btrfs_dio_end_io 8192 57344 (CB1)
> # finds the full oe and bails, as discussed above
>
> btrfs_finish_ordered_io 65536 1769472 (CB2)
> # sees BTRFS_ORDERED_IOERR exits and clears the extent
>
> btrfs_dio_end_io 65536 1769472 (CB3)
> # succeeds on the new oe for that range and queues full finish work
> # which writes an extent
>
> In my opinion, the part of this that feels like it is not behaving
> properly is the interaction between the iomap_iter logic of
> __iomap_dio_rw and the implementation of btrfs_dio_iomap_begin/end.
> Something about the begin/end calls the generic loop makes is leading
> us to believe that the large extent was "finished", but with an error set. If
> that is an appropriate way to handle the partial write, then maybe the
> splitting I proposed earlier works as a fix. Otherwise we may need to
> handle this case more explicitly in iomap_iter/btrfs_dio_iomap_end.

Ok, so the problem is btrfs_dio_iomap_end() detects the submitted
amount is less than expected, so it marks the ordered extents as not
up to date, setting the BTRFS_ORDERED_IOERR bit on it.
That results in having an unexpected hole for the range [8192, 65535],
and no error returned to btrfs_direct_write().

My initial thought was to truncate the ordered extent at
btrfs_dio_iomap_end(), similar to what we do at
btrfs_invalidate_folio().
I think that should work; however, we would end up with a bookend
extent (as does your proposed fix), and I don't see an easy way to
get around that.
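To illustrate why splitting leaves a bookend extent, here is a hedged user-space sketch (the structs loosely mirror the shape of a file extent item, but the field and function names are made up for illustration): the completed slice's file extent item references only part of the originally allocated disk extent, leaving an unreferenced tail.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model, not kernel structures: a file extent item that
 * references a slice of a larger allocated disk extent is a "bookend"
 * extent - the unreferenced remainder still occupies space. */
struct disk_extent {
    uint64_t bytenr, len;           /* the whole allocation */
};

struct file_extent_item {
    uint64_t file_off;
    uint64_t disk_bytenr, disk_len; /* whole allocation */
    uint64_t offset, num_bytes;     /* referenced slice of it */
};

static struct file_extent_item bookend_slice(const struct disk_extent *de,
                                             uint64_t file_off,
                                             uint64_t slice_off,
                                             uint64_t slice_len)
{
    struct file_extent_item fi = {
        .file_off = file_off,
        .disk_bytenr = de->bytenr,
        .disk_len = de->len,
        .offset = slice_off,
        .num_bytes = slice_len,
    };
    return fi;
}
```

With the numbers from this thread, splitting at the submitted boundary would reference 57344 bytes of a 1826816-byte allocation, so most of the allocation hangs off the item unreferenced.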

Thanks.


>
> >
> >
> >
> > >
> > > I'm not yet sure how to fix this, but have a couple ideas/questions:
> > > 1. Is there any way we can split off a partial ordered extent and finish
> > > it when we get the partial write done?
> > > 2. Can we detect that there is an unfinished ordered extent that
> > > overlaps with our new one on the second write of the partial write
> > > logic?
> > >
> > > I'll play around and see if I can hack together a fix..
> > >
> > > >
> > > > Boris
> > > >
> > > > >
> > > > > --
> > > > > Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16 18:00           ` Filipe Manana
@ 2023-02-16 18:49             ` Christoph Hellwig
  2023-02-16 21:43               ` Filipe Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2023-02-16 18:49 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 06:00:08PM +0000, Filipe Manana wrote:
> Ok, so the problem is btrfs_dio_iomap_end() detects the submitted
> amount is less than expected, so it marks the ordered extents as not
> up to date, setting the BTRFS_ORDERED_IOERR bit on it.
> That results in having an unexpected hole for the range [8192, 65535],
> and no error returned to btrfs_direct_write().
> 
> My initial thought was to truncate the ordered extent at
> btrfs_dio_iomap_end(), similar to what we do at
> btrfs_invalidate_folio().
> I think that should work, however we would end up with a bookend
> extent (but so does your proposed fix), but I don't see an easy way to
> get around that.

Wouldn't a better way to handle this be to cache the ordered_extent in
the btrfs_dio_data, and just reuse it on the next iteration if present
and covering the range?
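Something along these lines, as a hedged user-space sketch (the dio_data/ordered-extent structures and the helper here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative model: cache the ordered extent in per-dio state and
 * reuse it on a later iteration when it still covers the range,
 * instead of creating (and error-finishing) a fresh one. */
struct oe_m { uint64_t start, len; };
struct dio_data_m { struct oe_m *cached_oe; };

static struct oe_m *dio_get_oe(struct dio_data_m *d, uint64_t off, uint64_t len)
{
    struct oe_m *e = d->cached_oe;

    if (e && off >= e->start && off + len <= e->start + e->len)
        return e; /* reuse: skip extent locking/allocation */

    e = malloc(sizeof(*e));
    e->start = off;
    e->len = len;
    d->cached_oe = e;
    return e;
}
```

On the retry after a page fault, the second range is fully covered by the cached ordered extent, so the same one is handed back.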

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16 18:49             ` Christoph Hellwig
@ 2023-02-16 21:43               ` Filipe Manana
  2023-02-16 22:45                 ` Boris Burkov
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2023-02-16 21:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 6:49 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Feb 16, 2023 at 06:00:08PM +0000, Filipe Manana wrote:
> > Ok, so the problem is btrfs_dio_iomap_end() detects the submitted
> > amount is less than expected, so it marks the ordered extents as not
> > up to date, setting the BTRFS_ORDERED_IOERR bit on it.
> > That results in having an unexpected hole for the range [8192, 65535],
> > and no error returned to btrfs_direct_write().
> >
> > My initial thought was to truncate the ordered extent at
> > btrfs_dio_iomap_end(), similar to what we do at
> > btrfs_invalidate_folio().
> > I think that should work, however we would end up with a bookend
> > extent (but so does your proposed fix), but I don't see an easy way to
> > get around that.
>
> Wouldn't a better way to handle this be to cache the ordered_extent in
> the btrfs_dio_data, and just reuse it on the next iteration if present
> and covering the range?

That may work too, yes.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16 21:43               ` Filipe Manana
@ 2023-02-16 22:45                 ` Boris Burkov
  2023-02-17 11:19                   ` Filipe Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Boris Burkov @ 2023-02-16 22:45 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Christoph Hellwig, Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 09:43:03PM +0000, Filipe Manana wrote:
> On Thu, Feb 16, 2023 at 6:49 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, Feb 16, 2023 at 06:00:08PM +0000, Filipe Manana wrote:
> > > Ok, so the problem is btrfs_dio_iomap_end() detects the submitted
> > > amount is less than expected, so it marks the ordered extents as not
> > > up to date, setting the BTRFS_ORDERED_IOERR bit on it.
> > > That results in having an unexpected hole for the range [8192, 65535],
> > > and no error returned to btrfs_direct_write().
> > >
> > > My initial thought was to truncate the ordered extent at
> > > btrfs_dio_iomap_end(), similar to what we do at
> > > btrfs_invalidate_folio().
> > > I think that should work, however we would end up with a bookend
> > > extent (but so does your proposed fix), but I don't see an easy way to
> > > get around that.
> >
> > Wouldn't a better way to handle this be to cache the ordered_extent in
> > the btrfs_dio_data, and just reuse it on the next iteration if present
> > and covering the range?
> 
> That may work too, yes.

Quick update, I just got a preliminary version of this proposal working:
- reuse btrfs_dio_data across calls to __iomap_dio_rw
- store the dio ordered_extent when we create it in btrfs_dio_iomap_begin
- modify btrfs_dio_iomap_end to not mark the unfinished ios done in the
  incomplete case. (and to drop the ordered extent on done or error)
- modify btrfs_dio_iomap_begin to short-circuit when it has a cached
  ordered_extent

The resulting behavior on this workload is:
- write 8192
- finish OE, write file extent
- write 57344 (no extent, cached OE)
- re-enter __iomap_dio_rw with a live OE
- skip locking extent, reserving space, etc.
- write 1769472
- finish OE, write file extent

and the file looks as if there were no partial write. I think this is a
good structure for a fix to this bug, and plan to polish it up and send
it soon, unless someone objects and thinks we should go a different way.
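The resulting accounting can be modeled like this (hedged user-space sketch with illustrative names, not kernel code): with the ordered extent kept alive across the __iomap_dio_rw calls, the partial and retried writes both count against the same OE, which only finishes, and writes its single file extent, once all bytes are in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the fixed flow: one ordered extent survives
 * the partial write and is completed incrementally. */
struct oe_fix {
    uint64_t start, len, bytes_left;
    bool extent_written;
};

static void oe_fix_complete(struct oe_fix *e, uint64_t len)
{
    e->bytes_left -= len;
    if (e->bytes_left == 0)
        e->extent_written = true; /* single file extent, no hole */
}
```

Feeding it the partial bio (57344 bytes) and the retried remainder (1769472 bytes) completes the full 1826816-byte ordered extent in one piece, matching "the file looks as if there were no partial write".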

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16 12:01       ` Filipe Manana
@ 2023-02-17  0:15         ` Qu Wenruo
  2023-02-17 11:38           ` Filipe Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Qu Wenruo @ 2023-02-17  0:15 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS



On 2023/2/16 20:01, Filipe Manana wrote:
> On Thu, Feb 16, 2023 at 10:23 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2023/2/16 07:21, Boris Burkov wrote:
>>> On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
>>>>
>>>>
>>>> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
>>>>> Downstream bug report, reproducer test file, and gdb session transcript
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
>>>>>
>>>>> I speculated that maybe it's similar to the issue we have with VM's
>>>>> when O_DIRECT is used, but it seems that's not the case here.
>>>>
>>>> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
>>>>
>>>> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
>>>>
>>>
>>> I was also able to reproduce on the current misc-next. However, when I
>>> hacked the kernel to always fall back from DIO to buffered IO, it no
>>> longer reproduced. So that's one hint..
>>>
>>> The next observation comes from comparing the happy vs unhappy file
>>> extents on disk:
>>> happy: https://pastebin.com/k4EPFKhc
>>> unhappy: https://pastebin.com/hNSBR0yv
>>>
>>> The broken DIO case is missing file extents between bytes 8192 and 65536
>>> which corresponds to the observed zeros.
>>>
>>> Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
>>> instrumented that codepath. I observed a single successful write to 8192
>>> bytes, then a second write which first does a partial write from 8192 to
>>> 65536 and then faults in the rest of the iov_iter and finishes the
>>> write.
>>
>> A little off-topic, as I'm not familiar with Direct IO at all.
>>
>> That fault (I guess means page fault?) means the Direct IO source
>> (memory of the user space program) can not be accessible right now?
>> (being swapped?)
>>
>> If that's the case, any idea why we can not wait for the page fault to
>> fill the user space memory?
> 
> Sure, you can call fault_in_iov_iter_readable() for the whole iov_iter
> before starting the actual direct IO,
> and that will work and not result in any problems most of the time.
> 
> However:
> 
> 1) After that and before starting the write, all pages or some pages
> may have been evicted from memory due to memory pressure for e.g.;
> 
> 2) If the iov_iter refers to a very large buffer, it's inefficient to
> fault in all pages.

Indeed. Even if we fault in all the pages, there is no guarantee they
stay resident unless the user space program pins them in advance.

Another question: can we fall back to buffered IO halfway through?
E.g. if we hit a page fault for a certain range, could we fall back to
buffered IO for the remaining part?

Thanks,
Qu

> 
>> I guess it's some deadlock but is not for sure.
>>
>> Thanks,
>> Qu
>>>
>>> I'm now trying to figure out how these partial writes might lead us to
>>> not create all the EXTENT_DATA items for the file extents.
>>>
>>> Boris
>>>
>>>>
>>>> --
>>>> Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-16 22:45                 ` Boris Burkov
@ 2023-02-17 11:19                   ` Filipe Manana
  0 siblings, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2023-02-17 11:19 UTC (permalink / raw)
  To: Boris Burkov; +Cc: Christoph Hellwig, Chris Murphy, Btrfs BTRFS

On Thu, Feb 16, 2023 at 10:45 PM Boris Burkov <boris@bur.io> wrote:
>
> On Thu, Feb 16, 2023 at 09:43:03PM +0000, Filipe Manana wrote:
> > On Thu, Feb 16, 2023 at 6:49 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Thu, Feb 16, 2023 at 06:00:08PM +0000, Filipe Manana wrote:
> > > > Ok, so the problem is btrfs_dio_iomap_end() detects the submitted
> > > > amount is less than expected, so it marks the ordered extents as not
> > > > up to date, setting the BTRFS_ORDERED_IOERR bit on it.
> > > > That results in having an unexpected hole for the range [8192, 65535],
> > > > and no error returned to btrfs_direct_write().
> > > >
> > > > My initial thought was to truncate the ordered extent at
> > > > btrfs_dio_iomap_end(), similar to what we do at
> > > > btrfs_invalidate_folio().
> > > > I think that should work, however we would end up with a bookend
> > > > extent (but so does your proposed fix), but I don't see an easy way to
> > > > get around that.
> > >
> > > Wouldn't a better way to handle this be to cache the ordered_extent in
> > > the btrfs_dio_data, and just reuse it on the next iteration if present
> > > and covering the range?
> >
> > That may work too, yes.
>
> Quick update, I just got a preliminary version of this proposal working:
> - reuse btrfs_dio_data across calls to __iomap_dio_rw
> - store the dio ordered_extent when we create it in btrfs_dio_iomap_begin
> - modify btrfs_dio_iomap_end to not mark the unfinished ios done in the
>   incomplete case. (and to drop the ordered extent on done or error)
> - modify btrfs_dio_iomap_begin to short-circuit when it has a cached
>   ordered_extent
>
> The resulting behavior on this workload is:
> - write 8192
> - finish OE, write file extent
> - write 57344 (no extent, cached OE)
> - re-enter __iomap_dio_rw with a live OE
> - skip locking extent, reserving space, etc.
> - write 1769472
> - finish OE, write file extent
>
> and the file looks as if there were no partial write. I think this is a
> good structure for a fix to this bug, and plan to polish it up and send
> it soon, unless someone objects and thinks we should go a different way.

Sounds good to me. Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-17  0:15         ` Qu Wenruo
@ 2023-02-17 11:38           ` Filipe Manana
  0 siblings, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2023-02-17 11:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Boris Burkov, Chris Murphy, Btrfs BTRFS

On Fri, Feb 17, 2023 at 12:15 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2023/2/16 20:01, Filipe Manana wrote:
> > On Thu, Feb 16, 2023 at 10:23 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>
> >>
> >>
> >> On 2023/2/16 07:21, Boris Burkov wrote:
> >>> On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> >>>>
> >>>>
> >>>> On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> >>>>> Downstream bug report, reproducer test file, and gdb session transcript
> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> >>>>>
> >>>>> I speculated that maybe it's similar to the issue we have with VM's
> >>>>> when O_DIRECT is used, but it seems that's not the case here.
> >>>>
> >>>> I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> >>>>
> >>>> kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> >>>>
> >>>
> >>> I was also able to reproduce on the current misc-next. However, when I
> >>> hacked the kernel to always fall back from DIO to buffered IO, it no
> >>> longer reproduced. So that's one hint..
> >>>
> >>> The next observation comes from comparing the happy vs unhappy file
> >>> extents on disk:
> >>> happy: https://pastebin.com/k4EPFKhc
> >>> unhappy: https://pastebin.com/hNSBR0yv
> >>>
> >>> The broken DIO case is missing file extents between bytes 8192 and 65536
> >>> which corresponds to the observed zeros.
> >>>
> >>> Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> >>> instrumented that codepath. I observed a single successful write to 8192
> >>> bytes, then a second write which first does a partial write from 8192 to
> >>> 65536 and then faults in the rest of the iov_iter and finishes the
> >>> write.
> >>
> >> A little off-topic, as I'm not familiar with Direct IO at all.
> >>
> >> That fault (I guess means page fault?) means the Direct IO source
> >> (memory of the user space program) can not be accessible right now?
> >> (being swapped?)
> >>
> >> If that's the case, any idea why we can not wait for the page fault to
> >> fill the user space memory?
> >
> > Sure, you can call fault_in_iov_iter_readable() for the whole iov_iter
> > before starting the actual direct IO,
> > and that will work and not result in any problems most of the time.
> >
> > However:
> >
> > 1) After that and before starting the write, all pages or some pages
> > may have been evicted from memory due to memory pressure for e.g.;
> >
> > 2) If the iov_iter refers to a very large buffer, it's inefficient to
> > fault in all pages.
>
> Indeed. Even if we fault in all the pages, there is no guarantee they
> stay resident unless the user space program pins them in advance.
>
> Another question: can we fall back to buffered IO halfway through?
> E.g. if we hit a page fault for a certain range, could we fall back to
> buffered IO for the remaining part?

That doesn't solve any problem in this case.
It would still result in the partial ordered extent getting marked
with an IO error and result in the hole for the partially completed
range.

>
> Thanks,
> Qu
>
> >
> >> I guess it's some deadlock but is not for sure.
> >>
> >> Thanks,
> >> Qu
> >>>
> >>> I'm now trying to figure out how these partial writes might lead us to
> >>> not create all the EXTENT_DATA items for the file extents.
> >>>
> >>> Boris
> >>>
> >>>>
> >>>> --
> >>>> Chris Murphy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-02-15 20:04 LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 Chris Murphy
  2023-02-15 20:16 ` Chris Murphy
@ 2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis)
  2023-04-06 15:47   ` David Sterba
  1 sibling, 1 reply; 26+ messages in thread
From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-04-05 13:07 UTC (permalink / raw)
  To: Btrfs BTRFS, Linux kernel regressions list

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 15.02.23 21:04, Chris Murphy wrote:
> Downstream bug report, reproducer test file, and gdb session transcript
> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> 
> I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.

To properly track this, let me add this report as well to the tracking
(I already track another report not mentioned in the commit log of the
proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )

#regzbot ^introduced 51bd9563b678
#regzbot dup-of:
https://lore.kernel.org/all/efc853aa-0592-e43d-3ad1-c42d33f0ab6b@leemhuis.info/
#regzbot title btrfs: mdb_copy produces a corrupt database on btrfs
#regzbot monitor
https://lore.kernel.org/all/b7c72ffeb725c2a359965655df9827bdbeebfb4e.1679512207.git.boris@bur.io/
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis)
@ 2023-04-06 15:47   ` David Sterba
  2023-04-06 22:40     ` Neal Gompa
  2023-04-07  6:10     ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 2 replies; 26+ messages in thread
From: David Sterba @ 2023-04-06 15:47 UTC (permalink / raw)
  To: Linux regressions mailing list; +Cc: Btrfs BTRFS, boris

On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
> [TLDR: I'm adding this report to the list of tracked Linux kernel
> regressions; the text you find below is based on a few templates
> paragraphs you might have encountered already in similar form.
> See link in footer if these mails annoy you.]
> 
> On 15.02.23 21:04, Chris Murphy wrote:
> > Downstream bug report, reproducer test file, and gdb session transcript
> > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > 
> > I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
> 
> To properly track this, let me add this report as well to the tracking
> (I already track another report not mentioned in the commit log of the
> proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )

There were several iterations of the fix and the final version is
actually 11 patches (below), and it does not apply cleanly to current
master because of other cleanups.

Given that it's fixing a corruption, it should be merged and backported
(at least to 6.1), but we may need to rework it again and minimize/drop
the cleanups.

a8e793f97686 btrfs: add function to create and return an ordered extent
b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
87606cb305ca btrfs: split partial dio bios before submit

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-06 15:47   ` David Sterba
@ 2023-04-06 22:40     ` Neal Gompa
  2023-04-07  6:10     ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 0 replies; 26+ messages in thread
From: Neal Gompa @ 2023-04-06 22:40 UTC (permalink / raw)
  To: dsterba; +Cc: Linux regressions mailing list, Btrfs BTRFS, boris

On Thu, Apr 6, 2023 at 11:52 AM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
> > [TLDR: I'm adding this report to the list of tracked Linux kernel
> > regressions; the text you find below is based on a few templates
> > paragraphs you might have encountered already in similar form.
> > See link in footer if these mails annoy you.]
> >
> > On 15.02.23 21:04, Chris Murphy wrote:
> > > Downstream bug report, reproducer test file, and gdb session transcript
> > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > >
> > > I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
> >
> > To properly track this, let me add this report as well to the tracking
> > (I already track another report not mentioned in the commit log of the
> > proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )
>
> There were several iterations of the fix and the final version is
> actually 11 patches (below), and it does not apply cleanly to current
> master because of other cleanups.
>
> Given that it's fixing a corruption it should be merged and backported
> (at least to 6.1), but we may need to rework it again and minimize/drop
> the cleanups.
>
> a8e793f97686 btrfs: add function to create and return an ordered extent
> b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
> c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
> d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
> 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
> 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
> 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
> 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
> 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
> df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
> 87606cb305ca btrfs: split partial dio bios before submit

At least from Fedora's perspective, we've shifted to the 6.2 kernel
series now. If it's reasonable to get this backported there ASAP,
that'll help Fedora (where this was discovered).




--
真実はいつも一つ!/ Always, there's only one truth!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-06 15:47   ` David Sterba
  2023-04-06 22:40     ` Neal Gompa
@ 2023-04-07  6:10     ` Linux regression tracking (Thorsten Leemhuis)
  2023-04-08  0:08       ` Boris Burkov
  2023-04-11 19:27       ` David Sterba
  1 sibling, 2 replies; 26+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-07  6:10 UTC (permalink / raw)
  To: dsterba, Linux regressions mailing list; +Cc: Btrfs BTRFS, boris

On 06.04.23 17:47, David Sterba wrote:
> On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>> regressions; the text you find below is based on a few templates
>> paragraphs you might have encountered already in similar form.
>> See link in footer if these mails annoy you.]
>>
>> On 15.02.23 21:04, Chris Murphy wrote:
>>> Downstream bug report, reproducer test file, and gdb session transcript
>>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
>>>
>>> I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
>>
>> To properly track this, let me add this report as well to the tracking
>> (I already track another report not mentioned in the commit log of the
>> proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )
> 
> There were several iterations of the fix and the final version is
> actually 11 patches (below), and it does not apply cleanly to current
> master because of other cleanups.
> 
> Given that it's fixing a corruption it should be merged and backported
> (at least to 6.1), but we may need to rework it again and minimize/drop
> the cleanups.
> 
> a8e793f97686 btrfs: add function to create and return an ordered extent
> b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
> c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
> d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
> 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
> 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
> 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
> 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
> 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
> df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
> 87606cb305ca btrfs: split partial dio bios before submit

David, many thx for the update; and thx Boris for all your work on this.

I kept a loose eye on this and noticed that fixing this turned out to be
quite complicated and thus required quite a bit of time. Well, that's a
bit unfortunate, but how it is sometimes, so nothing to worry about.
Makes me wonder if "revert the culprit temporarily, to get this fixed
quickly" was really properly evaluated in this case (if it was, sorry, I
might have missed it or forgotten). But whatever, for this particular
regression that's afaics not worth discussing anymore at this point.

Again, thx to everyone involved helping to get this fixed.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

P.S.: let me use this opportunity to update the regzbot status

#regzbot fix: btrfs: split partial dio bios before submit

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-07  6:10     ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-04-08  0:08       ` Boris Burkov
  2023-04-11 19:27       ` David Sterba
  1 sibling, 0 replies; 26+ messages in thread
From: Boris Burkov @ 2023-04-08  0:08 UTC (permalink / raw)
  To: Linux regressions mailing list; +Cc: dsterba, Btrfs BTRFS

On Fri, Apr 07, 2023 at 08:10:24AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 06.04.23 17:47, David Sterba wrote:
> > On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
> >> [TLDR: I'm adding this report to the list of tracked Linux kernel
> >> regressions; the text you find below is based on a few templates
> >> paragraphs you might have encountered already in similar form.
> >> See link in footer if these mails annoy you.]
> >>
> >> On 15.02.23 21:04, Chris Murphy wrote:
> >>> Downstream bug report, reproducer test file, and gdb session transcript
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> >>>
> >>> I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
> >>
> >> To properly track this, let me add this report as well to the tracking
> >> (I already track another report not mentioned in the commit log of the
> >> proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )
> > 
> > There were several iterations of the fix and the final version is
> > actually 11 patches (below), and it does not apply cleanly to current
> > master because of other cleanups.
> > 
> > Given that it's fixing a corruption it should be merged and backported
> > (at least to 6.1), but we may need to rework it again and minimize/drop
> > the cleanups.
> > 
> > a8e793f97686 btrfs: add function to create and return an ordered extent
> > b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
> > c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
> > d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
> > 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
> > 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
> > 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
> > 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
> > 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
> > df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
> > 87606cb305ca btrfs: split partial dio bios before submit
> 
> David, many thx for the update; and thx Boris for all your work on this.
> 
> I kept a loose eye on this and noticed that fixing this turned out to be
> quite complicated and thus required quite a bit of time. Well, that's a
> bit unfortunate, but how it is sometimes, so nothing to worry about.
> Makes me wonder if "revert the culprit temporarily, to get this fixed
> quickly" was really properly evaluated in this case (if it was, sorry, I
> might have missed it or forgotten). But whatever, for this particular
> regression that's afaics not worth discussing anymore at this point.

In this case, the broken patch was itself a serious fix for a deadlock,
so I worried that reverting it would be "out of the frying pan and into
the fire," and it never really entered my mind as an option.

I'll try to keep that option top of mind next time, though, as it did
take a while to iterate towards a (hopefully) successful fix.

Also, thanks for linking that other report. I had no clue that someone
had already identified the culprit patch on 2/15. I wasn't sure of it
myself until the next day, when I finished root-causing the bug, IIRC.

Thanks,
Boris

> 
> Again, thx to everyone involved helping to get this fixed.
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 
> P.S.: let me use this opportunity to update the regzbot status
> 
> #regzbot fix: btrfs: split partial dio bios before submit

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-07  6:10     ` Linux regression tracking (Thorsten Leemhuis)
  2023-04-08  0:08       ` Boris Burkov
@ 2023-04-11 19:27       ` David Sterba
  2023-04-12  9:57         ` Linux regression tracking (Thorsten Leemhuis)
  1 sibling, 1 reply; 26+ messages in thread
From: David Sterba @ 2023-04-11 19:27 UTC (permalink / raw)
  To: Linux regressions mailing list; +Cc: dsterba, Btrfs BTRFS, boris

On Fri, Apr 07, 2023 at 08:10:24AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 06.04.23 17:47, David Sterba wrote:
> > On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
> >> [TLDR: I'm adding this report to the list of tracked Linux kernel
> >> regressions; the text you find below is based on a few templates
> >> paragraphs you might have encountered already in similar form.
> >> See link in footer if these mails annoy you.]
> >>
> >> On 15.02.23 21:04, Chris Murphy wrote:
> >>> Downstream bug report, reproducer test file, and gdb session transcript
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> >>>
> >>> I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
> >>
> >> To properly track this, let me add this report as well to the tracking
> >> (I already track another report not mentioned in the commit log of the
> >> proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )
> > 
> > There were several iterations of the fix and the final version is
> > actually 11 patches (below), and it does not apply cleanly to current
> > master because of other cleanups.
> > 
> > Given that it's fixing a corruption it should be merged and backported
> > (at least to 6.1), but we may need to rework it again and minimize/drop
> > the cleanups.
> > 
> > a8e793f97686 btrfs: add function to create and return an ordered extent
> > b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
> > c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
> > d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
> > 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
> > 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
> > 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
> > 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
> > 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
> > df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
> > 87606cb305ca btrfs: split partial dio bios before submit
> 
> David, many thx for the update; and thx Boris for all your work on this.
> 
> I kept a loose eye on this and noticed that fixing this turned out to be
> quite complicated and thus required quite a bit of time. Well, that's a
> bit unfortunate, but how it is sometimes, so nothing to worry about.
> Makes me wonder if "revert the culprit temporarily, to get this fixed
> quickly" was really properly evaluated in this case (if it was, sorry, I
> might have missed it or forgotten).

I haven't evaluated a revert before; the patch being fixed actually
fixes a deadlock, and an fstests case covers it. A clean revert works on
5.16 but not on the most recent stable version (6.1).
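As an aside (purely illustrative, not part of the thread): the quickest way to
evaluate whether a culprit commit still reverts cleanly on a branch is a
`git revert --no-commit` dry run. The throwaway repository below is an
assumption for demonstration only, standing in for a real stable tree such as
linux-6.1.y and its culprit commit:

```shell
# Illustrative sketch: the temp repo, file name, and commits are hypothetical
# stand-ins for a stable branch and the commit being considered for revert.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=a@b -c user.name=t commit -q --allow-empty -m base
echo v1 > "$repo/fix.c"
git -C "$repo" add fix.c
git -C "$repo" -c user.email=a@b -c user.name=t commit -q -m culprit
culprit=$(git -C "$repo" rev-parse HEAD)

# Dry-run the revert; --no-commit stages the result without creating a commit,
# so conflicts surface here without touching history.
if git -C "$repo" revert --no-commit "$culprit" >/dev/null 2>&1; then
    status="clean revert"
else
    status="conflicts"
fi
git -C "$repo" reset -q --hard   # undo the dry run either way
echo "$status"
```

Against a real stable tree the same check would be run with the actual culprit
SHA; conflicts there are exactly what makes a simple revert impractical.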

I have a backport on top of 6.2, so once the patches land in Linus' tree
they can be sent for stable backport, but I don't have an ETA. It's hard
to plan things when we're this close to release; one possibility is to
send the patches now and see.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
  2023-04-11 19:27       ` David Sterba
@ 2023-04-12  9:57         ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 0 replies; 26+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-12  9:57 UTC (permalink / raw)
  To: dsterba, Linux regressions mailing list; +Cc: Btrfs BTRFS, boris



On 11.04.23 21:27, David Sterba wrote:
> On Fri, Apr 07, 2023 at 08:10:24AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 06.04.23 17:47, David Sterba wrote:
>>> On Wed, Apr 05, 2023 at 03:07:52PM +0200, Linux regression tracking #adding (Thorsten Leemhuis) wrote:
>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel
>>>> regressions; the text you find below is based on a few templates
>>>> paragraphs you might have encountered already in similar form.
>>>> See link in footer if these mails annoy you.]
>>>>
>>>> On 15.02.23 21:04, Chris Murphy wrote:
>>>>> Downstream bug report, reproducer test file, and gdb session transcript
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2169947
>>>>>
>>>>> I speculated that maybe it's similar to the issue we have with VM's when O_DIRECT is used, but it seems that's not the case here.
>>>>
>>>> To properly track this, let me add this report as well to the tracking
>>>> (I already track another report not mentioned in the commit log of the
>>>> proposed fix: https://bugzilla.kernel.org/show_bug.cgi?id=217042 )
>>>
>>> There were several iterations of the fix and the final version is
>>> actually 11 patches (below), and it does not apply cleanly to current
>>> master because of other cleanups.
>>>
>>> Given that it's fixing a corruption it should be merged and backported
>>> (at least to 6.1), but we may need to rework it again and minimize/drop
>>> the cleanups.
>>>
>>> a8e793f97686 btrfs: add function to create and return an ordered extent
>>> b85d0977f5be btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
>>> c5e9a883e7c8 btrfs: stash ordered extent in dio_data during iomap dio
>>> d90af6fe39e6 btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
>>> 8d4f5839fe88 btrfs: simplify splitting logic in btrfs_extract_ordered_extent
>>> 880c3efad384 btrfs: sink parameter len to btrfs_split_ordered_extent
>>> 812f614a7ad7 btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
>>> 1334edcf5fa2 btrfs: simplify extent map splitting and rename split_zoned_em
>>> 3e99488588fa btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
>>> df701737e7a6 btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
>>> 87606cb305ca btrfs: split partial dio bios before submit
>>
>> David, many thx for the update; and thx Boris for all your work on this.
>>
>> I kept a loose eye on this and noticed that fixing this turned out to be
>> quite complicated and thus required quite a bit of time. Well, that's a
>> bit unfortunate, but how it is sometimes, so nothing to worry about.
>> Makes me wonder if "revert the culprit temporarily, to get this fixed
>> quickly" was really properly evaluated in this case (if it was, sorry, I
>> might have missed it or forgotten).
> 
> I haven't evaluated a revert before, [...]

No worries, I just try to make that option something that developers
consider and use more often when dealing with regressions -- especially
when the problem, like in this case, made it to stable trees (either
through a backport or a new mainline release).

> I have a backport on top of 6.2, so once the patches land in Linus' tree
> they can be sent for stable backport, but I don't have an ETA. It's hard
> to plan things when we're that close to release, one possibility is to
> send the patches now and see.

Thx for your work. And don't feel rushed due to my tracking. There are
risks involved in this, and experts like you are best placed to judge
the right time to mainline and backport fixes.

Ciao, Thorsten

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-04-12  9:57 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-15 20:04 LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 Chris Murphy
2023-02-15 20:16 ` Chris Murphy
2023-02-15 21:41   ` Filipe Manana
2023-02-15 23:21   ` Boris Burkov
2023-02-16  0:34     ` Boris Burkov
2023-02-16  1:46       ` Boris Burkov
2023-02-16  5:58         ` Christoph Hellwig
2023-02-16  9:30           ` Christoph Hellwig
2023-02-16 11:57       ` Filipe Manana
2023-02-16 17:14         ` Boris Burkov
2023-02-16 18:00           ` Filipe Manana
2023-02-16 18:49             ` Christoph Hellwig
2023-02-16 21:43               ` Filipe Manana
2023-02-16 22:45                 ` Boris Burkov
2023-02-17 11:19                   ` Filipe Manana
2023-02-16 10:05     ` Qu Wenruo
2023-02-16 12:01       ` Filipe Manana
2023-02-17  0:15         ` Qu Wenruo
2023-02-17 11:38           ` Filipe Manana
2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-04-06 15:47   ` David Sterba
2023-04-06 22:40     ` Neal Gompa
2023-04-07  6:10     ` Linux regression tracking (Thorsten Leemhuis)
2023-04-08  0:08       ` Boris Burkov
2023-04-11 19:27       ` David Sterba
2023-04-12  9:57         ` Linux regression tracking (Thorsten Leemhuis)

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox