All of lore.kernel.org
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: Chris Murphy <chris@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
Date: Wed, 15 Feb 2023 17:46:48 -0800	[thread overview]
Message-ID: <Y+2LCFrD4Qxff89Y@zen> (raw)
In-Reply-To: <Y+16BVPQiwf8e1J3@zen>

On Wed, Feb 15, 2023 at 04:34:13PM -0800, Boris Burkov wrote:
> On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> > On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> > > 
> > > 
> > > On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > > > Downstream bug report, reproducer test file, and gdb session transcript
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > > >
> > > > I speculated that maybe it's similar to the issue we have with VM's 
> > > > when O_DIRECT is used, but it seems that's not the case here.
> > > 
> > > I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> > > 
> > > kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> > > 
> > 
> > I was also able to reproduce on the current misc-next. However, when I
> > hacked the kernel to always fall back from DIO to buffered IO, it no
> > longer reproduced. So that's one hint..
> > 
> > The next observation comes from comparing the happy vs unhappy file
> > extents on disk:
> > happy: https://pastebin.com/k4EPFKhc
> > unhappy: https://pastebin.com/hNSBR0yv
> > 
> > The broken DIO case is missing file extents between bytes 8192 and 65536
> > which corresponds to the observed zeros.
> > 
> > Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> > instrumented that codepath. I observed a single successful write to 8192
> > bytes, then a second write which first does a partial write from 8192 to
> > 65536 and then faults in the rest of the iov_iter and finishes the
> > write.
> > 
> > I'm now trying to figure out how these partial writes might lead us to
> > not create all the EXTENT_DATA items for the file extents.
> 
> I believe the issue is indeed caused by faults reading the mapped region
> during direct io. Roughly what is happening is:
> 
> - we start the dio write (offset 8192 len 1826816)
> - __iomap_dio_rw calls iomap_iter which calls btrfs_dio_iomap_begin which
>   creates an ordered extent for the full write.
> - iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
>   bytes and breaks out early, but submits the partial bio.
> - the partial bio completes and calls the various endio callbacks,
>   resulting in a call to btrfs_mark_ordered_io_finished.
> - btrfs_mark_ordered_io_finished looks up the ordered extent and finds
>   the full ordered extent, but the write that finished is partial, so
>   the check for entry->bytes_left fails, and we don't call
>   finish_ordered_fn and thus don't create a file extent item for this
>   endio.
> - the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
>   (8192 + 57344) but we fully exit and re-enter __iomap_dio_rw, which
>   creates a new ordered extent for off 65536 len 1769472 and that
>   ordered extent proceeds as above but successfully, and we get the
>   second file extent.
> 
> I'm not yet sure how to fix this, but have a couple ideas/questions:
> 1. Is there anyway we can split off a partial ordered extent and finish
> it when we get the partial write done?
> 2. Can we detect that there is an unfinished ordered extent that
> overlaps with our new one on the second write of the partial write
> logic?
> 
> I'll play around and see if I can hack together a fix..

The following patch causes the problem to stop reproducing by splitting
the large ordered extent in the case of a short write and leaving it
alone otherwise. I haven't thoroughly tested it, or even thought it
through that well yet (e.g. I have no clue where that extract function
comes from!), but it's a start. I have to sign off for the evening, so
I will leave my investigation here for now.

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index d8b90f95b157..016b1a77af71 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -647,6 +647,11 @@ static bool btrfs_submit_chunk(struct bio *bio, int mirror_num)
        }

        if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
+
+               ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
+               if (ret)
+                       goto fail_put_bio;
+
                if (use_append) {
                        bio->bi_opf &= ~REQ_OP_WRITE;
                        bio->bi_opf |= REQ_OP_ZONE_APPEND;

> 
> > 
> > Boris
> > 
> > > 
> > > -- 
> > > Chris Murphy

  reply	other threads:[~2023-02-16  1:47 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-15 20:04 LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 Chris Murphy
2023-02-15 20:16 ` Chris Murphy
2023-02-15 21:41   ` Filipe Manana
2023-02-15 23:21   ` Boris Burkov
2023-02-16  0:34     ` Boris Burkov
2023-02-16  1:46       ` Boris Burkov [this message]
2023-02-16  5:58         ` Christoph Hellwig
2023-02-16  9:30           ` Christoph Hellwig
2023-02-16 11:57       ` Filipe Manana
2023-02-16 17:14         ` Boris Burkov
2023-02-16 18:00           ` Filipe Manana
2023-02-16 18:49             ` Christoph Hellwig
2023-02-16 21:43               ` Filipe Manana
2023-02-16 22:45                 ` Boris Burkov
2023-02-17 11:19                   ` Filipe Manana
2023-02-16 10:05     ` Qu Wenruo
2023-02-16 12:01       ` Filipe Manana
2023-02-17  0:15         ` Qu Wenruo
2023-02-17 11:38           ` Filipe Manana
2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-04-06 15:47   ` David Sterba
2023-04-06 22:40     ` Neal Gompa
2023-04-07  6:10     ` Linux regression tracking (Thorsten Leemhuis)
2023-04-08  0:08       ` Boris Burkov
2023-04-11 19:27       ` David Sterba
2023-04-12  9:57         ` Linux regression tracking (Thorsten Leemhuis)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y+2LCFrD4Qxff89Y@zen \
    --to=boris@bur.io \
    --cc=chris@colorremedies.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.