Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Boris Burkov <boris@bur.io>
To: Chris Murphy <chris@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4
Date: Wed, 15 Feb 2023 16:34:13 -0800	[thread overview]
Message-ID: <Y+16BVPQiwf8e1J3@zen> (raw)
In-Reply-To: <Y+1pAoetotjEuef7@zen>

On Wed, Feb 15, 2023 at 03:21:38PM -0800, Boris Burkov wrote:
> On Wed, Feb 15, 2023 at 03:16:39PM -0500, Chris Murphy wrote:
> > 
> > 
> > On Wed, Feb 15, 2023, at 3:04 PM, Chris Murphy wrote:
> > > Downstream bug report, reproducer test file, and gdb session transcript
> > > https://bugzilla.redhat.com/show_bug.cgi?id=2169947
> > >
> > > I speculated that maybe it's similar to the issue we have with VM's 
> > > when O_DIRECT is used, but it seems that's not the case here.
> > 
> > I can reproduce the mismatching checksums whether the test files are datacow or nodatacow (using chattr +C). There are no kernel messages during the tests.
> > 
> > kernel 6.2rc7 in my case; and in the bug report kernel series 6.1, 6.0, and 5.17 reproduce the problem.
> > 
> 
> I was also able to reproduce on the current misc-next. However, when I
> hacked the kernel to always fall back from DIO to buffered IO, it no
> longer reproduced. So that's one hint..
> 
> The next observation comes from comparing the happy vs unhappy file
> extents on disk:
> happy: https://pastebin.com/k4EPFKhc
> unhappy: https://pastebin.com/hNSBR0yv
> 
> The broken DIO case is missing file extents between bytes 8192 and 65536
> which corresponds to the observed zeros.
> 
> Next, at Josef's suggestion, I looked at the IOMAP_DIO_PARTIAL and
> instrumented that codepath. I observed a single successful write to 8192
> bytes, then a second write which first does a partial write from 8192 to
> 65536 and then faults in the rest of the iov_iter and finishes the
> write.
> 
> I'm now trying to figure out how these partial writes might lead us to
> not create all the EXTENT_DATA items for the file extents.

I believe the issue is indeed caused by faults reading the mapped region
during direct io. Roughly what is happening is:

- we start the dio write (offset 8192 len 1826816)
- __iomap_dio_rw calls iomap_iter which calls btrfs_dio_iomap_begin which
  creates an ordered extent for the full write.
- iomap_dio_iter hits a page fault in bio_iov_iter_get_pages after 57344
  bytes and breaks out early, but submits the partial bio.
- the partial bio completes and calls the various endio callbacks,
  resulting in a call to btrfs_mark_ordered_io_finished.
- btrfs_mark_ordered_io_finished looks up the ordered extent and finds
  the full ordered extent, but the write that finished is partial, so
  the check for entry->bytes_left fails, and we don't call
  finish_ordered_fn and thus don't create a file extent item for this
  endio.
- the IOMAP_DIO_PARTIAL logic results in us retrying starting from 65536
  (8192 + 57344) but we fully exit and re-enter __iomap_dio_rw, which
  creates a new ordered extent for off 65536 len 1769472 and that
  ordered extent proceeds as above but successfully, and we get the
  second file extent.

I'm not yet sure how to fix this, but have a couple ideas/questions:
1. Is there anyway we can split off a partial ordered extent and finish
it when we get the partial write done?
2. Can we detect that there is an unfinished ordered extent that
overlaps with our new one on the second write of the partial write
logic?

I'll play around and see if I can hack together a fix..

> 
> Boris
> 
> > 
> > -- 
> > Chris Murphy

next prev parent reply	other threads:[~2023-02-16  0:34 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-15 20:04 LMDB mdb_copy produces a corrupt database on btrfs, but not on ext4 Chris Murphy
2023-02-15 20:16 ` Chris Murphy
2023-02-15 21:41   ` Filipe Manana
2023-02-15 23:21   ` Boris Burkov
2023-02-16  0:34     ` Boris Burkov [this message]
2023-02-16  1:46       ` Boris Burkov
2023-02-16  5:58         ` Christoph Hellwig
2023-02-16  9:30           ` Christoph Hellwig
2023-02-16 11:57       ` Filipe Manana
2023-02-16 17:14         ` Boris Burkov
2023-02-16 18:00           ` Filipe Manana
2023-02-16 18:49             ` Christoph Hellwig
2023-02-16 21:43               ` Filipe Manana
2023-02-16 22:45                 ` Boris Burkov
2023-02-17 11:19                   ` Filipe Manana
2023-02-16 10:05     ` Qu Wenruo
2023-02-16 12:01       ` Filipe Manana
2023-02-17  0:15         ` Qu Wenruo
2023-02-17 11:38           ` Filipe Manana
2023-04-05 13:07 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-04-06 15:47   ` David Sterba
2023-04-06 22:40     ` Neal Gompa
2023-04-07  6:10     ` Linux regression tracking (Thorsten Leemhuis)
2023-04-08  0:08       ` Boris Burkov
2023-04-11 19:27       ` David Sterba
2023-04-12  9:57         ` Linux regression tracking (Thorsten Leemhuis)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y+16BVPQiwf8e1J3@zen \
    --to=boris@bur.io \
    --cc=chris@colorremedies.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox