From: Liu Bo <bo.li.liu@oracle.com>
To: Omar Sandoval <osandov@osandov.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()
Date: Thu, 10 Nov 2016 14:38:14 -0800
Message-ID: <20161110223813.GB22740@localhost.localdomain>
In-Reply-To: <20161110202413.GB22359@vader.DHCP.thefacebook.com>

On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote:
> > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote:
> > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > >
> > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > > > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > > > Every once in a while, pread() or pwrite() would return EEXIST, which
> > > > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > > >
> > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > > > two threads race on adding the same extent map to an inode's extent map
> > > > tree. However, if the added em is merged with an adjacent em in the
> > > > extent tree, then we'll end up with an existing extent that is not
> > > > identical to but instead encompasses the extent we tried to add. When we
> > > > call merge_extent_mapping() to find the nonoverlapping part of the new
> > > > em, the arithmetic overflows because there is no such thing. We then end
> > > > up trying to add a bogus em to the em_tree, which results in an EEXIST
> > > > that can bubble all the way up to userspace.
> > >
> > > I don't get how this could happen (even after reading commit
> > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by
> > > lock_extent_direct, the assumption is that a racy thread should be
> > > blocked by lock_extent_direct and when it gets the lock, it finds the
> > > just-inserted em when going into btrfs_get_extent if its offset is
> > > within [em->start, extent_map_end(em)].
> > >
> > > I think we may also need to figure out why the above doesn't work as
> > > expected besides fixing another special case.
> > >
> > > Thanks,
> > >
> > > -liubo
> >
> > lock_extent_direct() only protects the range you're doing I/O into, not
> > the entire extent. If two threads are doing two non-overlapping reads in
> > the same extent, then you can get this race.
>
> More concretely, assume the extent tree on disk has:
>
> +-------------------------+-------------------------------+
> |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192|
> +-------------------------+-------------------------------+
>
> And the extent map tree in memory has a single em cached for the second
> extent {start=8192, len=8192, bytenr=8192}. Then, two threads try to do
> direct I/O reads:
>
> Thread 1                               | Thread 2
> ---------------------------------------+-------------------------------
> pread(offset=0, nbyte=4096)            | pread(offset=4096, nbyte=4096)
> lock_extent_direct(start=0, end=4095)  | lock_extent_direct(start=4096, end=8191)
> btrfs_get_extent(start=0, len=4096)    | btrfs_get_extent(start=4096, len=4096)
>   lookup_extent_mapping() = NULL       |   lookup_extent_mapping() = NULL
>   reads extent from B-tree             |   reads extent from B-tree
>                                        |   write_lock(&em_tree->lock)
>                                        |   add_extent_mapping(start=0, len=8192, bytenr=0)
>                                        |     try_merge_map()
>                                        |     em_tree now has {start=0, len=16384, bytenr=0}
>                                        |   write_unlock(&em_tree->lock)
>   write_lock(&em_tree->lock)           |
>   add_extent_mapping(start=0, len=8192,|
>     bytenr=0) = -EEXIST                |
>   search_extent_mapping() = {start=0,  |
>     len=16384, bytenr=0}               |
>   merge_extent_mapping() does bogus    |
>   math and overflows, returns EEXIST   |
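To make the "bogus math" in the timeline concrete, here is a minimal userspace sketch of the arithmetic. It is not the kernel's merge_extent_mapping(); the struct em_range type and the helpers trim_after_existing() and em_range_encompasses() are hypothetical names invented for this illustration. It only assumes that the trimming step moves the start of the new em to the end of the existing one and computes the length with unsigned 64-bit subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for an extent map: the byte range [start, start + len). */
struct em_range {
	uint64_t start;
	uint64_t len;
};

static uint64_t em_range_end(const struct em_range *r)
{
	return r->start + r->len;
}

/*
 * Hypothetical model of the trimming step: shrink `em` so that it begins
 * where `existing` ends. This only makes sense if some part of `em` lies
 * beyond the end of `existing`.
 */
static struct em_range trim_after_existing(const struct em_range *existing,
					   const struct em_range *em)
{
	uint64_t start = em_range_end(existing);
	uint64_t end = em_range_end(em);
	struct em_range out;

	assert(em->len > 0);
	if (start < em->start)
		start = em->start;
	out.start = start;
	out.len = end - start;	/* wraps modulo 2^64 when start > end */
	return out;
}

/* Returns 1 if `existing` covers all of `em`, so nothing is left to insert. */
static int em_range_encompasses(const struct em_range *existing,
				const struct em_range *em)
{
	return existing->start <= em->start &&
	       em_range_end(existing) >= em_range_end(em);
}
```

With existing = {start=0, len=16384} and em = {start=0, len=8192}, as in the timeline, the trimmed start becomes 16384 while the end stays at 8192, so the unsigned subtraction wraps to a huge length. Checking em_range_encompasses() first is one way to recognize that there is no non-overlapping part to insert.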

Yeah, so much fun.

The problem is that we lock and request [0, 4096], but we insert an em of
[0, 8192] instead. So if we insert a [0, 4096] em, then we can make
sure that the em returned by btrfs_get_extent() is protected from the race
by the range of lock_extent_direct().
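
A rough userspace sketch of that idea, clamping the em to the requested range before inserting it, might look like the following. The struct em_sketch type and the clamp_to_request() helper are hypothetical and not taken from any patch; the block_start adjustment additionally assumes a plain regular extent (no hole, inline, or compressed data).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for an extent map; field names are illustrative. */
struct em_sketch {
	uint64_t start;		/* file offset */
	uint64_t len;		/* length in bytes */
	uint64_t block_start;	/* disk bytenr backing `start` */
};

/*
 * Hypothetical helper: restrict `em` to the requested range
 * [req_start, req_start + req_len), which is assumed to be the range
 * held by the caller's lock. The request must overlap `em`.
 */
static struct em_sketch clamp_to_request(struct em_sketch em,
					 uint64_t req_start, uint64_t req_len)
{
	uint64_t em_end = em.start + em.len;
	uint64_t req_end = req_start + req_len;
	uint64_t start = em.start > req_start ? em.start : req_start;
	uint64_t end = em_end < req_end ? em_end : req_end;
	struct em_sketch out;

	assert(start < end);	/* request and em must overlap */
	out.start = start;
	out.len = end - start;
	/* Assumes a regular extent: the disk offset moves with the file offset. */
	out.block_start = em.block_start + (start - em.start);
	return out;
}
```

For the on-disk extent {start=0, len=8192, bytenr=0} from the example, Thread 1 would insert {start=0, len=4096, bytenr=0} and Thread 2 would insert {start=4096, len=4096, bytenr=4096}. The two ems no longer overlap, so neither insertion would collide with the other.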

I'll give it a shot and do some testing.

For this patch,

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Would you please make a reproducer for fstests?

Thanks,

-liubo