Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Omar Sandoval <osandov@osandov.com>
To: Liu Bo <bo.li.liu@oracle.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()
Date: Wed, 16 Nov 2016 16:32:28 -0800	[thread overview]
Message-ID: <20161117003228.GA15054@vader> (raw)
In-Reply-To: <20161110224536.GA8597@vader.DHCP.thefacebook.com>

On Thu, Nov 10, 2016 at 02:45:36PM -0800, Omar Sandoval wrote:
> On Thu, Nov 10, 2016 at 02:38:14PM -0800, Liu Bo wrote:
> > On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote:
> > > On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote:
> > > > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote:
> > > > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote:
> > > > > > From: Omar Sandoval <osandov@fb.com>
> > > > > > 
> > > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to
> > > > > > errors coming from the qcow2 virtual drive in the host system. The qcow2
> > > > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
> > > > > > Every once in awhile, pread() or pwrite() would return EEXIST, which
> > > > > > makes no sense. This turned out to be a bug in btrfs_get_extent().
> > > > > > 
> > > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map
> > > > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
> > > > > > two threads race on adding the same extent map to an inode's extent map
> > > > > > tree. However, if the added em is merged with an adjacent em in the
> > > > > > extent tree, then we'll end up with an existing extent that is not
> > > > > > identical to but instead encompasses the extent we tried to add. When we
> > > > > > call merge_extent_mapping() to find the nonoverlapping part of the new
> > > > > > em, the arithmetic overflows because there is no such thing. We then end
> > > > > > up trying to add a bogus em to the em_tree, which results in a EEXIST
> > > > > > that can bubble all the way up to userspace.
> > > > > 
> > > > > I don't get how this could happen(even after reading Commit
> > > > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by
> > > > > lock_extent_direct, the assumption is that a racy thread should be
> > > > > blocked by lock_extent_direct and when it gets the lock, it finds the
> > > > > just-inserted em when going into btrfs_get_extent if its offset is
> > > > > within [em->start, extent_map_end(em)].
> > > > > 
> > > > > I think we may also need to figure out why the above doesn't work as
> > > > > expected besides fixing another special case.
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > -liubo
> > > > 
> > > > lock_extent_direct() only protects the range you're doing I/O into, not
> > > > the entire extent. If two threads are doing two non-overlapping reads in
> > > > the same extent, then you can get this race.
> > > 
> > > More concretely, assume the extent tree on disk has:
> > > 
> > > +-------------------------+-------------------------------+
> > > |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192|
> > > +-------------------------+-------------------------------+
> > > 
> > > And the extent map tree in memory has a single em cached for the second
> > > extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do
> > > direct I/O reads:
> > > 
> > > Thread 1                               | Thread 2
> > > ---------------------------------------+-------------------------------
> > > pread(offset=0, nbyte=4096)            | pread(offset=4096, nbyte=4096)
> > > lock_extent_direct(start=0, end=4095)  | lock_extent_direct(start=4096, end=8191)
> > > btrfs_get_extent(start=0, len=4096)    | btrfs_get_extent(start=4096, len4096)
> > >   lookup_extent_mapping() = NULL       |   lookup_extent_mapping() = NULL
> > >   reads extent from B-tree             |   reads extent from B-tree
> > >                                        |   write_lock(&em_tree->lock)
> > > 				       |   add_extent_mapping(start=0, len=8192, bytenr=0)
> > > 				       |     try_merge_map()
> > > 				       |     em_tree now has {start=0, len=16384, bytenr=0}
> > > 				       |   write_unlock(&em_tree->lock)
> > > write_lock(&em_tree->lock)             |
> > > add_extent_mapping(start=0, len=8192,  |
> > >                    bytenr=0) = -EEXIST |
> > > search_extent_mapping() = {start=0,    |
> > >                            len=16384,  |
> > > 			   bytenr=0}   |
> > > merge_extent_mapping() does bogus math |
> > > and overflows, returns EEXIST          |
> > 
> > Yeah, so much fun.
> > 
> > The problem is that we lock and request [0, 4096], but we insert a em of
> > [0, 8192] instead.  So if we insert a [0, 4096] em, then we can make
> > sure that the em returned by btrfs_get_extent is protected from race by
> > the range of lock_extent_direct.
> > 
> > I'll give it a shot and do some testing.
> > 
> > For this patch,
> > 
> > Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
> 
> Thank you!
> 
> > Would you please make a reproducer for fstests?
> 
> Sure. Trying to trigger this with xfs_io never works because it's such a
> narrow race window, but I'll send in my reproducer and see what the
> xfstests guys think about adding new binaries.
> 

xfstests patch is here: http://marc.info/?l=fstests&m=147934258732500&w=2

-- 
Omar

     prev parent reply	other threads:[~2016-11-17  0:32 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-09 23:26 [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent() Omar Sandoval
2016-11-10 15:06 ` David Sterba
2016-11-10 15:37   ` Holger Hoffstätte
2016-11-10 15:42   ` Omar Sandoval
2016-11-11  0:36   ` Liu Bo
2016-11-10 15:11 ` Holger Hoffstätte
2016-11-10 15:37   ` Omar Sandoval
2016-11-10 16:01     ` Holger Hoffstätte
2016-11-10 16:20       ` Omar Sandoval
2016-11-10 16:31         ` Holger Hoffstätte
2016-11-10 20:01 ` Liu Bo
2016-11-10 20:09   ` Omar Sandoval
2016-11-10 20:24     ` Omar Sandoval
2016-11-10 22:38       ` Liu Bo
2016-11-10 22:45         ` Omar Sandoval
2016-11-17  0:32           ` Omar Sandoval [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161117003228.GA15054@vader \
    --to=osandov@osandov.com \
    --cc=bo.li.liu@oracle.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.