From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pg0-f52.google.com ([74.125.83.52]:35975 "EHLO mail-pg0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751652AbcKQAca (ORCPT ); Wed, 16 Nov 2016 19:32:30 -0500 Received: by mail-pg0-f52.google.com with SMTP id f188so84483998pgc.3 for ; Wed, 16 Nov 2016 16:32:30 -0800 (PST) Date: Wed, 16 Nov 2016 16:32:28 -0800 From: Omar Sandoval To: Liu Bo Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent() Message-ID: <20161117003228.GA15054@vader> References: <262a1e171d091626edbd23c637cb138ba9d84ed8.1478733376.git.osandov@fb.com> <20161110195718.GA22740@localhost.localdomain> <20161110200906.GA22359@vader.DHCP.thefacebook.com> <20161110202413.GB22359@vader.DHCP.thefacebook.com> <20161110223813.GB22740@localhost.localdomain> <20161110224536.GA8597@vader.DHCP.thefacebook.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <20161110224536.GA8597@vader.DHCP.thefacebook.com> Sender: linux-btrfs-owner@vger.kernel.org List-ID: On Thu, Nov 10, 2016 at 02:45:36PM -0800, Omar Sandoval wrote: > On Thu, Nov 10, 2016 at 02:38:14PM -0800, Liu Bo wrote: > > On Thu, Nov 10, 2016 at 12:24:13PM -0800, Omar Sandoval wrote: > > > On Thu, Nov 10, 2016 at 12:09:06PM -0800, Omar Sandoval wrote: > > > > On Thu, Nov 10, 2016 at 12:01:20PM -0800, Liu Bo wrote: > > > > > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: > > > > > > From: Omar Sandoval > > > > > > > > > > > > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > > > > > > errors coming from the qcow2 virtual drive in the host system. The qcow2 > > > > > > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > > > > > > Every once in awhile, pread() or pwrite() would return EEXIST, which > > > > > > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > > > > > > > > > > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > > > > > > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > > > > > > two threads race on adding the same extent map to an inode's extent map > > > > > > tree. However, if the added em is merged with an adjacent em in the > > > > > > extent tree, then we'll end up with an existing extent that is not > > > > > > identical to but instead encompasses the extent we tried to add. When we > > > > > > call merge_extent_mapping() to find the nonoverlapping part of the new > > > > > > em, the arithmetic overflows because there is no such thing. We then end > > > > > > up trying to add a bogus em to the em_tree, which results in a EEXIST > > > > > > that can bubble all the way up to userspace. > > > > > > > > > > I don't get how this could happen(even after reading Commit > > > > > 8dff9c853410), btrfs_get_extent in direct_IO is protected by > > > > > lock_extent_direct, the assumption is that a racy thread should be > > > > > blocked by lock_extent_direct and when it gets the lock, it finds the > > > > > just-inserted em when going into btrfs_get_extent if its offset is > > > > > within [em->start, extent_map_end(em)]. > > > > > > > > > > I think we may also need to figure out why the above doesn't work as > > > > > expected besides fixing another special case. > > > > > > > > > > Thanks, > > > > > > > > > > -liubo > > > > > > > > lock_extent_direct() only protects the range you're doing I/O into, not > > > > the entire extent. If two threads are doing two non-overlapping reads in > > > > the same extent, then you can get this race. > > > > > > More concretely, assume the extent tree on disk has: > > > > > > +-------------------------+-------------------------------+ > > > |start=0,len=8192,bytenr=0|start=8192,len=8192,bytenr=8192| > > > +-------------------------+-------------------------------+ > > > > > > And the extent map tree in memory has a single em cached for the second > > > extent {start=8192, len=8192, bytenr=8192}. Then, two threads try do do > > > direct I/O reads: > > > > > > Thread 1 | Thread 2 > > > ---------------------------------------+------------------------------- > > > pread(offset=0, nbyte=4096) | pread(offset=4096, nbyte=4096) > > > lock_extent_direct(start=0, end=4095) | lock_extent_direct(start=4096, end=8191) > > > btrfs_get_extent(start=0, len=4096) | btrfs_get_extent(start=4096, len4096) > > > lookup_extent_mapping() = NULL | lookup_extent_mapping() = NULL > > > reads extent from B-tree | reads extent from B-tree > > > | write_lock(&em_tree->lock) > > > | add_extent_mapping(start=0, len=8192, bytenr=0) > > > | try_merge_map() > > > | em_tree now has {start=0, len=16384, bytenr=0} > > > | write_unlock(&em_tree->lock) > > > write_lock(&em_tree->lock) | > > > add_extent_mapping(start=0, len=8192, | > > > bytenr=0) = -EEXIST | > > > search_extent_mapping() = {start=0, | > > > len=16384, | > > > bytenr=0} | > > > merge_extent_mapping() does bogus math | > > > and overflows, returns EEXIST | > > > > Yeah, so much fun. > > > > The problem is that we lock and request [0, 4096], but we insert a em of > > [0, 8192] instead. So if we insert a [0, 4096] em, then we can make > > sure that the em returned by btrfs_get_extent is protected from race by > > the range of lock_extent_direct. > > > > I'll give it a shot and do some testing. > > > > For this patch, > > > > Reviewed-by: Liu Bo > > Thank you! > > > Would you please make a reproducer for fstests? > > Sure. Trying to trigger this with xfs_io never works because it's such a > narrow race window, but I'll send in my reproducer and see what the > xfstests guys think about adding new binaries. > xfstests patch is here: http://marc.info/?l=fstests&m=147934258732500&w=2 -- Omar