public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: David Sterba <dsterba@suse.cz>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH] Btrfs: fix race between using extent maps and merging them
Date: Thu, 6 Feb 2020 18:21:59 +0100	[thread overview]
Message-ID: <20200206172159.GC2654@twin.jikos.cz> (raw)
In-Reply-To: <20200131140607.26923-1-fdmanana@kernel.org>

On Fri, Jan 31, 2020 at 02:06:07PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We have a few cases where we allow an extent map that is in an extent map
> tree to be merged with other extents in the tree. Such cases include the
> unpinning of an extent after the respective ordered extent completed or
> after logging an extent during a fast fsync. This can lead to subtle and
> dangerous problems because when doing the merge some other task might be
> using the same extent map and as consequence see an inconsistent state of
> the extent map - for example sees the new length but has seen the old start
> offset.
> 
> With luck this triggers a BUG_ON(), and not some silent bug, such as the
> following one in __do_readpage():
> 
>   $ cat -n fs/btrfs/extent_io.c
>   3061  static int __do_readpage(struct extent_io_tree *tree,
>   3062                           struct page *page,
>   (...)
>   3127                  em = __get_extent_map(inode, page, pg_offset, cur,
>   3128                                        end - cur + 1, get_extent, em_cached);
>   3129                  if (IS_ERR_OR_NULL(em)) {
>   3130                          SetPageError(page);
>   3131                          unlock_extent(tree, cur, end);
>   3132                          break;
>   3133                  }
>   3134                  extent_offset = cur - em->start;
>   3135                  BUG_ON(extent_map_end(em) <= cur);
>   (...)
> 
> Consider the following example scenario, where we end up hitting the
> BUG_ON() in __do_readpage().
> 
> We have an inode with a size of 8Kb and 2 extent maps:
> 
>   extent A: file offset 0, length 4Kb, disk_bytenr = X, persisted on disk by
>             a previous transaction
> 
>   extent B: file offset 4Kb, length 4Kb, disk_bytenr = X + 4Kb, not yet
>             persisted but writeback started for it already. The extent map
> 	    is pinned since there's writeback and an ordered extent in
> 	    progress, so it can not be merged with extent map A yet
> 
> The following sequence of steps leads to the BUG_ON():
> 
> 1) The ordered extent for extent B completes, the respective page gets its
>    writeback bit cleared and the extent map is unpinned, at that point it
>    is not yet merged with extent map A because it's in the list of modified
>    extents;
> 
> 2) Due to memory pressure, or some other reason, the mm subsystem releases
>    the page corresponding to extent B - btrfs_releasepage() is called and
>    returns 1, meaning the page can be released as it's not dirty, not under
>    writeback anymore and the extent range is not locked in the inode's
>    iotree. However the extent map is not released, either because we are
>    not in a context that allows memory allocations to block or because the
>    inode's size is smaller than 16Mb - in this case our inode has a size
>    of 8Kb;
> 
> 3) Task B needs to read extent B and ends up __do_readpage() through the
>    btrfs_readpage() callback. At __do_readpage() it gets a reference to
>    extent map B;
> 
> 4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
>    while holding the write lock on the inode's extent map tree - this
>    results in try_merge_map() being called and since it's possible to merge
>    extent map B with extent map A now (the extent map B was removed from
>    the list of modified extents), the merging begins - it sets extent map
>    B's start offset to 0 (was 4Kb), but before it increments the map's
>    length to 8Kb (4kb + 4Kb), task A is at:
> 
>    BUG_ON(extent_map_end(em) <= cur);
> 
>    The call to extent_map_end() sees the extent map has a start of 0
>    and a length still at 4Kb, so it returns 4Kb and 'cur' is 4Kb, so
>    the BUG_ON() is triggered.
> 
> So it's dangerous to modify an extent map that is in the tree, because some
> other task might have got a reference to it before and still using it, and
> needs to see a consistent map while using it. Generally this is very rare
> since most paths that lookup and use extent maps also have the file range
> locked in the inode's iotree. The fsync path is pretty much the only
> exception where we don't do it to avoid serialization with concurrent
> reads.
> 
> Fix this by not allowing an extent map do be merged if if it's being used
> by tasks other then the one attempting to merge the extent map (when the
> reference count of the extent map is greater than 2).
> 
> Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
> Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
> CC: stable@vger.kernel.org # 4.4+
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Added to misc-next, thanks.

      parent reply	other threads:[~2020-02-06 17:22 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-31 14:06 [PATCH] Btrfs: fix race between using extent maps and merging them fdmanana
2020-01-31 14:46 ` Josef Bacik
2020-02-05 14:45 ` Sasha Levin
2020-02-06 17:21 ` David Sterba [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200206172159.GC2654@twin.jikos.cz \
    --to=dsterba@suse.cz \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox