linux-kernel.vger.kernel.org archive mirror
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>,
	Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>,
	Josef Bacik <josef@toxicpanda.com>,
	Filipe Manana <fdmanana@suse.com>,
	David Sterba <dsterba@suse.com>
Subject: [PATCH 5.4 19/66] Btrfs: fix race between using extent maps and merging them
Date: Tue, 18 Feb 2020 20:54:46 +0100
Message-ID: <20200218190429.869894993@linuxfoundation.org>
In-Reply-To: <20200218190428.035153861@linuxfoundation.org>

From: Filipe Manana <fdmanana@suse.com>

commit ac05ca913e9f3871126d61da275bfe8516ff01ca upstream.

We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as a consequence see an inconsistent state of
the extent map - for example, it sees the new length but has already seen
the old start offset.

With luck this triggers a BUG_ON(), such as the following one in
__do_readpage(), and not some silent bug:

  $ cat -n fs/btrfs/extent_io.c
  3061  static int __do_readpage(struct extent_io_tree *tree,
  3062                           struct page *page,
  (...)
  3127                  em = __get_extent_map(inode, page, pg_offset, cur,
  3128                                        end - cur + 1, get_extent, em_cached);
  3129                  if (IS_ERR_OR_NULL(em)) {
  3130                          SetPageError(page);
  3131                          unlock_extent(tree, cur, end);
  3132                          break;
  3133                  }
  3134                  extent_offset = cur - em->start;
  3135                  BUG_ON(extent_map_end(em) <= cur);
  (...)

Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().

We have an inode with a size of 8KiB and 2 extent maps:

  extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
            a previous transaction

  extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
            persisted but writeback started for it already. The extent map
            is pinned since there's writeback and an ordered extent in
            progress, so it cannot be merged with extent map A yet
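
The conditions that keep extent maps A and B apart can be sketched in
user-space C as below. This is only an illustration under stated
assumptions: struct em_state_sketch and can_merge_sketch() are hypothetical
names, the flags are reduced to two booleans, and the real check in
fs/btrfs/extent_map.c looks at more state than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of an extent map: not the kernel structure. */
struct em_state_sketch {
	uint64_t start;        /* file offset */
	uint64_t len;          /* length in bytes */
	uint64_t disk_bytenr;  /* location on disk */
	bool pinned;           /* writeback / ordered extent still in progress */
	bool on_modified_list; /* still in the list of modified extents */
};

/*
 * Returns true if 'next' could be merged into 'prev': neither map is pinned
 * or on the list of modified extents, and the two maps are contiguous both
 * in the file and on disk.
 */
bool can_merge_sketch(const struct em_state_sketch *prev,
		      const struct em_state_sketch *next)
{
	assert(prev != NULL && next != NULL);

	if (prev->pinned || next->pinned)
		return false;
	if (prev->on_modified_list || next->on_modified_list)
		return false;

	return prev->start + prev->len == next->start &&
	       prev->disk_bytenr + prev->len == next->disk_bytenr;
}
```

With extent A at file offset 0 and extent B at file offset 4KiB, both 4KiB
long and adjacent on disk, the sketch refuses the merge while B is pinned or
on the list of modified extents, and allows it once both conditions are
gone.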

The following sequence of steps leads to the BUG_ON():

1) The ordered extent for extent B completes, the respective page gets its
   writeback bit cleared and the extent map is unpinned. At that point it
   is not yet merged with extent map A because it's in the list of modified
   extents;

2) Due to memory pressure, or some other reason, the MM subsystem releases
   the page corresponding to extent B - btrfs_releasepage() is called and
   returns 1, meaning the page can be released as it's not dirty, not under
   writeback anymore and the extent range is not locked in the inode's
   iotree. However the extent map is not released, either because we are
   not in a context that allows memory allocations to block or because the
   inode's size is smaller than 16MiB - in this case our inode has a size
   of 8KiB;

3) Task B needs to read extent B and ends up in __do_readpage() through the
   btrfs_readpage() callback. At __do_readpage() it gets a reference to
   extent map B;

4) Task A, doing a fast fsync, calls clear_em_logging() against extent map B
   while holding the write lock on the inode's extent map tree - this
   results in try_merge_map() being called and since it's possible to merge
   extent map B with extent map A now (the extent map B was removed from
   the list of modified extents), the merging begins - it sets extent map
   B's start offset to 0 (was 4KiB), but before it increments the map's
   length to 8KiB (4KiB + 4KiB), task B is at:

   BUG_ON(extent_map_end(em) <= cur);

   The call to extent_map_end() sees the extent map has a start of 0
   and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
   the BUG_ON() is triggered.
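
The arithmetic of step 4 can be replayed deterministically in user-space C.
The following is a minimal sketch, not kernel code: struct em_sketch and all
function names are made up for illustration, only the start and length
fields are modelled, and the two halves of the merge are run one after the
other in a single thread with the reader's check placed in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K 4096ULL

/* Hypothetical stand-in for an extent map: only start and length. */
struct em_sketch {
	uint64_t start;
	uint64_t len;
};

static uint64_t em_sketch_end(const struct em_sketch *em)
{
	return em->start + em->len;
}

/* First half of the merge: extent map B takes over A's start offset. */
static void merge_step_start(struct em_sketch *em, const struct em_sketch *prev)
{
	em->start = prev->start;
}

/* Second half of the merge: extent map B grows by A's length. */
static void merge_step_len(struct em_sketch *em, const struct em_sketch *prev)
{
	em->len += prev->len;
}

/* The condition tested by the BUG_ON() in __do_readpage(). */
static bool reader_would_bug(const struct em_sketch *em, uint64_t cur)
{
	return em_sketch_end(em) <= cur;
}

/*
 * Replays the interleaving of step 4 and returns true if the reader's
 * check fires between the two halves of the merge.
 */
bool replay_race(void)
{
	struct em_sketch a = { .start = 0, .len = SZ_4K };
	struct em_sketch b = { .start = SZ_4K, .len = SZ_4K };
	uint64_t cur = SZ_4K; /* the reader wants file offset 4KiB */
	bool fired;

	assert(!reader_would_bug(&b, cur)); /* consistent before the merge */
	merge_step_start(&b, &a);
	fired = reader_would_bug(&b, cur);  /* start is 0, length still 4KiB */
	merge_step_len(&b, &a);
	assert(!reader_would_bug(&b, cur)); /* consistent again afterwards */

	return fired;
}
```

The map is consistent before and after the merge. Only a reader that looks
at it between the two updates computes an end offset of 4KiB for a 'cur' of
4KiB.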

So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still be using it,
and needs to see a consistent map while using it. Generally this is very
rare since most paths that look up and use extent maps also have the file
range locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it, to avoid serialization with concurrent
reads.

Fix this by not allowing an extent map to be merged if it's being used
by tasks other than the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).
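
The effect of the reference count check can be sketched in user-space C as
follows. This is an illustration only, under these assumptions: C11 atomics
stand in for the kernel's refcount_t, struct em_ref_sketch and both
functions are hypothetical names, and the extent map tree and its lock are
not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent map with a reference count. */
struct em_ref_sketch {
	atomic_uint refs;
	uint64_t start;
	uint64_t len;
};

void em_ref_sketch_init(struct em_ref_sketch *em, unsigned int refs,
			uint64_t start, uint64_t len)
{
	assert(refs > 0);
	atomic_init(&em->refs, refs);
	em->start = start;
	em->len = len;
}

/*
 * Tries to merge 'prev' into 'em'. Returns true if the merge was done and
 * false if it was skipped.
 */
bool try_merge_sketch(struct em_ref_sketch *em, const struct em_ref_sketch *prev)
{
	/*
	 * One reference belongs to the tree and one to the caller. Anything
	 * above 2 means some other task may be reading 'em' right now, so
	 * leave it untouched.
	 */
	if (atomic_load(&em->refs) > 2)
		return false;

	/* Only maps that are contiguous in the file can be merged. */
	if (prev->start + prev->len != em->start)
		return false;

	em->start = prev->start;
	em->len += prev->len;
	return true;
}
```

In this sketch a skipped merge is simply not done, which leaves two adjacent
extent maps where one would suffice. A reader holding a third reference
never sees the start and length change underneath it.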

Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/btrfs/extent_map.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -233,6 +233,17 @@ static void try_merge_map(struct extent_
 	struct extent_map *merge = NULL;
 	struct rb_node *rb;
 
+	/*
+	 * We can't modify an extent map that is in the tree and that is being
+	 * used by another task, as it can cause that other task to see it in
+	 * inconsistent state during the merging. We always have 1 reference for
+	 * the tree and 1 for this task (which is unpinning the extent map or
+	 * clearing the logging flag), so anything > 2 means it's being used by
+	 * other tasks too.
+	 */
+	if (refcount_read(&em->refs) > 2)
+		return;
+
 	if (em->start != 0) {
 		rb = rb_prev(&em->rb_node);
 		if (rb)


